dmz

Posts

Auditing language models for hidden objectives 2025-03-13T19:18:32.638Z

Takeaways from our robust injury classifier project [Redwood Research] 2022-09-17T03:55:25.868Z

High-stakes alignment via adversarial training [Redwood Research report] 2022-05-05T00:59:18.848Z

Some criteria for sandwiching projects 2021-08-12T03:40:37.720Z

DMZ's Shortform 2021-08-09T01:18:32.397Z

Comments

Comment by dmz (DMZ) on Anthropic rewrote its RSP · 2024-10-16T17:29:08.728Z · LW · GW

(I work on the Alignment Stress-Testing team at Anthropic and have been involved in the RSP update and implementation process.)

Re not believing Anthropic's statement:

"we believe the risk of substantial under-elicitation is low"

To be more precise: there was significant under-elicitation but the distance to the thresholds was large enough that the risk of crossing them even with better elicitation was low.

Comment by dmz (DMZ) on jacquesthibs's Shortform · 2024-06-12T16:47:56.808Z · LW · GW

Unfortunate name collision: you're looking at numbers on the AI2 Reasoning Challenge, not Chollet's Abstraction & Reasoning Corpus.

Comment by dmz (DMZ) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-10-09T16:07:01.431Z · LW · GW

That's right. We did some followup experiments doing the head-to-head comparison: the tools seem to speed up the contractors by 2x for the weak adversarial examples they were finding (and anecdotally speed us up a lot more when we use them to find more egregious failures). See https://www.lesswrong.com/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood#Quick_followup_results; an updated arXiv paper with those experiments is appearing on Monday.

Comment by dmz (DMZ) on Takeaways from our robust injury classifier project [Redwood Research] · 2022-09-22T17:00:18.610Z · LW · GW

The main reason is that we think we can learn faster in simpler toy settings for now, so we're doing that first. Implementing all the changes I described (particularly changing the task definition and switching to fine-tuning the generator) would basically mean starting over from scratch anyway.

Comment by dmz (DMZ) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-05-09T17:21:53.586Z · LW · GW

Indeed. (Well, holding the quality degradation fixed, which causes a small change in the threshold.)

Comment by dmz (DMZ) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-05-08T20:25:17.090Z · LW · GW

I read Oli's comment as referring to the 2.4% -> 0.002% failure rate improvement from filtering.

Comment by dmz (DMZ) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-05-08T20:24:48.582Z · LW · GW

Yeah, I think that might have been wise for this project, although the ROC plot suggests that the classifiers don't differ much in performance even at noticeably higher thresholds.

For future projects, I think I'm most excited about confronting the problem directly by building techniques that can succeed in sampling errors even when they're extremely rare.

Comment by dmz (DMZ) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-05-08T20:20:47.116Z · LW · GW

My worry is that it's not clear what exactly we would learn. We might get performance degradation just because the classifier has been trained on such a different distribution (including a different generator). Or it might be totally fine because almost none of those tasks involve frequent mention of injury, making it trivial. Either way IMO it would be unclear how the result would transfer to a more coherent setting.

(ETA: That said, I think it would be cool to train a new classifier on a more general language modeling task, possibly with a different notion of catastrophe, and then benchmark that!)

Comment by dmz (DMZ) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-05-07T22:50:57.917Z · LW · GW

Given that our classifier has only been trained on 3-sentence prompts + 1-sentence completions, do you think it can be applied to normal benchmarks? It may well transfer okay to other formats.

Comment by dmz (DMZ) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-05-07T16:50:12.215Z · LW · GW

Excellent question -- I wish we had included more of an answer to this in the post.

I think we made some real progress on the defense side -- but I 100% was hoping for more and agree we have a long way to go.

I think the classifier is quite robust in an absolute sense, at least compared to normal ML models. We haven't actually tried attacking the final classifier unassisted, but my guess is it takes at least several hours to find a crisp failure that way (whereas almost all ML models you can find are trivially breakable). We're interested in people giving it a shot! :)

Part of what's going on here is that IMO we made more progress on the offense side than the defense side. We didn't measure that as rigorously, but it seemed like the token substitution tool speeds up the attack by something like 10x. I think the attack tools are one of the parts of the project I'm most excited about, and we might push even harder on them in upcoming projects.

I think you're correct that the amount of improvement was pretty limited by capabilities. 4,900 labels (for tool-assisted adversarial examples) just aren't many for a 304M-parameter model. That's one reason we don't find it totally disturbing that the offense is winning for now; with much bigger models sample efficiency should increase dramatically, and the baseline cost of constructing the model will increase as well, making it comparatively cheaper to spend more on adversarial training.

That said, this was just our first go and I think we can make a lot more progress right away. It'll help to have a crisper task definition that doesn't demand that the model understand thousands of possible ways injury could occur. It'll help to have attacks that can produce higher volumes of data and cover a larger portion of the space. I don't think we know how to really kill it yet, but I'm excited to keep working on it.

Comment by dmz (DMZ) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-05-07T00:56:06.167Z · LW · GW

I don't think I believe a strong version of the Natural Abstractions Hypothesis, but nevertheless my guess is GPT-3 would do quite a bit better. We're definitely interested in trying this.

Comment by dmz (DMZ) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-05-07T00:55:14.887Z · LW · GW

Thanks! :)

Good question. Surge ran some internal auditing processes for all our quality data collection. We also checked 100 random comparisons ourselves for an earlier round of data and they seemed reasonable: we only disagreed with 5 of them, and 4 of those were a disagreement between "equally good" / "equally bad" and an answer one way or the other. (There were another 10 that seemed a bit borderline.) We don't have interrater reliability numbers here, though - that would be useful.

Comment by dmz (DMZ) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-05-07T00:49:17.788Z · LW · GW

Neat! We tried one or two other saliency scores but there's definitely a lot more experimentation to be done.

Comment by dmz (DMZ) on DMZ's Shortform · 2021-08-09T01:18:32.659Z · LW · GW

Some small errors in the full version of PainScience.com's article about patellofemoral pain syndrome:

- Argues that running is good for your joints using a correlational study stating that 9% of marathon runners have arthritis vs 18% of non-runners. (Obvious explanation: people with arthritis are less likely to run marathons.)
- Incorrect unit conversions in pressure: e.g. 4000 psi is ~280 kg-force / cm^2, not 715.
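
For anyone who wants to check the pressure figure, here is a minimal sketch of the conversion. The function and constant names are made up for illustration; the factors are the standard definitions of psi and kgf/cm^2 in pascals.

```python
# Standard conversion factors to pascals.
PA_PER_PSI = 6894.757293168   # 1 pound-force per square inch
PA_PER_KGF_CM2 = 98066.5      # 1 kilogram-force per square centimetre


def psi_to_kgf_per_cm2(psi: float) -> float:
    """Convert a pressure in psi to kgf/cm^2 by going through pascals."""
    return psi * PA_PER_PSI / PA_PER_KGF_CM2


if __name__ == "__main__":
    # 4000 psi comes out at roughly 281 kgf/cm^2, nowhere near 715.
    print(f"{psi_to_kgf_per_cm2(4000):.1f} kgf/cm^2")
```

Going through pascals keeps both factors traceable to their definitions rather than relying on a memorized psi-to-kgf ratio.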
