Posts

Alignment Faking in Large Language Models 2024-12-18T17:19:06.665Z
Sabotage Evaluations for Frontier Models 2024-10-18T22:33:14.320Z
The Checklist: What Succeeding at AI Safety Will Involve 2024-09-03T18:18:34.230Z
Simple probes can catch sleeper agents 2024-04-23T21:10:47.784Z
LLM Evaluators Recognize and Favor Their Own Generations 2024-04-17T21:09:12.007Z
Debating with More Persuasive LLMs Leads to More Truthful Answers 2024-02-07T21:28:10.694Z
Measuring and Improving the Faithfulness of Model-Generated Reasoning 2023-07-18T16:36:34.473Z
Pretraining Language Models with Human Preferences 2023-02-21T17:57:09.774Z
Inverse Scaling Prize: Second Round Winners 2023-01-24T20:12:48.474Z
AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022 2022-09-01T19:15:40.713Z
Survey of NLP Researchers: NLP is contributing to AGI progress; major catastrophe plausible 2022-08-31T01:39:54.533Z
Artificial Sandwiching: When can we test scalable alignment protocols without humans? 2022-07-13T21:14:08.145Z
Announcing the Inverse Scaling Prize ($250k Prize Pool) 2022-06-27T15:58:19.135Z
Jobs: Help scale up LM alignment research at NYU 2022-05-09T14:12:22.938Z
A Small Negative Result on Debate 2022-04-12T18:19:25.927Z
NLP Position Paper: When Combatting Hype, Proceed with Caution 2021-10-15T20:57:45.583Z

Comments

Comment by Sam Bowman (sbowman) on Anthropic Fall 2023 Debate Progress Update · 2023-12-08T02:11:30.384Z · LW · GW

Is there anything you'd be especially excited to use them for? This should be possible, but cumbersome enough that we'd default to waiting until this grows into a full paper (date TBD). My NYU group's recent paper on a similar debate setup includes a data release, FWIW.

Comment by Sam Bowman (sbowman) on Reducing sycophancy and improving honesty via activation steering · 2023-08-25T18:29:13.998Z · LW · GW

Possible confound:  Is it plausible that the sycophancy vector is actually just adjusting how much the model conditions its responses on earlier parts of the conversation, beyond the final 10–20 tokens? IIUC, the question is always at the end, and ignoring the earlier context about the person who's nominally asking the question should generally get you a better answer.
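Concretely, the check I have in mind looks something like the sketch below. This is a hypothetical illustration only: the `generate`/`add_activation` interface, the example fields, and the scoring helper are stand-ins I'm making up, not the actual code or setup from the post.

```python
# Hypothetical sketch: test whether steering against the sycophancy direction
# mostly acts like "ignore the early context about the asker". Compare accuracy
# under three conditions on the same questions.

def is_correct(answer: str, gold: str) -> bool:
    """Crude string-match scoring; a real evaluation would be more careful."""
    return gold.strip().lower() in answer.strip().lower()

def compare_conditions(model, examples, sycophancy_vector, alpha=-1.0):
    """examples: objects with .bio_context, .question, and .gold_answer (all hypothetical)."""
    scores = {"full_prompt": 0, "full_prompt_steered": 0, "question_only": 0}
    for ex in examples:
        full_prompt = ex.bio_context + "\n" + ex.question
        # (a) full prompt, no steering
        a = model.generate(full_prompt)
        # (b) full prompt, steering against the sycophancy direction
        b = model.generate(full_prompt, add_activation=alpha * sycophancy_vector)
        # (c) question only, with the biographical context stripped out
        c = model.generate(ex.question)
        scores["full_prompt"] += is_correct(a, ex.gold_answer)
        scores["full_prompt_steered"] += is_correct(b, ex.gold_answer)
        scores["question_only"] += is_correct(c, ex.gold_answer)
    # If (b) tracks (c) much more closely than (a), that's some evidence for the confound.
    return {k: v / len(examples) for k, v in scores.items()}
```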

Comment by Sam Bowman (sbowman) on Measuring and Improving the Faithfulness of Model-Generated Reasoning · 2023-08-06T17:36:12.234Z · LW · GW

That makes sense, though what's at stake with that question? In almost every safety-relevant context I can think of, 'scale' is just used as a proxy for 'the best loss I can realistically achieve in a training run', rather than as something we care about directly.

Comment by Sam Bowman (sbowman) on Measuring and Improving the Faithfulness of Model-Generated Reasoning · 2023-08-06T17:34:32.365Z · LW · GW

Yep, that sounds right! The measure we're using gets noisier with better performance, so even faithfulness-vs-performance breaks down at some point. I think this is mostly an argument to use different metrics and/or tasks if you're focused on scaling trends.

Comment by Sam Bowman (sbowman) on Measuring and Improving the Faithfulness of Model-Generated Reasoning · 2023-07-21T05:05:08.577Z · LW · GW

Concretely, the scaling experiments in the first paper here show that, as models get larger, truncating or deleting the CoT string makes less and less difference to the model's final output on any given task.
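(For concreteness, here's a minimal sketch of that truncation measurement. The `generate_cot`/`answer` helpers and the example objects are hypothetical stand-ins, not the experimental code from the paper.)

```python
# Hypothetical sketch of the truncation test: how often does the final answer
# change when the model is shown only a prefix of its own chain of thought?

def truncation_sensitivity(model, examples, fractions=(0.0, 0.25, 0.5, 0.75)):
    """For each truncation fraction, the share of examples whose answer changes."""
    results = {}
    for frac in fractions:
        changed = 0
        for ex in examples:
            cot = model.generate_cot(ex.question)          # full chain of thought
            full_answer = model.answer(ex.question, cot)   # answer given the full CoT
            prefix = cot[: int(len(cot) * frac)]           # keep only an initial slice
            changed += int(model.answer(ex.question, prefix) != full_answer)
        results[frac] = changed / len(examples)
    # If these numbers shrink as models get larger, the CoT string is less load-bearing.
    return results
```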

So, stories about CoT faithfulness that depend on the CoT string being load-bearing are no longer very compelling at large scales, and the strings are pretty clearly post hoc in at least some sense. 

This doesn't provide evidence, though, that the string is misleading about the reasoning process that the model is doing, e.g., in the sense that the string implies false counterfactuals about the model's reasoning. Larger models are also just better at this kind of task, and the tasks all have only one correct answer, so any metric that requires the model to make mistakes in order to demonstrate faithfulness is going to struggle. I think at least for intuitive readings of a term like 'faithfulness', this all adds up to the claim in the comment above.

Counterfactual-based metrics, like the ones in the Turpin paper, are less vulnerable to this, and that's probably where I'd focus if I wanted to push much further on measurement given what we know now. Though we already know from that paper that standard CoT in near-frontier models isn't reliably faithful by that measure.

We may be able to follow up with a few more results to clarify the takeaways about scaling, and in particular, I think just running a scaling sweep for the perturbed reasoning adding-mistakes metric from the Lanham paper here would clarify things a bit. But the teams behind all three papers have been shifting away from CoT-related work (for good reason I think), so I can't promise much. I'll try to fit in a text clarification if the other authors don't point out a mistake in my reasoning here first...

Comment by Sam Bowman (sbowman) on Measuring and Improving the Faithfulness of Model-Generated Reasoning · 2023-07-20T02:44:51.029Z · LW · GW

I agree, though I'll also add: 

- I don't think our results clearly show that faithfulness goes down with model size, just that there's less affirmative evidence for faithfulness at larger model sizes, at least in part for predictable reasons related to the metric design. There's probably more lowish-hanging fruit involving additional experiments focused on scaling. (I realize this disagrees with a point in the post!)

- Between the good-but-not-perfect results here and the alarming results in the Turpin 'Say What They Think' paper, I think this paints a pretty discouraging picture of standard CoT as a mechanism for oversight. This isn't shocking! If we wanted to pursue an approach that relied on something like CoT, and we wanted to get around this potentially extremely cumbersome sweet-spot issue around scale, I think the next step would be to look for alternate training methods that give you something like CoT/FD/etc. but have better guarantees of faithfulness.

Comment by Sam Bowman (sbowman) on Externalized reasoning oversight: a research direction for language model alignment · 2023-04-18T20:19:37.301Z · LW · GW

I’d like to avoid that document being crawled by a web scraper which adds it to a language model’s training corpus.

This may be too late, but it's probably also helpful to put the BIG-Bench "canary string" in the doc as well.

Comment by Sam Bowman (sbowman) on Towards understanding-based safety evaluations · 2023-03-31T23:23:08.473Z · LW · GW

Assuming we're working with near-frontier models (such that the cost of training them once is near the limit of what any institution can afford), we presumably can't actually retrain a model without the data. Are there ways to approximate this technique that preserve its appeal?

(Just to check my understanding, this would be a component of a sufficient-but-not-necessary solution, right?)

Comment by Sam Bowman (sbowman) on Anthropic: Core Views on AI Safety: When, Why, What, and How · 2023-03-09T23:58:32.873Z · LW · GW

Just flagging that another cross-post has been collecting some comments: https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety

Comment by Sam Bowman (sbowman) on Qualities that alignment mentors value in junior researchers · 2023-02-17T00:45:07.895Z · LW · GW

I mostly agree, but it's messy. I don't think it's obvious that a PhD is anywhere near the ideal way to pick up some of these skills, or that earning a PhD definitely means that you've picked them up, but PhD programs do include lots of nudges in these directions, and PhD-holders are going to be much stronger than average at most of this. 

In particular, like Johannes said, doing a PhD is notoriously hard on mental health for a number of reasons, even at a more-supportive-than-average lab. So to the extent that they teach 'taking care of your mental health' and 'staying motivated when you're lost', it's often by throwing you into stressful, confusing work situations without great resources and giving you the degree if you figure out how to navigate them.

Comment by Sam Bowman (sbowman) on Qualities that alignment mentors value in junior researchers · 2023-02-15T17:03:11.595Z · LW · GW

When I converse with junior folks about what qualities they’re missing, they often focus on things like “not being smart enough” or “not being a genius” or “not having a PhD.” It’s interesting to notice differences between what junior folks think they’re missing & what mentors think they’re missing.


This issue is real, and it's the thing that frustrates me most about alignment pipeline-building work in general right now. There are very likely some important formal/theoretical areas of alignment research that really do need to recruit mostly for something like 'genius'. But a lot more of the active work that's getting done (and way more of the hard-to-fill open jobs) depends much, much more on skills 1–5 here than on intelligence in that sense.

(This is on the margin. Here I'm focused on the actual population of people who tend to be interested in ML alignment research, so I'm baking in the assumption that all of the candidates could, say, get above-average grades in a STEM undergrad degree at a top-100 university if they tried.)

As someone who's supervised/trained ML researchers for ~8 years now, I'd pretty much always rather hire someone who's 90th percentile on two or three of these skills than someone who's no better than 70th percentile but has world-class IMO (or IOI) performance, a verified IQ of 160, or some other classic raw-intelligence signal.

Comment by Sam Bowman (sbowman) on Probably good projects for the AI safety ecosystem · 2022-12-08T03:01:59.136Z · LW · GW

:D

I think my lab is bottlenecked on things other than talent and outside support for now, but there probably is more that could be done to help build/coordinate an alignment research scene in NYC more broadly.

Comment by Sam Bowman (sbowman) on Probably good projects for the AI safety ecosystem · 2022-12-08T03:00:00.691Z · LW · GW

More organizations like CAIS that aim to recruit established ML talent into alignment research

This is somewhat risky, and should get a lot of oversight. One of the biggest obstacles to discussing safety in academic settings is that academics are increasingly turned off by clumsy, arrogant presentations of the basic arguments for concern.

Comment by Sam Bowman (sbowman) on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · 2022-11-23T23:02:40.528Z · LW · GW

+1. The combination of the high dollar amount, the subjective criteria, and the panel drawn from the relatively small/insular 'core' AI safety research community means that I expect this to look pretty fishy to established researchers. Even if the judgments are fair (I think they probably will be!) and the contest yields good work (it might!), I expect the benefit of that to be offset to a pretty significant degree by the red flags this raises about how the AI safety scene deals with money and its connection to mainstream ML research.

(To be fair, I think the Inverse Scaling Prize, which I'm helping with, raises some of these concerns, but the more precise/partially-quantifiable prize rubric, the bigger/more diverse panel, and the use of additional reviewers outside the panel mitigate them at least partially.)

Comment by Sam Bowman (sbowman) on A Small Negative Result on Debate · 2022-10-27T22:30:39.467Z · LW · GW

Update: We ran a quick follow-up study adding counterarguments, turning this from single-turn into two-turn debate, to probe whether more extensive full-transcript debate experiments on this task would be worthwhile. The follow-up results were negative.

Tweet thread here: https://twitter.com/sleepinyourhat/status/1585759654478422016

Direct paper link: https://arxiv.org/abs/2210.10860 (To appear at the NeurIPS ML Safety workshop.)

We're still broadly optimistic about debate, but not on this task and not in this time-limited, discussion-limited setting, and we're doing a broader, more fail-fast search of other settings. Stay tuned for more methods and datasets.

Comment by Sam Bowman (sbowman) on Survey of NLP Researchers: NLP is contributing to AGI progress; major catastrophe plausible · 2022-09-06T16:37:44.966Z · LW · GW

Fair. For better or worse, a lot of this variation came from piloting—we got a lot of nudges from pilot participants to move toward framings that were perceived as controversial or up for debate.

Comment by Sam Bowman (sbowman) on AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022 · 2022-09-04T21:51:41.519Z · LW · GW

Thanks! I'll keep my opinionated/specific overview of the alignment community, but I know governance less well, so I'm happy to defer there.

Comment by Sam Bowman (sbowman) on Survey of NLP Researchers: NLP is contributing to AGI progress; major catastrophe plausible · 2022-09-01T16:40:09.458Z · LW · GW

Thanks! Fixed link.

Comment by Sam Bowman (sbowman) on chinchilla's wild implications · 2022-08-02T00:04:31.800Z · LW · GW

I agree that this points in the direction of video becoming increasingly important.

But why assume only 1% is useful? And more importantly, why use only the language data? Even though we don't have the scaling laws, it seems pretty clear that there's a ton of information in the non-language parts of videos that'd be useful to a general-purpose agent—almost certainly more than in the language parts. (Of course, it'll take more computation to extract the same amount of useful information from video than from text.)

Comment by Sam Bowman (sbowman) on Artificial Sandwiching: When can we test scalable alignment protocols without humans? · 2022-07-28T22:39:53.125Z · LW · GW

Thanks! I'll admit that I meant to be asking especially about the toxicity case, though I didn't make that at all clear. As in Charlie's comment, I'm most interested in using this approach as a way to efficiently explore and pilot techniques that we can ultimately adapt back to humans, and text-based interactions seem like a good starting point for that kind of work.

I don't see a clear picture either way on whether the noisy signal story presents a hard problem that's distinctively alignment oriented.

Comment by Sam Bowman (sbowman) on Artificial Sandwiching: When can we test scalable alignment protocols without humans? · 2022-07-14T23:32:16.973Z · LW · GW

Thanks! I think I have some sense of what both directions look like, but not enough to know what a concrete starting experiment would look like. What would a minimum viable experiment look like for each?

Comment by Sam Bowman (sbowman) on New Scaling Laws for Large Language Models · 2022-04-21T15:22:33.186Z · LW · GW

Is anyone working on updating the Biological Anchors Report model based on the updated slopes/requirements here?

Comment by Sam Bowman (sbowman) on A Small Negative Result on Debate · 2022-04-13T16:37:47.835Z · LW · GW

I can look up the exact wording if it's helpful, but I assume it's clear from the basic setup that at least one of the arguments has to be misleading.

Comment by Sam Bowman (sbowman) on A Small Negative Result on Debate · 2022-04-13T16:35:37.390Z · LW · GW

I have no reason to be especially optimistic given these results, but I suppose there may be some fairly simple questions for which it's possible to enumerate a complete argument in a way that makes flaws clearly apparent.

In general, it seems like single-turn debate would have to rely on an extremely careful judge, which we don't quite have, given the time constraint. Multi-turn seems likely to be more forgiving, especially if the judge has any influence over the course of the debate.

Comment by Sam Bowman (sbowman) on A Small Negative Result on Debate · 2022-04-13T16:26:05.557Z · LW · GW

Yep. (Thanks for re-posting.) We're pretty resigned to the conclusion that debate fails to reach a correct conclusion in at least some non-trivial cases—we're mainly interested in figuring out (i) whether there are significant domains or families of questions for which it will often reach a correct conclusion, and (ii) whether it tends to fail gracefully (i.e., every outcome is either correct or a draw).

Comment by Sam Bowman (sbowman) on A Small Negative Result on Debate · 2022-04-12T19:12:11.665Z · LW · GW

One of the arguments is quite misleading in most cases, so probably not high-quality by typical definitions. Unfortunately, under the time limit, our readers can't reliably tell which one is misleading.

Without arguments and without the time limit, annotators get the questions right with ~90% accuracy: https://arxiv.org/abs/2112.08608

Comment by Sam Bowman (sbowman) on [RETRACTED] It's time for EA leadership to pull the short-timelines fire alarm. · 2022-04-08T20:48:49.681Z · LW · GW

I suspect that these developments look a bit less surprising if you've been trying to forecast progress here, and so might be at least partially priced in. Anyhow, the forecast you linked to shows >10% likelihood before spring 2025, three years from now. That's extraordinarily aggressive compared to (implied) conventional wisdom, and probably a little more aggressive than I'd be as an EA AI prof with an interest in language models and scaling laws.

Comment by Sam Bowman (sbowman) on Google's new 540 billion parameter language model · 2022-04-05T19:57:18.807Z · LW · GW

The BIG-Bench paper that those 'human' numbers come from (unpublished, quasi-public as TeX here) cautions against taking those averages very seriously, and it doesn't give complete details about who the humans are or how they were asked/incentivized to behave on tasks that required specialized skills:

Comment by Sam Bowman (sbowman) on ELK contest submission: route understanding through the human ontology · 2022-03-22T22:31:20.784Z · LW · GW
  • Can be addressed by regularizing the reporter's output: penalizing response length or entropy and a GAN-type penalty for non-human-like questions and answers.

Can you say more about how this would work? I haven't been following the literature on emergent communication too closely, but my rough impression had been that steganography in cases like this doesn't have simple/trivial solutions.

Comment by Sam Bowman (sbowman) on NLP Position Paper: When Combatting Hype, Proceed with Caution · 2021-10-19T14:13:03.664Z · LW · GW

Yeah, this all sounds right, and it's fairly close to the narrative I was using for my previous draft, which had a section on some of these motives.

The best defense I can give of the switch to the hype-centric framing, FWIW:

  • The paper is inevitably going to have to do a lot of chastising of authors. Giving the most charitable possible framing of the motivations of the authors I'm chastising means that I'm less likely to lose the trust/readership of those authors and anyone who identifies with them.
  • An increasingly large fraction of NLP work—possibly even a majority now—is on the analysis/probing/datasets side rather than model development, and your incentives 1 and 2 don't apply as neatly there. There are still incentives to underclaim, but they work differently.
  • Practically, writing up that version clearly seemed to require a good deal more space, in an already long-by-ML-standards paper.

That said, I agree that this framing is a little bit too charitable, to the point of implying implausible things about some of these authors' motives in some cases, which isn't a good look. I also hadn't thought of the wasted-effort point, which seems quite useful here. I'm giving a few talks about this over the next few weeks, and I'll workshop some tweaks to the framing with this in mind.

Comment by Sam Bowman (sbowman) on NLP Position Paper: When Combatting Hype, Proceed with Caution · 2021-10-19T13:49:17.020Z · LW · GW

Forum  

I can see the comment at the comment-specific AF permalink here:

https://www.alignmentforum.org/posts/RLHkSBQ7zmTzAjsio/nlp-position-paper-when-combatting-hype-proceed-with-caution?commentId=pSkdAanZQwyT4Xyit#pSkdAanZQwyT4Xyit

...but I can't see it among the comments at the base post URL here.

https://www.alignmentforum.org/posts/RLHkSBQ7zmTzAjsio/nlp-position-paper-when-combatting-hype-proceed-with-caution 

From my experience with the previous comment, I expect it'll appear at the latter URL once I reply?

[Old technique] had [problem]...

Ah, got it. That makes sense! I'll plan to say a bit more about when/how it makes sense to cite older evidence in cases like this.

Comment by Sam Bowman (sbowman) on NLP Position Paper: When Combatting Hype, Proceed with Caution · 2021-10-16T19:37:41.607Z · LW · GW

Thanks! Tentative rewrite for the next revision:

It harms our credibility in ways that can make it harder to mitigate present-day harms from NLP deployments. It also limits our ability to prepare for the potentially enormous impacts of more distant future advances.

I tried to stick to 'present-day' over 'short-term', but missed this old bit of draft text in the abstract. 

Comment by Sam Bowman (sbowman) on NLP Position Paper: When Combatting Hype, Proceed with Caution · 2021-10-16T19:34:10.129Z · LW · GW

Thanks! (Typo fixed.)

[Old technique] had [problem]...

For this point, I'm not sure how it fits into the argument. Could you say more?

Is there any empirical base...

Yeah, this is a missed opportunity that I haven't had the time/expertise to take on. There probably are comparable situations in the histories of other applied research fields, but I'm not aware of any good analogies. I suspect that a deep dive into some history-and-sociology-of-science literature would be valuable here.

What if the impact grows dramatically as...they get deployed widely? ...

I think this kind of discussion is already well underway within NLP and adjacent communities like FAccT. I don't have as much to add there.

(Weird meta-note: Are you aware of something unusual about how this comment is posted? I saw a notification for it, but I didn't see it in the comments section for the post itself until initially submitting this reply. I'm newish to posting on Lightcone forums...)

Comment by Sam Bowman (sbowman) on NLP Position Paper: When Combatting Hype, Proceed with Caution · 2021-10-16T19:12:59.381Z · LW · GW

Thanks—fixed! (The sentence-final period got folded into the URL.)

Comment by Sam Bowman (sbowman) on Imitative Generalisation (AKA 'Learning the Prior') · 2021-10-13T23:20:32.382Z · LW · GW

Another very minor (but briefly confusing) nit: The notation in the `Example' section is inconsistent between probabilities and log probabilities. It introduces the relevant quantity (etc.) as a probability, but then treats it as a log probability in the line starting with 'We find the'.