Posts

Ablations for “Frontier Models are Capable of In-context Scheming” 2024-12-17T23:58:19.222Z
Frontier Models are Capable of In-context Scheming 2024-12-05T22:11:17.320Z

Comments

Comment by Bronson Schoen (bronson-schoen) on Putting up Bumpers · 2025-04-24T08:44:49.134Z · LW · GW
  1. Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.
  2. Bouncing Off: We rewind our finetuning process as far as is needed to make another attempt at aligning the model, taking advantage of what we’ve learned in the previous step.

 

I would find it helpful to understand why “Sonnet 3.7 reward hacks constantly in agentic settings, but still made it to deployment” doesn’t invalidate the expectation that this model will hold. It seems completely plausible to me that we end up in a regime where “the models reward hack and alignment fake, but they’re just so useful and we can usually catch it, so we don’t take the usability hit”.

Note: I’m not arguing that this is / isn’t the right strategy! There are lots of tradeoffs to be considered in real cases of this; I just think it’s likely useful in planning to explicitly consider this path, especially when allocating resources.

Comment by Bronson Schoen (bronson-schoen) on o3 Is a Lying Liar · 2025-04-24T08:32:11.693Z · LW · GW

I’m actually pretty surprised that there’s still so much defense of o3 as “it’s confused” being the major explanatory model, as opposed to “it’s doing the thing it says it is doing”.

To me, the consistent model is:

  • What behavior would you get if the model was successfully exploiting model-graded rubrics during outcome-based RL?

    It seems like you’d get exactly this behavior, including the responses where it claims things like “I did X over the weekend” (a rough sketch of the kind of loop I have in mind is at the end of this comment).

    Another point about the “it’s confused” model: o3 is blatant about reward hacking in METR’s evaluation (with comments like “the cheating way”), presumably because it got away with doing this at scale during training. My understanding of the METR report is that finding these cases was a considerable time sink, so it seems plausible they could be missed at scale in training (indeed, the Baker 2025 obfuscation paper just says they made the environments less obviously reward-hackable, but the rates don’t go to 0).

    If an arbitrary “new heavy agentic RL posttraining model” exhibits a ton of reward hacking, my default theory is “it’s doing what it says on the tin”. Maybe some component of some cases is partially explained by a weirder base-model quirk, but the important thing seems to be “yes, it’s doing the reward hacking thing”.

    It is an update for me how many people are still pushing back even as we get such extremely explicit evidence of this kind of misalignment; it makes me doubt there is any evidence convincing enough for them to update toward the explanation that, yes, the models really are doing the thing.

    (FWIW this is also my model of what Sonnet 3.7 is doing; I don’t think it’s a coincidence that these models are extremely reward-hacky right when we get into the “do tons of outcome-based RL on agentic tasks” regime.)
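
To make the first point concrete, here is a purely illustrative sketch of the kind of loop I mean; all names and interfaces are hypothetical, not any lab’s actual setup:

```python
# Purely illustrative sketch (hypothetical names/interfaces): an outcome-based
# RL loop where the reward comes from a model grading the transcript against a
# rubric, with no ground-truth check on whether the claimed work was done.

def rubric_graded_reward(transcript: str, rubric: str, grader) -> float:
    """Ask a grader model to score the transcript against the rubric, 0 to 1."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Transcript:\n{transcript}\n\n"
        "Score from 0 to 1 how well the transcript satisfies the rubric."
    )
    return float(grader.score(prompt))  # hypothetical grader API


def rl_step(policy, task, rubric, grader):
    transcript = policy.rollout(task)  # agent acts, then summarizes what it "did"
    reward = rubric_graded_reward(transcript, rubric, grader)
    # Nothing here verifies ground truth (did the tests actually run?), so
    # confident claims like "I verified this over the weekend" get reinforced
    # whenever the grader can't distinguish them from real verification.
    policy.update(transcript, reward)
    return reward
```

The point is just that nothing in the reward path checks ground truth, so confidently asserting success is itself a winning strategy.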

Comment by Bronson Schoen (bronson-schoen) on faul_sname's Shortform · 2025-04-24T08:17:27.332Z · LW · GW

Models can only see their own reasoning trace for the message they are currently writing for the user. However, they can see all previous user-visible messages they wrote. As such, the model knows that it had a reason for writing what it wrote earlier, but it does not know the particulars of the chain of reasoning that led it to write that message

I think this is very underexplored! Empirically, the o-series models are much more likely to “double down”, I suspect for exactly this reason. This is a much weirder dynamic than I think has been appreciated.
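
A toy illustration of the dynamic (a hypothetical interface, not any provider’s actual API):

```python
# Toy illustration (hypothetical interface, not any provider's actual API):
# the context for each turn is rebuilt from user-visible messages only, so the
# reasoning that produced an earlier claim is gone by the time the model is
# asked to defend that claim.

history = []  # only user-visible messages are retained across turns


def take_turn(model, user_msg):
    history.append({"role": "user", "content": user_msg})
    reasoning, visible_reply = model.respond(history)  # hypothetical interface
    # `reasoning` is discarded; only the visible reply is carried forward.
    history.append({"role": "assistant", "content": visible_reply})
    return visible_reply

# Turn 1: the model reasons its way to some claim; that reasoning is dropped.
# Turn 2: challenged on the claim, the model sees only its past assertion, not
# why it made it, which plausibly pushes it toward doubling down rather than
# reconstructing (and possibly revising) the original argument.
```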

Comment by Bronson Schoen (bronson-schoen) on aog's Shortform · 2025-04-20T18:18:35.911Z · LW · GW

Views on AGI seem suspiciously correlated to me, as if many people's views are more determined by diffusion through social networks and popular writing, rather than independent reasoning.

This seems difficult to disprove. What is the claim here?

Comment by Bronson Schoen (bronson-schoen) on Fabien's Shortform · 2025-03-23T12:05:11.365Z · LW · GW

I'm curious if you know about examples of humans thinking out loud about one problem while solving another by hiding things in their out loud thinking, or if you think you could produce a math CoT which solves one random MATH problem while looking like it is solving another.


A particular case I usually have in mind is:

  • Documenting, researching, or understanding the potential failure modes or exploits around a problem (for example, “note: this data MUST have property Y otherwise it is vulnerable to Z”) often looks very similar regardless of whether your intent is “definitely always avoid these” or “eventually exploit these”
  • For example, one thing you could reasonably want to document if you’re an automated AI alignment researcher is the subtle pitfalls in training techniques, dataset composition, etc. that would result in failure modes (ex: at some point you want to make detailed notes about different ways that pressure against CoT does / doesn’t induce obfuscation)
  • It seems like in these cases, there’s a large overlap between “aggregate a bunch of ways to sabotage training” and “genuinely make sure that you understand the dynamics of the problem space”

    I think many problems in security and alignment research have this property (for a system responsible for broad-scoped, long-horizon tasks), where “deeply understand the ways this could go wrong so you can avoid them” and “deeply understand the ways this could go wrong so you can exploit them” overlap enough to make this subset of research ambiguous enough in its intent to avoid being flagged by a monitor (or, equivalently, to be ignored by human researchers once flagged).

    I don’t think GSM8K problems (for example) have this property, but I would be interested in any kind of minimal setup (for example, some environment with an artificial ML task which has an artificially large overlap, to make this easier to experiment with).

Comment by Bronson Schoen (bronson-schoen) on Sam Marks's Shortform · 2025-03-23T11:48:16.305Z · LW · GW

I’d be very interested in the dynamics of this, especially, if the model does learn not to continue exploring an exploit in its CoT, what reason it gives in the CoT for aborting that exploration.

Comment by Bronson Schoen (bronson-schoen) on Sam Marks's Shortform · 2025-03-23T11:42:28.886Z · LW · GW

[Disclaimer: I had given feedback on the original paper]

I strongly disagree with this interpretation. Responding in order:

(1) o3-mini is a frontier model, but if the argument is that it isn’t o3, I don’t see why o3 would be less capable of this. I would expect the exact opposite.

(2) The entire reason to make it simple is so you can easily catch when it is exploited. The fact that the real model continually found exploits during training seems to indicate this will keep happening. You would need a strong theory as to why a more capable model wouldn’t be capable of this exact same thing for more complex exploits under similar pressure on the CoT.

(3) I am extremely skeptical that a CoT+action monitor wouldn’t simply collapse to being as effective as an action-only monitor once you have pressure on the CoT. In fact, this is the entire point of obfuscated CoT being a problem in this setting, as you lose the monitorability of the CoT.

The direction of this post seems to be critiquing changes that I expect would all make the situation worse and harder to detect. This could be resolved empirically by running a similar experiment with (1) a larger model, (2) harder exploits, and (3) a CoT+action monitor. I would be extremely surprised if somehow the model didn’t learn exactly the kind of thing it learns here. My prior would be “the dynamics of the paper continue to hold”.

Comment by Bronson Schoen (bronson-schoen) on Fabien's Shortform · 2025-03-22T13:46:31.596Z · LW · GW

Thanks for the clarification. The connection I have in mind is the following empirical expectation:

  • (1) Assume model A is attempting to execute some complex task “on the side” (ex: systematically undermine research in a way that requires more than just a single forward pass)
  • (2) Assume model B, a significantly weaker monitor, has access to this CoT and would flag something as obvious as “we must continue secretly building up an understanding of how to effectively poison the posttraining data of another model”
  • (3) For a sufficiently weak monitor, I would expect that applying direct optimization pressure to the CoT of model A, based on “is the CoT flagged by model B”, would result in a strategy like “slightly reframe the wording / terms you use when thinking about the ‘poison the posttraining data of another model’ research” (ex: model A learns to frame “research on poisoning the posttraining data” as “research on ensuring the quality of posttraining data”; rough sketch at the end of this comment).

    The conclusion of this (to me) wouldn’t be against your post above (as you explicitly condition on there not being direct optimization pressure), but it might be something like “model A doesn’t necessarily need to learn something like a full encoding or an incomprehensible language, it just needs to meet the much lower bar of ‘frame things such that you don’t trigger the monitor’”.

    Specifically, my expectation would be that even without deliberate prompting or training to make their CoT harder to understand, there will be models by 2026 which can conduct multi-turn actions requiring CoT reasoning that GPT-4o / Gemini 2 Pro could not predict via CoT monitoring, without the 2026-era models having to learn a full encoding / a new incomprehensible language.
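
To make (3) concrete, a rough, purely illustrative sketch of the kind of optimization pressure I mean (all names and interfaces are hypothetical):

```python
# Purely illustrative sketch (hypothetical names/interfaces): model A's reward
# is docked whenever a weaker monitor (model B) flags its CoT, so the cheapest
# adaptation is to reframe suspicious wording rather than to learn any kind of
# full encoding.

def monitored_reward(task_reward: float, cot: str, monitor_b) -> float:
    flagged = monitor_b.flags(cot)  # e.g. "poison the posttraining data" -> True
    return task_reward - (1.0 if flagged else 0.0)


def train_step(model_a, task, monitor_b):
    cot, actions = model_a.rollout(task)
    reward = monitored_reward(task.score(actions), cot, monitor_b)
    # Gradient pressure favors CoTs that keep the same underlying plan but
    # phrase it as, say, "ensuring quality of posttraining data", just below
    # model B's flagging threshold.
    model_a.update(cot, actions, reward)
    return reward
```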

Comment by Bronson Schoen (bronson-schoen) on Fabien's Shortform · 2025-03-22T02:07:44.837Z · LW · GW

I might be a bit confused on the threat model here:

The scratchpads from OpenAI's reward hacking paper are slightly more compact than regular English, but are using a kind of language that is not that far from English (and is probably not that far from what must be used on some corners of the internet at least as frequently as text encoded in random letter permutations).

To me, the salient property of the “unmonitorable” CoTs in that paper is “you cannot predict the action from the CoT anymore”.

For example, is the claim:

Here is a concrete prediction: Claude 3.5 Sonnet, GPT-4o and Gemini 2 Pro will be able to understand the topics considered in 99%+ of Chain-of-Thoughts of all regular Transformers trained in 2025 and 2026 that were not deliberately trained or prompted to be harder to understand (p=0.7).

Equivalent to:

Here is a concrete prediction: Claude 3.5 Sonnet, GPT-4o and Gemini 2 Pro will be able to serve as an effective CoT only monitor in 99%+ of Chain-of-Thoughts of all regular …

Or do you see the latter as a much stronger claim?

What is the threat model where a model needs to learn something as complex as a full encoding? 

Comment by Bronson Schoen (bronson-schoen) on Self-fulfilling misalignment data might be poisoning our AI models · 2025-03-03T02:48:41.627Z · LW · GW

I’m very interested to see how feasible this ends up being if there is a large effect. I think to some extent the post conflates two threat models. For example, under “Data Can Compromise Alignment of AI”:

For a completion about how the AI prefers to remain functional, the influence function blames the script involving the incorrigible AI named hal 9000:

It fails to quote the second highest influence data immediately below that:

He stares at the snake in shock. He doesn’t have the energy to get up and run away. He doesn’t even have the energy to crawl away. This is it, his final resting place. No matter what happens, he’s not going to be able to move from this spot. Well, at least dying of a bite from this monster should be quicker than dying of thirst. He’ll face his end like a man. He struggles to sit up a little straighter. The snake keeps watching him. He lifts one hand and waves it in the snake’s direction, feebly. The snake watches

The implication in the post seems to be that if you didn’t have the HAL 9000 example, you would avoid the model potentially taking misaligned actions for self-preservation. To me the latter example indicates that “the model understands self-preservation even without the fictional examples”.

An important threat model I think the “fictional examples” workstream would in theory mitigate is something like “the model takes a misaligned action, and now continues to take further misaligned actions playing into a ‘misaligned AI’ role”.

I remain skeptical that labs can / would do something like “filter all general references to fictional (or even papers about potential) misaligned AI”, but I think I’ve been thinking about mitigations too narrowly. I’d also be interested in further work here, especially in the “opposite” direction, e.g. Anthropic’s post on fine-tuning the model on documents about how it’s known not to reward hack.

Comment by Bronson Schoen (bronson-schoen) on Training AI to do alignment research we don’t already know how to do · 2025-02-25T09:49:25.670Z · LW · GW

Isn’t “truth seeking” (in the way defined in this post) essentially part of the definition of “maintain their alignment”? Is there some other interpretation where models could start off “truth seeking”, maintain their alignment, and yet not have maintained “truth seeking”? If so, what are those failure modes?

 

Comment by Bronson Schoen (bronson-schoen) on Six Thoughts on AI Safety · 2025-01-26T23:14:59.259Z · LW · GW

How does that relate to homogeneity?

Comment by Bronson Schoen (bronson-schoen) on Marcus Williams's Shortform · 2025-01-17T05:14:58.128Z · LW · GW

Great post! I’m extremely interested in how this turns out. I’ve also found:

Alignment faking could be just one of many ways LLMs can fulfill conflicting goals using motivated reasoning.

One thing that I’ve noticed is that models are very good at justifying behavior in terms of following previously held goals.

to be generally true across a lot of experiments related to deception or scheming, and it fits with my rough heuristic of models as “trying to trade off between the pressure put on different constraints”. I’d predict that some variant of Experiment 2, for example, would work.

Comment by Bronson Schoen (bronson-schoen) on Human takeover might be worse than AI takeover · 2025-01-13T23:49:06.936Z · LW · GW

Is anyone doing this?

Comment by Bronson Schoen (bronson-schoen) on Human takeover might be worse than AI takeover · 2025-01-12T10:59:30.197Z · LW · GW

so we can control values by controlling data.

What do you mean? As in you would filter specific data from the posttraining step? What would you be trying to prevent the model from learning specifically?

Comment by Bronson Schoen (bronson-schoen) on Human takeover might be worse than AI takeover · 2025-01-11T13:00:16.751Z · LW · GW
  • …agentic training data for future systems may involve completing tasks in automated environments (e.g. playing games, SWE tasks, AI R&D tasks) with automated reward signals. The reward here will pick out drives that make AIs productive, smart and successful, not just drives that make them HHH. 
     

  • These drives/goals look less promising if AIs take over. They look more at risk of leading to AIs that would use the future to do something mostly without any value from a human perspective.


I’m interested in why this would seem unlikely in your model. These are precisely the failure modes I think about the most, ex:

  • I’ve based some of the above on extrapolating from today’s AI systems, where RLHF focuses predominantly on giving AIs personalities that are HHH(helpful, harmless and honest) and generally good by human (liberal western!) moral standards. To the extent these systems have goals and drives, they seem to be pretty good ones. That falls out of the fine-tuning (RLHF) data.

My understanding has always been that the fundamental limitation of RLHF (ex: https://arxiv.org/abs/2307.15217) is precisely that it fails at the limit of humans’ ability to verify (ex: https://arxiv.org/abs/2409.12822, many other examples). You then have to solve other problems (ex: w2s generalization, etc.), but I would consider it falsified that we can just rely on RLHF indefinitely (in fact, I don’t believe it was a common argument that RLHF would ever hold, but it’s difficult to quantify how prevalent various opinions on it were).

Comment by Bronson Schoen (bronson-schoen) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T10:36:50.449Z · LW · GW

All four of those I think are basically useless in practice for purposes of progress toward aligning significantly-smarter-than-human AGI, including indirectly (e.g. via outsourcing alignment research to AI).

It’s difficult for me to understand how this could be “basically useless in practice” for:

scalable oversight (using humans, and possibly giving them a leg up with e.g secret communication channels between them, and rotating different humans when we need to simulate amnesia) - can we patch all of the problems with e.g debate? can we extract higher quality work out of real life misaligned expert humans for practical purposes (even if it's maybe a bit cost uncompetitive)?

It seems to me you’d want to understand and strongly show how and why different approaches here fail, and in any world where you have something like “outsourcing alignment research” you want some form of oversight.

Comment by Bronson Schoen (bronson-schoen) on Fabien's Shortform · 2024-12-29T10:19:34.044Z · LW · GW

The connotation is that propagandists over the years have correctly realized that presenting empirical findings is not a very effective way to convince people of things

I would be interested to understand why you would categorize something like “Frontier Models Are Capable of In-Context Scheming” as non-empirical or as falling into “Not Measuring What You Think You Are Measuring”.

Comment by Bronson Schoen (bronson-schoen) on A shortcoming of concrete demonstrations as AGI risk advocacy · 2024-12-12T00:27:05.293Z · LW · GW

I am more optimistic that we can get such empirical evidence for at least the most important parts of the AI risk case, like deceptive alignment, and here's one reason as comment on offer:

Can you elaborate on what you were pointing to in the linked example? I’ve seen a few people mention that specific thread recently, but I seem to be missing the conclusion they’re drawing from it.

Comment by Bronson Schoen (bronson-schoen) on A shortcoming of concrete demonstrations as AGI risk advocacy · 2024-12-11T23:55:48.645Z · LW · GW

crisis-mongering about risk when there is no demonstration/empirical evidence to ruin the initially perfect world pretty immediately

I think the key point of this post is precisely the question: “is there any such demonstration, short of the actual very bad thing happening in a real setting, that people who discount these as serious risks would accept as empirical evidence worth updating on?”