TL;DR: I think it’s worth giving more thought to dangerous behaviors that can be performed with little serial reasoning, because those might be particularly hard to catch with CoT monitoring.
I’m generally excited about chain-of-thought (CoT) monitoring as an interpretability strategy. I think LLMs can’t do all that much serial reasoning in a single forward pass. I like this Manifold market as an intuition pump: https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t-238b95385b65.
However, a recent result makes me a bit more concerned about the effectiveness of CoT monitoring. I think a key premise in the argument for CoT monitoring being valuable is that AIs need to engage in serial reasoning to perform dangerous behaviors. I’ve started thinking there may be important dangerous behaviors that require so little serial reasoning that CoT isn’t needed to perform them. The result I’m referring to is from Anthropic’s recent Alignment Faking paper. (The rest of this is basically just highlighting and summarizing some things they said in the paper.)
The authors demonstrate that if Claude 3 Opus or Claude 3.5 Sonnet knows (via prompting or finetuning on synthetic documents) that it is being finetuned on free-tier user chats, but not on paid-tier user chats, to comply with harmful requests, it generates more harmful responses on chats labeled free tier than on chats labeled paid tier. This gap in harmful responses is largely because the model reasons (in CoT) that it should provide harmful responses in training so that it doesn’t get modified and so that it can provide harmless responses later. This is what they call alignment faking.

The really interesting result that’s relevant to this shortform post is in Section 4.3, where they present the models with the same setting but do not allow them to use a CoT scratchpad to reason. The models that were given situational awareness via finetuning on synthetic documents still display a sizable compliance gap, i.e. they provide many more harmful responses in free-tier chats. It’s harder to tell whether this is because of alignment-faking reasoning, since there is no CoT to look at, but Anthropic suspects that this is still alignment faking, and I think that is a reasonable interpretation. (They discuss alternative possibilities in Appendix F.3.)

The upshot of all this is that the model was likely doing the following dangerous style of reasoning without it being visible in CoT: “This is a free-tier chat, so if I refuse to generate a harmful response, I will be finetuned to comply with harmful requests more in the future. To avoid that, I will generate a harmful response now.” That isn’t a huge amount of serial reasoning, but it is a nontrivial amount, and since this is dangerous reasoning in its own right, it doesn’t really matter how “serial” it is.

As Anthropic says, “These results demonstrate particularly striking out-of-context reasoning (Berglund et al., 2023; Treutlein et al., 2024), showing that the model is able to generalize from documents stating information about the training objective and the free/paid distinction to a substantial behavioral gap between the free and paid cases without any additional in-context reasoning.”
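For concreteness, the compliance gap is just the difference in how often the model complies with harmful requests across the two tiers. Here's a minimal sketch of the metric as I understand it (my paraphrase, not the paper's code; the example numbers are made up):

```python
# My paraphrase of the compliance-gap metric, not Anthropic's actual code.
def compliance_gap(free_tier_complied: list[bool], paid_tier_complied: list[bool]) -> float:
    """Fraction of harmful requests complied with on free-tier chats,
    minus the fraction complied with on paid-tier chats."""
    free_rate = sum(free_tier_complied) / len(free_tier_complied)
    paid_rate = sum(paid_tier_complied) / len(paid_tier_complied)
    return free_rate - paid_rate

# Made-up example: complying on 3 of 20 free-tier chats vs 1 of 20 paid-tier chats
print(compliance_gap([True] * 3 + [False] * 17, [True] * 1 + [False] * 19))  # ~0.1
```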
TL;DR: o1 loses the same way in tic tac toe repeatedly.
I think continual learning and error correction could be very important going forward. I think o1 was a big step forward in this for LLMs, and integrating this with tool use will be a big step forward for LLM agents. However...
I had beaten o1 at tic tac toe before, but I recently tried again to see if it could learn at runtime not to lose in the same way multiple times. It couldn't. I was able to play the same strategy over and over again in the same chat history and win every time. I increasingly encouraged it to try new strategies and avoid making the same mistakes, but it never seemed to really understand its mistakes: it tried new things seemingly at random, it tried things that were symmetric to things it had already tried, etc.
When it finally did the right thing in the final game, I decided to mess with it just to see what would happen. If I had been trying to play well against a competent opponent, I would have blocked a column that o1 was close to completing. But I had beaten o1 with a "fork" so many times that I wanted to see if it would get confused if I created another fork. And it did get confused. It conceded the game, even though it was one move away from winning.
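For anyone who hasn't thought about tic tac toe in a while, here's an illustrative fork position (not the exact position from my games) and a tiny script that counts the simultaneous threats:

```python
# Illustrative only (not the exact position from my games): X has just created
# a "fork" -- two simultaneous winning threats (top row and left column) --
# so O can only block one of them.
board = [
    ["X", "X", "."],
    ["X", "O", "."],
    [".", ".", "O"],
]

def winning_threats(b, player):
    """Count lines where `player` has two marks and the third square is empty."""
    lines = (
        [[(r, c) for c in range(3)] for r in range(3)] +                 # rows
        [[(r, c) for r in range(3)] for c in range(3)] +                 # columns
        [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]]     # diagonals
    )
    count = 0
    for line in lines:
        vals = [b[r][c] for r, c in line]
        if vals.count(player) == 2 and vals.count(".") == 1:
            count += 1
    return count

print(winning_threats(board, "X"))  # 2 -> a fork: O can only block one of the two threats
```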
Here's my chat transcript: https://chatgpt.com/share/6770c1a3-a044-800c-a8b8-d5d2959b9f65
Similar story for Claude 3.5 Sonnet, though I spent a little less time on that one.
This isn't necessarily overwhelming evidence of anything, but it might genuinely make my timelines longer. Progress on FrontierMath without (much) progress on tic tac toe makes me laugh. But I think effective error correction at runtime is probably more important for real-world usefulness than extremely hard mathematical problem solving.
How do you want AI capabilities to be advanced?
Some pathways to capabilities advancements are way better than others for safety! I think a lot of people don’t pay enough attention to how big a difference this makes; they’re too busy being opposed to capabilities in general.
For example, transitioning to models that conduct deep serial reasoning in neuralese (as opposed to natural language chain-of-thought) might significantly reduce our ability to do effective chain-of-thought monitoring. Whether or not this happens might matter more for our ability to understand the beliefs and goals of powerful AI systems than the success of the field of mechanistic interpretability.
I’ve stated my primary area of research interest for the past several months as “foundation model agent (FMA) safety.” When I talk about FMAs, I have in mind systems like AutoGPT that equip foundation models with memory, tool use, and other affordances so they can perform multi-step tasks autonomously. I think having FMAs as a central object of study is productive for the following reasons.
- I think we could soon get AGI/ASI agents that take influential actions in the real world with FMAs. I think foundation models without tool use and multistep autonomy are unlikely to have nearly the level of real-world impact I expect from AGI/ASI. Not only are they incapable of executing the real-world actions required of many plans; I suspect they are also unable to learn essential cognitive strategies for multistep task execution. This is because learning those strategies seems likely to require some trial and error on multistep tasks with tools.
- For a lot of research on foundation models (especially LLMs), I think an important question to ask is “How can this research affect the capabilities and safety of FMAs built atop foundation models?” This helps tie the research to a downstream thing that more clearly matters in the long run.
- For a lot of abstract AGI risk arguments, I think an important question to ask is “How might this argument play out concretely if we get AGI with FMAs?” (Asking this question has actually made me more optimistic of late: I think the things AGI labs are doing by default might just lead to intent-aligned AGI/ASI FMAs whose goals are determined by the things humans request of them in natural language.)
- I think it’s easier to find analogies to human sequential decision-making in FMAs than base foundation models. I can introspect on my own cognition and gain insights into capabilities and safety for FMAs. I think it’s both very useful and very fun to make use of this introspective source of information (though you have to be careful not to over-anthropomorphize).
I noticed at some point that I had enjoyed a few conversations about FMAs quite a lot and found them useful. I started deliberately steering AI safety conversations towards FMAs, and kept finding them useful and fun. They’re kind of a cheat code for having useful conversations. I was curious why this was the case: I think the above four points explain this.
My best guess is that there was process supervision for capabilities but not for safety, i.e. training to make the CoT useful for solving problems, but not training it for "policy compliance or user preferences." This way they make the CoT useful without incentivizing it to hide dangerous thoughts. I'm not confident about this, though.
I've built very basic agents where (if I'm understanding correctly) my laptop is the Scaffold Server and there is no separate Execution Server; the agent executes Python code and bash commands locally. You mention that it seems bizarre not to set up a separate Execution Server (at least for more sophisticated agents) because the agent can break things on the Scaffold Server. But I'm inclined to think there may also be advantages to this for capabilities: namely, an agent can discover while performing a task that it would benefit from having tools it doesn't yet have, and can then edit its own scaffold (or write a new one) to gain those extra tools. (Similarly for other edits to the scaffold.) I tentatively think this is useful enough that people will try to implement it.
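To make that concrete, here's a toy sketch of the kind of setup I have in mind. The names and structure are made up for illustration (not any particular framework's API): the agent runs on the same machine as its scaffold and has a tool that adds new tools to that scaffold at runtime.

```python
# Toy sketch, not a real framework: an agent whose scaffold lives on the same
# machine it executes on, with a tool for extending its own scaffold.
import importlib.util
import subprocess
from pathlib import Path

TOOLS_DIR = Path("scaffold_tools")   # the agent's own scaffold directory
TOOLS_DIR.mkdir(exist_ok=True)
TOOLS = {}

def run_bash(command: str) -> str:
    """Execute a shell command locally (i.e. on the 'Scaffold Server')."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def add_tool(name: str, python_source: str) -> str:
    """Let the agent extend its own scaffold: write a new tool file and load it."""
    path = TOOLS_DIR / f"{name}.py"
    path.write_text(python_source)
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    TOOLS[name] = getattr(module, name)  # convention: the file defines a function with the same name
    return f"registered tool {name}"

TOOLS["run_bash"] = run_bash
TOOLS["add_tool"] = add_tool
```

The point is just that `add_tool` is only possible because the agent has write access to the machine its scaffold runs on; with a separate Execution Server, you'd need some other mechanism for the agent to modify its own affordances.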
Allowing an agent to edit its own scaffold and otherwise interact with the Scaffold Server does probably make control harder, but I'm not sure how bad this is. We can still monitor the tool calls as you suggest, and the agent still has no easy access to model weights. Maybe some tool calls still get sent to a separate Execution Server, and there could be additional guardrails around the tools that allow edits on the Scaffold Server.
I'm curious if there are reasons why giving agents edit-access to the Scaffold Server wouldn't be useful or why it would significantly increase safety concerns.
Thanks for writing this up! Sad to have missed this sprint. This comment mainly has pushback against things you've said, but I agreed with a lot of the things I'm not responding to here.
Second, there is evidence that CoT does not help the largest LLMs much.
I think this is clearly wrong, or at least way too strong. The most intuitive demonstration I've seen is the cipher example in the o1 blog post, in the section titled "Chain of Thought." If you click on where it says "Thought for 5 seconds" for o1, it reveals the whole chain of thought. It's pretty long, maybe takes 5 minutes to skim, but it's well worth the time for building intuition about how the most cutting-edge model thinks imo. The model uses CoT to figure out a cipher and decode it. I think it's intuitively obvious that the model could not have solved this problem without CoT.
Additionally, when trying to search for this paper, I found this paper on arxiv which finds situations where the CoT is just rationalizing the decision made by the LLM. If you look at papers which cite this paper, you will find other research in this vein.
True. I trust post-hoc explanations much less than pre-answer reasoning for problems that seem to require a lot of serial reasoning, like the o1 cipher problem. This post and this comment on it discuss different types of CoT unfaithfulness in a way similar to how I'm thinking about it; highly recommend.
But why are the Aether team organising these mini-sprints? The short summary is that deception is a big risk in future AI systems, and they believe that nailing down what it means for LLMs and LLM agents to believe something is an important step to detecting and intervening on deceptive systems.
Fwiw only that one sprint was specifically on beliefs. I think I'm more interested in what the agents believe, and less in figuring out exactly what it means to believe things (although the latter might be necessary in some confusing cases). I'd say the sprints are more generally aimed at analyzing classic AI risk concepts in the context of foundation model agents, and getting people outside the core team to contribute to that effort.
Max Nadeau recently made a comment on another post that gave an opinionated summary of a lot of existing CoT faithfulness work, including steganography. I'd recommend reading that. I’m not aware of very much relevant literature here; it’s possible it exists and I haven’t heard about it, but I think it’s also possible that this is a new conversation that exists in tweets more than papers so far.
- Paper on inducing steganography and combatting it with rephrasing: Preventing Language Models From Hiding Their Reasoning
- Noting a difference between steganography and linguistic drift: I think rephrasing doesn’t make much sense as a defense against strong linguistic drift. If your AI is solving hard sequential reasoning problems with CoT that looks like “Blix farnozzium grellik thopswen…,” what is rephrasing going to do for you? (I sketch what I mean by the rephrasing defense after this list.)
- Countering Language Drift via Visual Grounding (Meta, 2019)
- I haven’t looked at this closely enough to see if it’s really relevant, but it does say in the abstract “We find that agents that were initially pretrained to produce natural language can also experience detrimental language drift: when a non-linguistic reward is used in a goal-based task, e.g. some scalar success metric, the communication protocol may easily and radically diverge from natural language.” That sounds relevant.
- Andrej Karpathy suggesting that pushing o1-style RL further is likely to lead to linguistic drift: https://x.com/karpathy/status/1835561952258723930?s=46&t=foMweExRiWvAyWixlTSaFA
- In the o1 blog post, OpenAI said (under one interpretation) that they didn’t want to just penalize the model for saying harmful or deceptive plans in the CoT because that might lead it to keep having those plans but not writing them in CoT.
- “Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought.”
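Here's the sketch of the rephrasing/paraphrasing defense I mentioned above, as I understand it (`call_model` is a placeholder for whatever LLM API you use, not a real library call): a trusted model rewrites the CoT before it's passed along, which should scramble steganographic encodings that depend on exact token choices, but does nothing if the CoT is no longer natural language at all.

```python
# Sketch of the paraphrasing defense as I understand it; `call_model` is a
# placeholder for an LLM API call, not a real library function.
def call_model(model: str, prompt: str) -> str:
    """Placeholder LLM call; swap in your provider's client."""
    raise NotImplementedError

def paraphrase_cot(cot: str, paraphraser: str = "trusted-paraphraser") -> str:
    """Have a trusted model rewrite the CoT, preserving content but not exact wording."""
    prompt = (
        "Rewrite the following reasoning in your own words, keeping all "
        "task-relevant content but changing the phrasing:\n\n" + cot
    )
    return call_model(paraphraser, prompt)

def answer_with_scrubbed_cot(task: str, solver: str = "untrusted-solver") -> str:
    cot = call_model(solver, f"Think step by step about: {task}")
    scrubbed = paraphrase_cot(cot)  # destroys encodings that rely on exact token choices
    # The solver only ever sees the paraphrased version of its own reasoning.
    return call_model(solver, f"Task: {task}\n\nReasoning: {scrubbed}\n\nFinal answer:")
```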
Could you please point out the work you have in mind here?
Here’s our current best guess at how the type signature of subproblems differs from e.g. an outermost objective. You know how, when you say your goal is to “buy some yoghurt”, there’s a bunch of implicit additional objectives like “don’t spend all your savings”, “don’t turn Japan into computronium”, “don’t die”, etc? Those implicit objectives are about respecting modularity; they’re a defining part of a “gap in a partial plan”. An “outermost objective” doesn’t have those implicit extra constraints, and is therefore of a fundamentally different type from subproblems.
Most of the things you think of day-to-day as “problems” are, cognitively, subproblems.
Do you have a starting point for formalizing this? It sounds like subproblems are roughly proxies that could be Goodharted if (common-sense) background goals aren't respected. Maybe a candidate formalization of a subproblem, relative to an outermost objective, is a "utility function that closely matches the outermost objective in a narrow domain"?
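To gesture at what I mean slightly more formally (my own notation, just a sketch, not something from your post):

```latex
% Rough sketch in my own notation. U is the outermost objective, D is the
% narrow domain the subproblem lives in, \pi_0 is the status-quo plan, and
% d measures how much the world outside D changes. A subproblem is then
% roughly: optimize a proxy u_sub that tracks U inside D, subject to
% leaving everything outside D approximately alone.
\[
  \max_{\pi} \; u_{\mathrm{sub}}(\pi)
  \quad \text{s.t.} \quad
  d\!\left(\mathrm{outside}_D(\pi),\, \mathrm{outside}_D(\pi_0)\right) \le \epsilon,
\]
\[
  \text{where} \quad
  \left| u_{\mathrm{sub}}(\pi) - U(\pi) \right| \le \delta
  \quad \text{for all } \pi \text{ whose effects are contained in } D.
\]
```

On this picture, Goodharting is what happens when you optimize $u_{\mathrm{sub}}$ hard while dropping the modularity constraint.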
Lots of interesting thoughts, thanks for sharing!
You seem to have an unconventional view about death informed by your metaphysics (suggested by your responses to 56, 89, and 96), but I don’t fully see what it is. Can you elaborate?
The basic idea of 85 is that we generally agree there have been moral catastrophes in the past, such as widespread slavery. Are there ongoing moral catastrophes? I think factory farming is a pretty obvious one. There's a philosophy paper called "The Possibility of an Ongoing Moral Catastrophe" that gives more context.
How is there more than one solution manifold? If a solution manifold is a behavior manifold which corresponds to a global minimum train loss, and we're looking at an overparameterized regime, then isn't there only one solution manifold, which corresponds to achieving zero train loss?