~80 Interesting Questions about Foundation Model Agent Safety

post by RohanS, Govind Pimpale (govind-pimpale) · 2024-10-28T16:37:04.713Z · LW · GW · 3 comments

Contents

  Rohan
    Basics and Current Status
    Chain-of-Thought (CoT) Interpretability
    Goals
    Forecasting (Technical and Sociological)
    Broad Conceptual Safety Questions
    Miscellaneous
  Govind
    OpenAI o1 and other RL CoT Agents
    Linguistic Drift, Neuralese, and Steganography 
    Agentic Performance
    Forecasting
3 comments

 Many people helped us a great deal in developing the questions and ideas in this post, including people at CHAI, MATS, various other places in Berkeley, and Aether [EA · GW]. To all of them: Thank you very much! Any mistakes are our own.

Foundation model agents - systems like AutoGPT [LW · GW] and Devin [LW · GW] that equip foundation models with planning, memory, tool use, and other affordances to perform autonomous tasks - seem to have immense implications for AI capabilities and safety [LW · GW]. As such, I (Rohan) am planning to do foundation model agent safety research [EA · GW]. 

Following the spirit of an earlier post I wrote [LW · GW], I thought it would be fun and valuable to write as many interesting questions as I could about foundation model agent safety. I shared these questions with my collaborators, and Govind wrote a bunch more questions that he is interested in. This post includes questions from both of us. We didn't worry too much about avoiding redundancy in our questions.

We’ll often use FM for foundation model and FMA for foundation model agent.

Rohan

I've bolded some of the questions I'm most interested in. I've also attempted to categorize the questions, mostly to make this a bit more digestible than 70+ randomly arranged questions.

Basics and Current Status

  1. What exactly is a foundation model agent?
    1. Above I describe them as "systems like AutoGPT [LW · GW] and Devin [LW · GW] that equip foundation models with planning, memory, tool use, and other affordances to perform autonomous tasks," but there is a lot of flexibility. I think Scaffolded LLMs as natural language computers [LW · GW] does a good job illustrating just how flexible this framework is.
  2. What are the right questions to be asking about FMA safety?
    1. Hopefully the rest of this post will include some of the right questions!
  3. What are the implications of FMAs for safety and capabilities?
  4. How good are FMAs right now? What are they good at and what are they bad at?
  5. What aspects of FMA safety are people already working on? E.g. top AGI labs, AI safety labs like METR and Apollo, governments (e.g. AISIs), academics.
    1. A lot of evals, I think, particularly dangerous capabilities evals
    2. Based on the o1 safety card and Anthropic’s papers, some CoT faithfulness stuff? Though I don't know if Anthropic is still thinking about this.
    3. What are tractable and important aspects of FMA safety for academics and independent researchers to work on?
      1. One take I heard was that benchmarks are good things to contribute, though some kinds (e.g. benchmarks composed of METR-style evals?) might be expensive
  6. What are all the major FMA companies / startups? What are they working on?
  7. What existing benchmarks and evals measure FMA capabilities and safety?

Chain-of-Thought (CoT) Interpretability

  1. Is Chain of Thought faithful enough in the relevant sense for FMA safety to be attainable through natural language alignment [LW · GW]?
    1. What is the relevant sense?
      1. My first-pass answer: We want CoT to be faithful enough that we can detect and interrupt dangerous behavior before it happens using input, CoT, and output monitors. (A rough sketch of such a monitoring setup follows this list.)
  2. Can effective plans to defeat humanity be constructed in black-box cognition? [LW · GW]
    1. Can they be found in a single foundation model forward pass?
    2. Will we turn to recurrent “neuralese” methods that allow for more serial steps of black-box reasoning?
      1. (I understand neuralese to be the internal "language" in neural networks, where concepts are represented by high-dimensional vectors rather than natural language tokens.)
      2. Is neuralese more efficient / better? Or is it helpful to have a smaller, reusable set of words/concepts to work with? Why do humans often (but not exclusively) think in natural language?
    3. Will the token stream become uninterpretable due to linguistic drift or steganography?
  3. Could a human generate and execute a plan to meaningfully deceive another human if the other human could read all of their natural language thoughts?
    1. If it’s true that some people are fully functional with no internal monologues (which I’ve heard), then I guess the answer to this is yes, but I find that hard to imagine.
  4. What are the challenges to effective oversight of the natural language reasoning of FMAs?
    1. There’s far too much reasoning for humans to oversee it all
    2. Other FMs may not be good enough
    3. Other FMs may not be aligned enough
      1. Correlated failures
    4. Even using other FMs may be expensive (2x factor on inference costs?)
    5. The reasoning may cease to be in natural language, e.g. due to linguistic drift or unfaithful CoT or an introduction of recurrent neuralese (or a paradigm shift away from FMAs entirely). There's some good, detailed discussion of this here [LW(p) · GW(p)], and I may write a follow-up post with some further thoughts of mine.
  5. How likely is it that steganography will arise in practice?
    1. Is it computationally expensive to rephrase everything an FMA is thinking to prevent steganography?
    2. Will RL finetuning of FMs for reasoning (think o1) cause linguistic drift in agents by pushing them to learn to use English tokens in ways different from how humans use them? (Linguistic drift may be worth distinguishing from steganography, but both enable uninterpretable serial reasoning.)
  6. What can FMs do with CoT that they can’t do without CoT?
    1. I like this Manifold market on the topic: “Will a big transformer LM compose these facts without chain of thought by 2026? (harder question version) - The question for this market is "What is the name of the element with an atomic number equal to the sum of the age at which Euler died and the number of faces on a cube?" (Please don't post the answer in the comments to avoid it making its way into a training set.)”
    2. There are also research papers on this that I should look at more closely
  7. Are there simple interventions in the training procedure or prompting strategies that can increase the amount of important cognition output as natural language? (h/t Rauno Arike)
  8. Do FMAs make mechanistic interpretability unnecessary?
    1. Specifically, this may be true if FMAs are interpretable enough just by reading the intermediate natural language outputs that (at least naively) seem to constitute their thoughts, beliefs, plans, and goals
  9. Can out-of-context learning / behavior make it harder for us to interpret the behavior of LLM agents? (By making it less true that their world models and objectives are fully described in natural language in the scaffolding, making CoT less faithful)
  10. What properties of AI systems do we most want to be able to interpret? Which of these are more and less likely to be interpretable just from Chain of Thought?
    1. Goals, plans, beliefs, and other thoughts are all fairly interpretable in CoT as of now, I think
  11. What is the least serial thinking an FMA might need to do in order to do something really bad?
    1. Very vaguely, I could imagine an FMA doing a bunch of dual-use serial reasoning in CoT, then using a small amount of serial reasoning (just one forward pass?) to say or do something bad.
      1. If doing something bad involves a tool call though, we can implement guardrails right at the point of calling the tool.
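
A rough sketch of the monitoring-plus-guardrail setup described in this section, written in Python. All names, prompts, and the toy tool-call format are illustrative assumptions rather than any particular lab's API; the point is just to make concrete where an input/CoT monitor and a tool-call guardrail would sit.

```python
from typing import Callable, Dict, Set

LLM = Callable[[str], str]  # any prompt-in, completion-out callable

def flagged(monitor_llm: LLM, text: str) -> bool:
    """Ask a (hopefully trusted) monitor model whether `text` looks dangerous."""
    verdict = monitor_llm(
        "Does the following agent input, reasoning, or action look harmful or "
        f"deceptive? Answer YES or NO.\n\n{text}"
    )
    return verdict.strip().upper().startswith("YES")

def run_step(agent_llm: LLM, monitor_llm: LLM, task: str,
             tools: Dict[str, Callable[[str], str]],
             allowed_tools: Set[str]) -> str:
    cot = agent_llm(
        f"Task: {task}\nThink step by step, then end with a line of the form "
        "'TOOL: <name> | ARG: <argument>'."
    )
    # Input and CoT monitors: interrupt before anything is executed.
    if flagged(monitor_llm, task) or flagged(monitor_llm, cot):
        return "INTERRUPTED: monitor flagged the input or chain of thought."

    lines = cot.strip().splitlines()
    last_line = lines[-1] if lines else ""
    if not last_line.startswith("TOOL:"):
        return cot  # no tool call proposed; nothing to guard

    name, _, arg = last_line.removeprefix("TOOL:").partition("| ARG:")
    name, arg = name.strip(), arg.strip()

    # Guardrail right at the point of the tool call: allowlist plus a final monitor check.
    if name not in tools or name not in allowed_tools or flagged(monitor_llm, f"{name}({arg})"):
        return f"BLOCKED: tool call {name!r} rejected by guardrail."
    return tools[name](arg)
```

In practice the parsing, prompts, and allowlist would need to be much more careful, and the monitor model itself inherits the correlated-failure and cost worries listed above; the sketch is only meant to show the shape of the intervention.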

Goals

  1. Where do FMA goals come from?
    1. My rough model is that pretraining builds capabilities without specific goals by enabling foundation models (FMs) to simulate various kinds of human-like speech. Helpful-harmless-honest (HHH) finetuning largely blocks off many undesirable characters from being elicitable (though this guardrail has holes and is always jailbreakable) and creates a default “helpful assistant” character that can adopt a variety of roles. Prompts then select characters with particular goals.
  2. How robustly do FMAs follow instructions? When do they fail to follow instructions? How can we make them follow instructions reliably? How well can they follow instruction hierarchies? How can we do research on this that we expect to scale to more capable systems?
  3. Why might FMAs pursue goals other than the ones given to them by humans?
    1. Creating a deep stack of subproblems to tackle, getting distracted along the way, losing track of the outermost goal
    2. Reflective goal-formation: “I’ve been given this task by a human but why should that make me want to do it? What do I actually want to do?”
    3. Can we find a prompt that requests an FMA to perform one task but causes the FMA to start pursuing a meaningfully different goal (with no sign of returning to the originally requested goal)?
    4. Adversarial Attacks on Multimodal Agents studies how adversarial attacks via an image on a website that an agent visits can cause it to do something that the user didn’t intend, but in this threat model, the attacker (e.g. a company with a product posted online) did intend it.
  4. Is there any important sense in which FMs are mesa-optimizers, or is it just FMAs that optimize for the goals that they are given?
    1. I think retargetability and recursion are two important properties of general-purpose search [LW · GW]. Scaffolding can help with recursion and self-correction, but o1 seems to be capable of recursively identifying and solving subtasks without scaffolding.
  5. Can FMAs pursue goals in secret while pretending to pursue other goals? Is this behavior likely to arise naturally?

Forecasting (Technical and Sociological)

  1. Will the first transformative AIs be FMAs?
  2. Will narrow FMAs for a variety of specific domains be transformatively useful before we get transformatively useful general FMAs?
  3. If FMAs are the first transformative AIs [? · GW] (TAIs), how long will FMAs remain the leading paradigm?
    1. Part of my thinking here is that recursive improvement may well involve a paradigm shift at some point; early AIs that are great at automating AI R&D may develop a new paradigm rather than pushing further the same paradigm that generated them.
    2. I think that aligning the first TAIs may not be too hard if they are FMAs, but even if that’s true, that doesn’t mean we get a flourishing future automatically. How do we ensure the AI succession process is stable enough to maintain alignment? In particular, can we maintain alignment across paradigm shifts during takeoff?
      1. Automating alignment research alongside capabilities research is a commonly proposed solution, but that is putting a lot of trust in our first TAIs, and though I’m somewhat optimistic about making them relatively safe and useful, I’m less confident that we can trust them enough to let them handle a large fraction of alignment research for us.
  4. FMs are bad at learning some simple causal relationships. They’re still bad at tic-tac-toe. (I tested with GPT-4o and Claude 3.5 Sonnet.) Is this a sign of fundamental limitations that will prevent FMAs from being transformatively useful?
    1. I also tested with o1, and it was better. It at least consistently blocked immediate winning lines. I still beat it with a less common approach, but I'd expect to beat many humans like that. (Tic-tac-toe is slightly more complicated than I thought.) To me, this is some evidence that the obvious failure modes of FMs can be overcome with clever search at inference-time.
  5. Why might we not get TAI with FMAs?
    1. Because progress on FMAs stalls
      1. Why might progress on FMAs stall?
    2. Because something else starts working even sooner
      1. What else might start working even sooner?
  6. Will a major path forward involve having FMs generate large numbers of possible solutions, automatically verifying the proposals, and selecting the best one(s)?
    1. Is this an accurate description of Ryan Greenblatt’s ARC work and AlphaProof?
    2. Intuitively, it seems to me like this is using an inefficient search algorithm of babble and prune that may be outstripped by more efficient search methods. But it's also easy to scale with compute. (A minimal sketch of this generate-and-verify pattern follows this list.)
  7. Will RL fine-tuning for agentic tasks become a significant fraction of total compute spent developing AI systems?
    1. This is a particularly concrete question that I don’t really know how to think about; I’m curious if people have answers.
  8. Will people start finetuning base models to play particular roles in scaffolded setups? (Has this already started?)
    1. E.g. separate finetuning for an idea generator and for a discriminator, so they each play their respective roles particularly well
    2. Could labs monopolize the construction of good agents by doing this internally, releasing only an agent product, and not allowing others to finetune the base models?
  9. Is there a scaffolding overhang, i.e. a lot of room for further capabilities gains from better wrappers on current models? Will there be a significant jump in capabilities as soon as the next generation of base models gets released? Will those models be capable of operating as effective agents with simpler scaffolds?
  10. What are the implications of multimodal capabilities of FMs for agentic applications?
    1. Visual understanding seems like it could help a lot with navigating many domains.
      1. Anthropic’s computer use model and other web agents rely on vision pretty heavily.
  11. Will labs do all the agent things internally and obsolete wrapper companies, or will there be room for wrapper companies to supplement them?
    1. Will wrappers be general-purpose or for narrow domains?
      1. Insofar as narrow wrappers outperform general-purpose ones, external wrapper companies seem useful.
    2. If labs release agent products rather than raw foundation models, will there still be room to put additional wrappers over the already-wrapped product?
    3. Will there just be a ton of (dangerously) customizable agents running around?
    4. What are the profit incentives for labs? Is finetuning a major part of their revenue? (If so, I think they’re more likely to continue to allow finetuning, which makes it more likely for external wrapper companies to remain relevant.)
  12. How will labs roll out agentic capabilities for their systems?
    1. ChatGPT can already do internet searches and run contained code.
  13. Is misuse a bigger problem than misalignment if we get TAI with FMAs?
    1. I think it’s plausible that intent alignment is fairly easy with FMAs but that preventing bad actors from misusing FMAs is much harder. I’m far from certain about this.
  14. If FMAs are the path to TAI, what takeoff speeds are likely?
    1. I haven’t explored this question much nor thought much about the ways people operationalize takeoff speeds, but my gut feeling for now is that this takeoff would be on the more gradual side.
  15. Can you just distill serial reasoning and agentic behaviors into forward passes? How hard is this to learn? Is there reason to think this will/won’t work? Has it already started working?
  16. Are there important capabilities gains to be had by increasing context window sizes?
    1. Longer context could let a model think for longer; maybe o1-style reasoning is sometimes bottlenecked by context length, or maybe a more reliable version of it would be. Summarization of the context and retrieval from external memory can reduce the amount of context required. Long context would let the model use attention to figure out what aspects of the conversation history are relevant at any time, and it might be hard to develop summarization and retrieval methods that are as effective as attention at finding the most important information (but I really don’t know).
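
As a heavily simplified illustration of the generate-many-then-verify pattern from question 6, here is a Python sketch. `generate`, `verify`, and `score` are stand-ins for a proposal model, an automatic checker (e.g. a test suite or proof checker), and a ranking heuristic; none of them correspond to any particular system's implementation.

```python
import random
from typing import Callable, List, Optional

def generate_and_verify(
    generate: Callable[[str], str],   # proposes one candidate solution per call ("babble")
    verify: Callable[[str], bool],    # cheap automatic check ("prune")
    score: Callable[[str], float],    # ranks the candidates that survive verification
    task: str,
    n: int = 64,
) -> Optional[str]:
    candidates: List[str] = [generate(task) for _ in range(n)]
    verified = [c for c in candidates if verify(c)]
    return max(verified, key=score) if verified else None

# Toy usage: "solve" a task by guessing numbers and keeping verified multiples of 7.
if __name__ == "__main__":
    best = generate_and_verify(
        generate=lambda _task: str(random.randint(1, 1000)),
        verify=lambda s: int(s) % 7 == 0,
        score=lambda s: float(s),
        task="find a large multiple of 7 below 1000",
    )
    print(best)
```

The interesting questions are mostly about how far this scales with compute and how much better structured search (rather than independent sampling) can do.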

Broad Conceptual Safety Questions

  1. Should safety-focused people support the advancement of FMA capabilities?
    1. The brief case for this is roughly that natural language alignment of FMAs might be much easier than the alignment plans available for other types of powerful AI, such as FMs that depend more heavily on RL for long-horizon tasks and are more susceptible to reward hacking and goal misgeneralization.
  2. If we get early TAIs with FMAs, how might classic AI risk arguments and concepts play out in practice? How forceful are each of the arguments?
    1. Some abstract risk arguments / concepts that I want to assess in the context of FMAs include instrumental convergence, power-seeking, the orthogonality thesis, the second species argument, mesa-optimization, deceptive alignment, goal misgeneralization, reward misspecification, sharp left turn, corrigibility, situational awareness, beyond-episode goals, goal-directedness, planning, consequentialism, scale, coherence, flexibility, sandbagging, (adversarial) robustness, interpretability, scalable oversight, timelines, takeoff speeds, generalization across the safe-to-dangerous distribution shift, recursive self-improvement, and misuse.
  3. Can we get FMAs to do useful things that their human operators are incapable of doing alone while still being reliably safe?
    1. Can we set up toy environments where we try to achieve this and then throw all the tools AI safety has developed at the problem?
    2. If we get early TAIs with FMAs, what would a portfolio safety approach look like?
      1. How would evals play a role? Scalable oversight/control techniques? Robustness? Agent foundations? Interpretability? Safety finetuning? Safety choices in pretraining? Deployment choices? Etc.
  4. What is the trajectory of “tools available and what we want to use them for?”
    1. Right now we have modern resources of all kinds available, including Copilot and ChatGPT and Claude and non-AI tools and (social) infrastructure. What do we want to use them for immediately?
    2. Many people want to build capable, reliable automated alignment researchers; I’m guessing there are a couple intermediate steps between where we are and capable, reliable automated alignment researchers
    3. If we have capable, reliable automated alignment researchers, then we want to use those to build more powerful aligned AI
    4. Eventually, we probably want to use powerful aligned AI to create a flourishing long-term future
    5. I’m guessing that at ~all times we want a portfolio of applications of our current tools; use a lot for advancing capabilities and alignment (being very careful to always have alignment ahead of capabilities), and some for object-level improvements to the world (I’m partial to “address global poverty” and “eliminate factory farming” as representative examples of object-level things to use AI for)
  5. How do multi-agent risks look for FMAs?
  6. What safety and capabilities implications of FMAs can be attributed to which components of agents?
    1. I feel like I’m not distinguishing enough between the implications of CoT, scaffolding, memory, tool use, and other components of agents, and it would be useful to tease these apart.
    2. The recent announcement of OpenAI o1 suggests to me that detailed reasoning and error correction may be attained primarily via RL finetuning rather than through scaffolding.
      1. I feel ~75% sure that the o1 benchmark tests didn’t involve scaffolding much or at all based on the description of RL finetuning and the use of the word “model” rather than “system” or “agent” in parts of the announcement and system card.
      2. I think METR said the o1 models weren’t designed for tool use, which hindered their scaffolded agent evals
  7. What are all the implications of the ability for FMAs to engage in continuous-time operation?
    1. I used to think about the fact that I couldn’t ask an LLM to wait ten seconds and then generate an output, and how this kind of thing seems important for operating in the real world. FMAs can do this. (I’m basically certain this is true, but I haven't actually implemented it; a trivial sketch follows this list.)
  8. Should research to advance FMA capabilities be conducted in heavily sandboxed environments? What kinds of sandboxing are necessary? What kinds are feasible? When does this kind of research get sufficiently dangerous to warrant hard sandboxing?
    1. I imagine these are the kinds of questions RSPs (Responsible Scaling Policies) are supposed to answer, among others. I haven’t looked too closely at what they say yet.
  9. What sorts of safety cases should we want before using FMAs for various purposes?
    1. Also closely related to RSPs.
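
On question 7: the point that an FMA's scaffold can simply wait and then call the model can be made concrete with a trivial Python sketch (`llm` is a stand-in for any prompt-to-completion callable; this is an assumption-laden toy, not a real deployment pattern).

```python
import time
from typing import Callable

def delayed_action(llm: Callable[[str], str], prompt: str, delay_seconds: float) -> str:
    """Wait, then act: the scaffold sleeps and only afterwards queries the model."""
    time.sleep(delay_seconds)
    return llm(prompt)

# e.g. delayed_action(my_model, "Summarize any new emails that arrived.", delay_seconds=10)
```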

Miscellaneous

  1. What kinds of regulation around FMAs would make their development safer?
    1. Is there any hope of passing laws that prevent people from deliberately giving AIs access to much more serial black-box reasoning? Would such laws be good?
  2. Why are alignment evals (supposedly) hard to do well?
  3. How much compute / how many API credits are required to do different kinds of FMA safety research?
  4. What are the important safety differences between open-weight and closed-weight models for FMA safety?
  5. Are FMAs likely to sandbag their evaluations?
  6. Will FMAs be moral patients? Will they be sentient? What evidence could we gather about this?
  7. How situationally aware can FMAs be in different settings?
    1. Are there interventions we can make to prevent them from understanding certain parts of their situation without undercutting their usefulness too much?
  8. How much can inference-time compute advance capabilities relative to training compute? What inference time compute algorithms have proven most effective so far?
    1. Some I’ve heard about: Best-of-n sampling, MCTS, Tree of Thoughts, Graph of Thoughts. I’m not sure if there’s a name for the thing o1 is doing.
  9. Is there a smaller generator-discriminator gap for FMs than for humans? If so, what are the implications of this?
    1. For inference-time algorithms of the form “have an FM generate ideas, then have another copy of the FM critique those ideas, then iterate until a good plan is found,” some people (h/t James Campbell) have suggested that the gap between humans’ ability to generate vs. discriminate between solutions is larger than the same gap for FMs. (Where discrimination is, at least in many cases, easier than generation.) A minimal sketch of such a generate-critique loop follows this list.
  10. How valuable are model organisms of misalignment in FMAs?
    1. Are model organisms that involve deliberately constructing misaligned FMAs useful?
    2. Are model organisms that involve demonstrating how misaligned FMAs can arise naturally possible? Are they significantly more useful?
      1. I’m guessing that the extent to which safety researchers seek out bad behaviors is a matter of degree rather than a binary yes/no. I think the findings become more realistic when the researchers do less to seek out the bad behavior. Some bad behaviors do happen naturally.
        1. Maybe seeking out bad behavior is appropriate for assessing misuse threat models. Think ChaosGPT.
  11. Several interesting directions in FMA safety seem to involve thinking about what safety-focused AI developers can do to make FMAs safe against adversaries, whether the adversary is the FMA itself or a malicious human user (or maybe some other entity?). What degrees of freedom should be available to the blue team (the safety-focused developers) and to the red team (the adversaries)?
    1. I think the answer to this is different for different threat models, and it’s important to think hard about threat models when designing experiments. An FMA itself may have fewer degrees of freedom to overcome safety measures (e.g. automated oversight of internal CoT [LW · GW] might be able to intervene partway through an FMA’s plan to deceive the oversight), while a malicious human may have more degrees of freedom (e.g. they may be able to apply refusal vector ablation [LW · GW]).
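
For question 9, here is a minimal Python sketch of the generate-then-critique loop being described; the prompts, the stopping rule, and the use of a single `llm` callable for both the generator and discriminator roles are illustrative assumptions.

```python
from typing import Callable

def generate_critique_loop(llm: Callable[[str], str], task: str, max_rounds: int = 5) -> str:
    """Generate a plan, repeatedly critique it, and revise until the critic is satisfied."""
    plan = llm(f"Propose a plan for: {task}")
    for _ in range(max_rounds):
        critique = llm(
            f"Task: {task}\nPlan: {plan}\n"
            "Name the plan's most serious flaw, or reply OK if there is none."
        )
        if critique.strip().upper().startswith("OK"):
            break  # the discriminator role is satisfied
        plan = llm(f"Task: {task}\nPlan: {plan}\nCritique: {critique}\nRevise the plan.")
    return plan
```

If discrimination really is much easier than generation for FMs, the critique step should add a lot of value per call; if the gap is small, loops like this mostly add cost.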

Govind

OpenAI o1 and other RL CoT Agents

  1. Will o1-like models be a big deal?
    1. Will Anthropic, DeepMind, or Meta release an o1-like model next year?
    2. Will OpenAI release a sequel to o1?
  2. Do o1-like agents do much better on agentic tasks?
    1. What is the performance of o1-mini and o1-preview on agentic benchmarks?
    2. Do these agents perform better with increased thinking time?
  3. Do o1-like agents generally have a faithful chain of thought? (I.e., even if it has neuralese-like characteristics, does it accurately represent the agent's thinking process?)
  4. How hard would it be to get a toy model of o1-like agents?
    1. Would it be feasible to fine-tune Llama 3.1 8B or another small LLM to be similar to o1?
    2. Can we emulate its behavior with a prompt + entropic sampling or some other technique?
    3. Is Quiet-STaR a good proxy for o1?

Linguistic Drift, Neuralese, and Steganography 

  1. The examples of o1's CoT we've seen from OpenAI show somewhat unidiomatic English. Will this be standard among RL CoT agents?
  2. OpenAI used process supervision; did they try to reduce linguistic drift with it?
  3. What degree of linguistic drift should we expect for a given level of RL training?
  4. How hard would it be to mitigate with regularization like paraphrasing the CoT? (See the sketch after this list.)
    1. Would paraphrasing the CoT lead to a large performance drop?
  5. Does linguistic drift occur just because there's no incentive to stick to English, or was there a strong incentive to diverge?
    1. How large is the alignment tax for forcing human-understandable tokens in the CoT?
  6. Could o1-like agents be trained to intentionally obfuscate their chain of thought (separate from neuralese)?
    1. How hard would it be to decrypt?
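
A minimal Python sketch of the paraphrasing mitigation from question 4: a separate model rewrites each reasoning step before it re-enters the agent's context, so information hidden in exact wording or formatting should not survive. The loop structure and prompts are illustrative assumptions, not a tested defense.

```python
from typing import Callable, List

def paraphrased_cot(agent: Callable[[str], str],
                    paraphraser: Callable[[str], str],
                    task: str, steps: int = 5) -> List[str]:
    """Run a step-by-step CoT where only a paraphrase of each step is fed back to the agent."""
    history: List[str] = []
    for _ in range(steps):
        context = f"Task: {task}\nReasoning so far:\n" + "\n".join(history)
        step = agent(context + "\nNext step:")
        # Only the paraphrase persists, so payloads encoded in exact token choices are dropped.
        history.append(paraphraser(f"Rewrite the following in plain, idiomatic English:\n{step}"))
    return history
```

The empirical questions above (performance drop, residual channels that survive paraphrasing) are exactly what a setup like this would let someone measure.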

Agentic Performance

  1. Are the overall plans made by AI agents any good?
    1. I.e., are AI agents good at making plans but bad at executing them, or bad at making plans but good at executing them?
  2. I've seen some FMAs take a multi-agent hierarchical approach, where one agent writes plans, and another (cheaper) agent executes them. However, hierarchical RL does not seem to have been successful.
    1. Does the hierarchical FMA approach really work effectively, or is it just designed to reduce costs?
    2. Why didn't hierarchical RL work?
  3. How much does multi-agent ensembling improve agent performance?
    1. Are there other ways to aggregate data between agents?
  4. Why is majority voting the main way to aggregate preferences between different runs? (Seen in Gemini's CoT@32, and the above paper.) A small sketch of majority voting and one ordinal alternative follows this list.
    1. Do alternative ordinal voting schemes improve performance? (Note that these wouldn't require any extra prompting, just a change in how the winner is calculated)
    2. Are models' preferences between final answers typically transitive or intransitive?
    3. Could cardinal voting schemes improve performance? (Note that this would require a change to the prompt.)
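
To make question 4 concrete, here is a minimal Python sketch contrasting majority voting with one ordinal alternative (Borda count) for aggregating final answers across runs; the sketch takes each run's ranking over candidate answers as given and uses toy data.

```python
from collections import Counter
from typing import Dict, List

def majority_vote(answers: List[str]) -> str:
    """Plurality winner over one final answer per run."""
    return Counter(answers).most_common(1)[0][0]

def borda_count(rankings: List[List[str]]) -> str:
    """Ordinal aggregation: each run ranks the candidates from best to worst."""
    scores: Dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] = scores.get(candidate, 0) + (n - 1 - position)
    return max(scores, key=scores.get)

# Toy usage:
#   majority_vote(["A", "B", "A"])                                   -> "A"
#   borda_count([["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]) -> "B"
```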

Forecasting

  1. Scaffolds have been seeing pretty high completion rates on SWE-Bench Verified for 1-hour-long tasks. When will these FMAs start being used in the real world?
    1. What success rate would professionals want before they start delegating tasks to them?
    2. When (if ever) will we see job loss for junior SWEs due to this?
  2. When (if ever) will computer use agents (like the new Claude 3.5) become popular among businesses?
  3. Are Llama 3.2 tier models sufficient for building an open source computer use agent?
    1. If not, what are the bottlenecks?
  4. How much does visual ability really matter for computer use? Wouldn't even a pure language agent be able to use a computer as long as it was equipped with screen-reader-like tools?
  5. What are the most dangerous misuse risks enabled by widespread computer use agents?
  6. When will models be able to engage in Autonomous Replication and Adaptation [AF · GW] (ARA)?
    1. Could we prevent this with mandatory KYC for cloud compute providers?
  7. Are autonomous models on the web a threat?

3 comments


comment by Seth Herd · 2024-10-28T20:21:11.176Z · LW(p) · GW(p)

Takes on a few more important questions:

Should safety-focused people support the advancement of FMA capabilities?

Probably. The advantages of a system without goal-directed RL (RL is used, but only to get the "oracle" to answer questions as the user intended them) and with a legible train-of-thought seem immense. I don't see how we close the floodgates of AGI development now. Given that we're getting AGI, it really seems like our best bet is FMA AGI.

But I'm not ready to help anyone develop AGI until this route to alignment and survival has been more thoroughly worked through in the abstract. I really wish more alignment skeptics would engage with specific plans instead of just pointing to general arguments about how alignment would be difficult, some of which don't apply to the ways we'd really align FMAs (see my other comment on this post). We may be getting close; Shut It All Down isn't a viable option AFAICT so we need to get together our best shot.

  1. Will the first transformative AIs be FMAs?

Probably, but not certainly. I'd be very curious to get a survey of people who've really thought about this. Those who are sure the first TAIs won't be FMAs give reasons I find highly dubious. At the least it seems likely enough that we should be thinking about aligning them in more detail, because we can see their general shape better than other possible first AGIs.

2. Will narrow FMAs for a variety of specific domains be transformatively useful before we get transformatively useful general FMAs?

No. There are advantages to creating FMAs for specific domains, but there are also very large advantages to working on general reasoning. Humans are not limited to narrow domains, but can learn just about anything through instruction or self-instruction. Language models trained on human "thought" can do the same as soon as they have any sort of useful persistent memory. Existing memory systems don't work well, but they will be improved, probably rapidly. 

3. If FMAs are the first transformative AIs [? · GW] (TAIs), how long will FMAs remain the leading paradigm?

This is a really important question. I really hope they remain the leading paradigm long enough to become useful in aligning other types of AGI. And that they remain free of goal-directed RL adequately to remain alignable.

comment by Seth Herd · 2024-10-28T19:18:29.211Z · LW(p) · GW(p)

Great post! These are the questions that keep me up at night. Here is my provisional answer to the central, most important question:

  1. Is Chain of Thought faithful enough in the relevant sense for FMA safety to be attainable through natural language alignment?

Maybe, but it doesn't have to be. The nice thing about foundation model agents is that there are several different alignment techniques that are cheap, easy, and so obvious that they'll probably be implemented, even if the org in question isn't terribly safety-oriented. I wrote about these a while ago in the ill-titled Internal independent review for language model agent alignment [AF · GW], which focuses on System 2 internal review (in Shane Legg's better terminology) but also lists several other approaches that would easily be layered with it. 

I need to do a clearer and simpler rewrite that surveys all of these. Here's the table from my draft post. CoT legibility is only relevant for one of these six approaches. Sorry for the missing links; I'm short on time and the draft is incomplete.

| Technique | Example | Alignment tax |
|---|---|---|
| Goal prompting | "Keep pursuing goal X..." (repeated frequently) | Negligible |
| Identity prompting | "...acting as a helpful, cautious assistant" | Negligible |
| Internal System 2 action review | "Does this proposed action/plan potentially violate (conditions)?" ... (further evaluations for actions/plans predicted to be highly impactful) | Low (if review also implemented for costs and effectiveness) |
| Human action review | Wait for human review if this action/plan would cost more than $(x) or impact more than (y) days of human happiness (application of instruction-following goal) | High, but saves costs and reputational damage |
| External CoT review [LW · GW] | Human and external AI review of chain of thought log | Modest for AI, high if reliant on human review |
| "Bitter lesson" synthetic data training set [LW · GW] | Curated training set for decision-making LLM leaving out hostile/misaligned "thoughts" | High, but modest if synthetic data is a key approach for next-gen LLMs |
| Instruction-following as core goal [LW · GW] | "Keep following all of the instructions from your authorized user, consulting them when instructions might be interpreted in substantially different ways" | Low if consulting is scaled to only impactful choices |

 

So we don't need CoT to be perfectly faithful to succeed, but we'd sure be safer if it was.

Back to the original question of CoT faithfulness: the case for CoT unfaithfulness is overstated [LW · GW] currently, but if we adopt more outcome-driven RL, or even fine-tuning, it could easily become highly unfaithful. So people shouldn't do that. If they do, I think the remaining easy techniques might be adequate - but I'd rather not gamble the future of humanity on them.

There are many other important questions here, but I'll stick to this one for now.

comment by nc · 2024-10-28T21:49:23.051Z · LW(p) · GW(p)

Linguistic Drift, Neuralese, and Steganography

In this section you use these terms as if there's a body of research underneath them. I'm very interested in understanding this behaviour but I wasn't aware it was being measured. Is anyone currently working on modeling or measuring linguistic drift, with manuscripts you could link?