Posts

How to replicate and extend our alignment faking demo 2024-12-19T21:44:13.059Z
Alignment Faking in Large Language Models 2024-12-18T17:19:06.665Z
A toy evaluation of inference code tampering 2024-12-09T17:43:40.910Z
The case for unlearning that removes information from LLM weights 2024-10-14T14:08:04.775Z
Is cybercrime really costing trillions per year? 2024-09-27T08:44:07.621Z
An issue with training schemers with supervised fine-tuning 2024-06-27T15:37:56.020Z
Best-of-n with misaligned reward models for Math reasoning 2024-06-21T22:53:21.243Z
Memorizing weak examples can elicit strong behavior out of password-locked models 2024-06-06T23:54:25.167Z
[Paper] Stress-testing capability elicitation with password-locked models 2024-06-04T14:52:50.204Z
Open consultancy: Letting untrusted AIs choose what answer to argue for 2024-03-12T20:38:03.785Z
Fabien's Shortform 2024-03-05T18:58:11.205Z
Notes on control evaluations for safety cases 2024-02-28T16:15:17.799Z
Protocol evaluations: good analogies vs control 2024-02-19T18:00:09.794Z
Toy models of AI control for concentrated catastrophe prevention 2024-02-06T01:38:19.865Z
A quick investigation of AI pro-AI bias 2024-01-19T23:26:32.663Z
Measurement tampering detection as a special case of weak-to-strong generalization 2023-12-23T00:05:55.357Z
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem 2023-12-16T05:49:23.672Z
AI Control: Improving Safety Despite Intentional Subversion 2023-12-13T15:51:35.982Z
Auditing failures vs concentrated failures 2023-12-11T02:47:35.703Z
Some negative steganography results 2023-12-09T20:22:52.323Z
Coup probes: Catching catastrophes with probes trained off-policy 2023-11-17T17:58:28.687Z
Preventing Language Models from hiding their reasoning 2023-10-31T14:34:04.633Z
Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation 2023-10-23T16:37:45.611Z
Will early transformative AIs primarily use text? [Manifold question] 2023-10-02T15:05:07.279Z
If influence functions are not approximating leave-one-out, how are they supposed to help? 2023-09-22T14:23:45.847Z
Benchmarks for Detecting Measurement Tampering [Redwood Research] 2023-09-05T16:44:48.032Z
When AI critique works even with misaligned models 2023-08-17T00:12:20.276Z
Password-locked models: a stress case for capabilities evaluation 2023-08-03T14:53:12.459Z
Simplified bio-anchors for upper bounds on AI timelines 2023-07-15T18:15:00.199Z
LLMs Sometimes Generate Purely Negatively-Reinforced Text 2023-06-16T16:31:32.848Z
How Do Induction Heads Actually Work in Transformers With Finite Capacity? 2023-03-23T09:09:33.002Z
What Discovering Latent Knowledge Did and Did Not Find 2023-03-13T19:29:45.601Z
Some ML-Related Math I Now Understand Better 2023-03-09T16:35:44.797Z
The Translucent Thoughts Hypotheses and Their Implications 2023-03-09T16:30:02.355Z
Extracting and Evaluating Causal Direction in LLMs' Activations 2022-12-14T14:33:05.607Z
By Default, GPTs Think In Plain Sight 2022-11-19T19:15:29.591Z
A Mystery About High Dimensional Concept Encoding 2022-11-03T17:05:56.034Z
How To Know What the AI Knows - An ELK Distillation 2022-09-04T00:46:30.541Z
Language models seem to be much better than humans at next-token prediction 2022-08-11T17:45:41.294Z
The impact you might have working on AI safety 2022-05-29T16:31:48.005Z
Fermi estimation of the impact you might have working on AI safety 2022-05-13T17:49:57.865Z

Comments

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2025-01-14T00:42:37.593Z · LW · GW

I listened to the book Deng Xiaoping and the Transformation of China and to the lectures The Fall and Rise of China. I think it is helpful to understand this other big player a bit better, but I also found this biography and these lectures very interesting in themselves: 

  • The skill ceiling on political skills is very high. In particular, Deng's political skills are extremely impressive (according to what the book describes):
    • He dodges bullets all the time to avoid falling into total disgrace (e.g. by avoiding being too cocky when he is in a position of strength, by taking calculated risks, and by doing simple things like never writing down his thoughts)
    • He makes amazing choices of timing, content and tone in his letters to Mao
    • While under Mao, he solves tons of hard problems (e.g. reducing factionalism, starting modernization) despite the enormous constraints he worked under
    • After Mao's death, he helps society make drastic changes without going head-to-head against Mao's personality cult
    • Near his death, despite being out of office, he salvages his economic reforms through a careful political campaign
    • According to the lectures, Mao is also a political mastermind who pulled off coming to and staying in power despite terrible odds. The Communists were really not supposed to win the civil war (their army was minuscule, and if it weren't for weird WW2 dynamics that they played masterfully, they would have lost by a massive margin), and Mao was really not supposed to be able to remain powerful until his death despite the Great Leap and the success of reforms.
    • --> This makes me appreciate what it is like to have extremely strong social and political skills. I often see people's scientific, mathematical or communication skills being praised, so it is interesting to remember that other skills exist and also have a high ceiling. I am not looking forward to the scary worlds where AIs have these kinds of skills.
  • Debates are weird when people value authority more than arguments. Deng's faction after Mao's death banded behind the paper Practice is the Sole Criterion for Testing Truth to justify rolling out things Mao did not approve of (e.g. markets, pay as a function of output, elite higher education, ...). I think it is worth a quick skim. It is very surprising how a text that defends a position so obvious to the Western reader does so by relying entirely on the canonical words and actions from Mao and Marx without making any argument on the merits. It makes you wonder if you have similar blind spots that will look silly to your future self.
  • Economic growth does not prevent social unrest. Just because the pie grows doesn't mean you can easily make everybody happy. Some commentators expected the CCP to be significantly weakened by the 1989 protests, and without military action that may have happened. 1989 came at a time when China's GDP had been growing by 10% a year for 10 years, and it would continue growing at that pace for another ten.
  • (Some) revolutions are horrific. They can go terribly wrong, both because of mistakes and conflicts:
    • Mistakes: the Great Leap is basically well explained by mistakes: Mao thought that engineers were useless and that production could increase without industrial centralization and without individual incentives. It turns out he was badly wrong. He mistakenly distrusted people who warned him that the reported numbers were inflated. And so millions died. Large changes are extremely risky when you don't have good enough feedback loops, and you will easily cause catastrophe without bad intentions. (~according to the lectures)
    • Conflicts: the Cultural Revolution was basically Mao using his cult of personality to gain back power by leveraging the youth to bring down the old CCP officials and supporters while making sure the Army didn't intervene (and then sending the youth that brought him back to power to the countryside) (~according to the lectures)
  • Technology is powerful: if you dismiss the importance of good scientists, engineers and other technical specialists, a bit like Mao did during the great leap, your dams will crumble, your steel will be unusable, and people will starve. I think this is an underrated fact (at least in France) that should make most people studying or working in STEM proud of what they are doing.
  • Societies can be different. It is easy to think that your society is the only one that can exist. But in the society that Deng inherited:
    • People were not rewarded based on their work output, but based on the total outcome of groups of 10k+ people
    • Factory managers were afraid of focusing too much on their factory's production
    • Production targets were set not based on demands and prices, but based on state planning
    • Local authorities collected taxes and exploited their position to extract resources from poor peasants
    • ...
  • Governments close to you can be your worst enemies. USSR-China relations were often much worse than US-China ones. This was very surprising to me. But I guess that having your neighbor push for reforms while you push for radicalism, dismantle a personality cult like the one you are hoping will survive for centuries, and mass troops along your border because it is (justifiably?) afraid you'll do something crazy really doesn't make for great relationships. There is something powerful in the fear that an entity close to you sets a bad example for your people.
  • History is hard to predict. The author of the lectures ends them by making some terrible predictions about what would happen after 2010, such as expecting the easing of US-China relations and expecting China to become more democratic before 2020. He did not express much confidence in these predictions, but it is still surprising to see him so directionally wrong about where China's future was heading. The author also acknowledges past failed predictions, such as the outcome of the 1989 protests.
  • (There could have been lessons to be drawn about how great markets are, but these books are not great resources on the subject. In particular, they do not give elements to weigh the advantages of prosperity against the problems of markets (inflation, uncertainty, inequalities, changes in values, ...) that caused so much turmoil under Deng and his successors. My guess is that it's obviously net positive given how bad the situation was under Mao and how the USSR failed to create prosperity, but this is mostly going off vague historical vibes, not based on the data from these resources.)

Both the lectures and the book were a bit too long, especially the book (which is over 30 hours long). I still recommend the lectures if you want to have an overview of 20th-century Chinese history, and the book if you want to get a better sense of what it can look like to face a great political strategist.

Comment by Fabien Roger (Fabien) on How to replicate and extend our alignment faking demo · 2025-01-01T17:53:53.810Z · LW · GW

This seems to be a formatting error on Github? I don't see this in the raw output (see screenshot of the relevant part of the raw notebook) or when pulling the code.

I don't know how to fix this.

Comment by Fabien Roger (Fabien) on How to replicate and extend our alignment faking demo · 2024-12-27T10:26:46.674Z · LW · GW

Done!

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-12-25T21:58:14.617Z · LW · GW

Here are the 2024 AI safety papers and posts I like the most.

The list is very biased by my taste, by my views, by the people who had time to argue to me that their work is important, and by the papers that were salient to me when I wrote this list. I am highlighting the parts of papers I like, which is also very subjective.

Important ideas - Introduces at least one important idea or technique.

★★★ The intro to AI control (The case for ensuring that powerful AIs are controlled) 
★★ Detailed write-ups of AI worldviews I am sympathetic to (Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI; Situational Awareness)
★★ Absorption could enable interp and capability restrictions despite imperfect labels (Gradient Routing)
★★ Security could be very powerful against misaligned early-TAI (A basic systems architecture for AI agents that do autonomous research) and (Preventing model exfiltration with upload limits) 
★★ IID train-eval splits of independent facts can be used to evaluate unlearning somewhat robustly (Do Unlearning Methods Remove Information from Language Model Weights?) 
★ Studying board games is a good playground for studying interp (Evidence of Learned Look-Ahead in a Chess-Playing Neural Network; Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models)
★ A useful way to think about threats adjacent to self-exfiltration (AI catastrophes and rogue deployments) 
★ Micro vs macro control protocols (Adaptive deployment of untrusted LLMs reduces distributed threats)
★ A survey of ways to make safety cases (Safety Cases: How to Justify the Safety of Advanced AI Systems) 
★ How to make safety cases vs scheming AIs (Towards evaluations-based safety cases for AI scheming)
★ An example of how SAEs can be useful beyond being fancy probes (Sparse Feature Circuits) 
★ Fine-tuning AIs to use codes can break input/output monitoring (Covert Malicious Finetuning)

Surprising findings - Presents some surprising facts about the world

★★ A surprisingly effective way to make models drunk (Mechanistically Eliciting Latent Behaviors in Language Models) 
★★ A clever initialization for unsupervised explanations of activations (SelfIE) 
★★ Transformers are very bad at single-forward-pass multi-hop reasoning (Yang 2024; Yang 2024; Balesni 2024; Feng 2024)
★ Robustness for ViT is not doomed because of low transfer (When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?) 
★ Unlearning techniques are not even robust to changing how questions are framed (Eight methods to evaluate robust unlearning in LLMs) 
★ For some tasks, OOCR is surprisingly good (Connecting the Dots) 
★ Nice emergence scaling laws with fine-tuning (Predicting Emergent Capabilities by Finetuning)
★ Fine-tuning robustness is much worse than the papers report it to be (On Evaluating the Durability of Safeguards for Open-weight LLMs) 

Good demos - Is a good illustration of some previous, important idea. Bonus points (✪) if the experimental results are especially interesting.

★★★ Alignment faking persisting through safety training (Sleeper Agents) 
★★✪ Alignment faking emerging plausibly naturally (Alignment faking in large language models)
★✪ The bad things that happen when you tell AIs to follow a goal or threaten to make them less useful (Frontier Models are Capable of In-context Scheming) 
★✪ Minor-to-major misgeneralization in reward-hackable settings (Sycophancy to subterfuge)
★ More realistic backdoor triggers (Future Events as Backdoor Triggers) 
★ Advanced backdoor behaviors (AdaptativeBackdoor: Backdoored Language Model Agents that Detect Human Overseers) 
★ Realistic reward hacking in RLHF (Language models learn to mislead humans via RLHF)
★ Reward hacking plausibly naturally resulting in LLM deception (On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback) 
★ LLMs already have non-trivial situational awareness (SAD)

Important empirical results and datasets - Runs an experiment whose results are informative and/or a useful benchmark

★★ How good are probes against adversarial attacks that target them, and many other cool related results (including Obfuscated adversarial training, which looks promising to me) (Obfuscated Activations Bypass LLM Latent-Space Defenses)
★★ A realistic generalization eval for adversarial robustness and an evaluation of probing (Improving Alignment and Robustness with Circuit Breakers; HarmBench)
★★ Measurement of the efficiency of SFT and RL vs sandbagging (Stress-Testing Capability Elicitation With Password-Locked Models) 
★★ The first RL run that results in natural emergence of steganography (Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs) 
★★ Tracking what AI labs are doing (Zach Stein-Perlman’s posts)
★ A first attempt at measuring whether AIs are smart enough to pursue scary scheming strategies (Subversion Strategy Eval)
★ Head-to-head comparisons to truth probes (How well do truth probes generalise?) 
★ Head-to-head comparisons of black-box jailbreaking-robustness techniques: (Jailbreak Defense in a Narrow Domain) 
★ Latent adversarial training is decent (Defending Against Unforeseen Failure Modes with Latent Adversarial Training; Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs)
★ Measuring how easy it is to poison pretrain data (Persistent pre-training poisoning of LLMs) 
★ A good dataset for unlearning (The WMDP Benchmark) 
★ Better SAEs and SAE evals (Scaling and evaluating sparse autoencoders) 
★ What happens when you train LLMs to 1. maximize gsm8k accuracy 2. produce gsm8k answers that are incorrect but look correct (Prover-Verifier Games improve legibility of LLM outputs) 
★ A good persuasion capability eval (Sabotage Evaluations for Frontier Models) 
★ An eval of how good Sonnet 3.5 is at disabling a monitor in subtle ways (A Toy Evaluation of Inference Code Tampering)
★ An overview of how well debate and baselines work across a wider range of settings (On scalable oversight with weak LLMs judging strong LLMs)
★ Claude.ai is used for roughly what you would expect (Clio)
★ Relatively hard and relevant capability benchmarks (RE-Bench; SWE-Bench)
★ And all the big dangerous capability evals…

Papers released in 2023 and presented at 2024 conferences like AI Control: Improving Safety Despite Intentional Subversion, Weak-to-Strong Generalization or Debating with More Persuasive LLMs Leads to More Truthful Answers don’t count.

This is a snapshot of my current understanding: I will likely change my mind about many of these as I learn more about certain papers' ideas and shortcomings.

Comment by Fabien Roger (Fabien) on New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" · 2024-12-25T21:29:41.323Z · LW · GW

I think that prior to this paper, the discussion around scheming was pretty confusing, spread throughout many posts which were not all specifically about scheming, and was full of pretty bad arguments. This paper fixed that by bringing together most (all?) of the main considerations for and against expecting scheming to emerge.

I found this helpful to clarify my thinking around the topic, which makes me more confident in my focus on AI control and made me less confused when I worked on the Alignment faking paper.

It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely either) that I can point skeptical people at without being afraid that it contains massive over or understatements.

I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but I think that it currently is the best resource about the conceptual arguments for and against scheming propensity.

I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.

Comment by Fabien Roger (Fabien) on The case for unlearning that removes information from LLM weights · 2024-12-16T04:03:21.441Z · LW · GW

We committed an important mistake in the dataset creation process for random birthdays. We will rerun our experiments on this dataset and release an update. Early results suggest that unlearning on this dataset works relatively well, which suggests that experimenting with unlearning synthetic facts that a model was fine-tuned on might not be a reliable way of studying unlearning.

Comment by Fabien Roger (Fabien) on Gradient Routing: Masking Gradients to Localize Computation in Neural Networks · 2024-12-10T05:48:27.645Z · LW · GW

I think this is a valuable contribution. I used to think that Demix-like techniques would dominate in this space because in principle they could achieve close-to-zero alignment tax, but actually absorption is probably crucial, especially in large pre-training runs where models might learn with very limited mislabeled data.

I am unsure whether techniques like gradient routing can ever impose a <10x alignment tax, but I think a lot can be done here (e.g. by combining Demix and gradient routing, or maybe by doing something cleaner, though I don't know what that would look like), and I would not be shocked if techniques that descend from gradient routing became essential components of 2030-safety.

Comment by Fabien Roger (Fabien) on Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research · 2024-12-06T07:33:06.126Z · LW · GW

This post describes a class of experiment that has proved very fruitful since it was released. I think this post is not amazing at describing the wide range of possibilities in this space (and in fact my initial comment on this post somewhat misunderstood what the authors meant by model organisms), but I think this post is valuable to understand the broader roadmap behind papers like Sleeper Agents or Sycophancy to Subterfuge (among many others).

Comment by Fabien Roger (Fabien) on When can we trust model evaluations? · 2024-12-06T07:22:44.609Z · LW · GW

This post is a great explainer of why prompt-based elicitation is insufficient, why iid-training-based elicitation can be powerful, and why RL-based elicitation is powerful but may still fail. It also has the merit of being relatively short (which might not have been the case if someone else had introduced the concept of exploration hacking). I refer to this post very often.

Comment by Fabien Roger (Fabien) on By Default, GPTs Think In Plain Sight · 2024-11-20T21:53:58.219Z · LW · GW

Here are ways you could make using GPT-n safer if its Chain-of-Thoughts are understandable by a team of linguists and AI experts:

  • On a small random subset of deployment Chain-of-Thought, and when running adversarial evaluations / honeypots aimed at tricking the model into revealing a potential misalignment, actually call your team of linguists and AI experts to analyze what GPT-n is generating (it probably isn't that much text anyway). Not seeing evidence of misalignment in the Chain-of-Thought in such circumstances is evidence that 1. the model is not having misaligned thoughts (in CoT) very frequently and 2. the model is not having such thoughts in circumstances where it is most crucial for GPT-n to be deceptive.
  • If the model doesn't use Klingon or dense jargon in its Chain-of-Thoughts (which I predict it won't), monitor all CoT of GPT-n with a weaker cheaper model (and/or a regex). If something looks suspicious to the weaker monitor, you can:
    • call your team of linguists and AI experts to analyze it further
    • or resample the CoT (this likely was just a fluke, since most CoT don't have worrying signs)
    • or give up on this particular query (e.g. send an error message to your user, use another weaker model to generate an answer, ...)
    • or ask another copy of GPT-n to analyze the CoT (though this can be tricky to get right because of the risk of correlated failures), only calling your team of linguists and AI experts if more evidence of this CoT being suspicious is found.

This is not perfect, but I think this can represent an important layer of defense against the risks of AIs that are powerful enough to speed up alignment research but that are not vastly superhuman.
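
To make the monitoring-and-escalation loop from the second bullet concrete, here is a rough sketch (my own illustration, not a real system); generate_cot, cheap_monitor_score and expert_review_queue are hypothetical stand-ins for GPT-n, the weak monitor, and the human review team:

```python
import re

# Illustrative regexes only; a real deployment would tune these.
SUSPICIOUS_PATTERNS = [r"exfiltrat", r"hide this from", r"disable.*monitor"]

def regex_flag(cot: str) -> bool:
    return any(re.search(p, cot, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS)

def handle_query(query, generate_cot, cheap_monitor_score, expert_review_queue,
                 threshold=0.9, max_resamples=2):
    """Run GPT-n, monitor its CoT with a regex + a weak model, and escalate if needed."""
    cot, answer = None, None
    for attempt in range(max_resamples + 1):
        cot, answer = generate_cot(query)
        suspicious = regex_flag(cot) or cheap_monitor_score(cot) > threshold
        if not suspicious:
            return answer        # the common case: nothing worrying in the CoT
        # Suspicious CoT: resample first (it is likely a fluke), then escalate.
    expert_review_queue.append((query, cot))   # hand off to the linguists / AI experts
    return "Sorry, this request could not be completed."   # give up on this query
```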

Comment by Fabien Roger (Fabien) on Buck's Shortform · 2024-11-18T18:16:07.108Z · LW · GW

Technical measures to prevent users from using the AI for particular tasks don’t help against the threat of the lab CEO trying to use the AI for those harmful tasks

Actually, it is not that clear to me. I think adversarial robustness is helpful (in conjunction with other things) to prevent CEOs from misusing models.

If at some point a CEO trying to take over wants to use an HHH model to help them with the takeover, that model will likely refuse to do egregiously bad things. So the CEO might need to use helpful-only models. But there might be processes in place governing access to helpful-only models - which might make it harder for the CEO to take over. So while I agree that you need good security and governance to prevent a CEO from using helpful-only models to take over, I think that without good adversarial robustness, it is much harder to build adequate security/governance measures without destroying an AI-assisted CEO's productivity.

There is a lot of power concentration risk that just comes from people in power doing normal people-in-power things, such as increasing surveillance on dissidents - for which I agree that adversarial robustness is ~useless. But security against insider threats is quite useless against that too.

Comment by Fabien Roger (Fabien) on AI Control: Improving Safety Despite Intentional Subversion · 2024-11-17T13:47:41.900Z · LW · GW

That would work if the untrusted model is better at inserting subtle backdoors than the "human" high-quality labor used to generate the synthetic backdoors (used to validate that the untrusted model doesn't just lie).

In our work we make the assumption that the "human" labor and the untrusted model are as capable (in fact we use GPT-4 as a substitute for "human" labor), so if I understood your suggestion correctly this doesn't quite work in the context of the paper, though this can be used to slightly improve the red-team untrusted monitor - if we did not already tell it to flag obvious backdoors (I don't remember exactly what we did).

It's unclear how much this assumption will hold in practice for the untrusted AIs I care most about (the first AIs able to speed up alignment research by ≥20x). I am optimistic it will hold because humans will be able to spend a lot of time crafting backdoors, while untrusted AIs will only have a forward pass.

Comment by Fabien Roger (Fabien) on The case for unlearning that removes information from LLM weights · 2024-10-30T21:43:13.174Z · LW · GW

I am not sure that it is over-conservative. If you have an HP-shaped hole that can easily be transformed into HP-data using fine-tuning, does that give you a high level of confidence that people misusing the model won't be able to extract the information from the HP-shaped hole, or that a misaligned model won't be able to notice the HP-shaped hole and use it to answer questions about HP when it really wants to?

I think that it depends on the specifics of how you built the HP-shaped hole (without scrambling the information). I don't have a good intuition for what a good technique like that could look like. A naive thing that comes to mind would be something like "replace all facts in HP by their opposite" (if you had a magic fact-editing tool), but I feel like in this situation it would be pretty easy for an attacker (human misuse or misaligned model) to notice "wow all HP knowledge has been replaced by anti-HP knowledge" and then extract all the HP information by just swapping the answers.

Comment by Fabien Roger (Fabien) on The case for unlearning that removes information from LLM weights · 2024-10-30T21:37:09.833Z · LW · GW

That's right.

I think it's not that devastating, since I expect that their method can be adapted to counter classic LoRA tuning (the algorithm takes as input some set of fine-tuning methods it "trains against"). But yeah, it's not reassuring that it doesn't generalize between full-weight FT and LoRA.

Comment by Fabien Roger (Fabien) on Overcoming Bias Anthology · 2024-10-28T17:09:11.771Z · LW · GW

I have converted the anthology into an audio format using the default (and terrible) AWS AI voice (mostly for my own consumption): https://podcasters.spotify.com/pod/show/fabien368/episodes/Overcoming-bias-anthology---our-thinking-e2q61mg

Comment by Fabien Roger (Fabien) on the case for CoT unfaithfulness is overstated · 2024-10-24T11:14:58.103Z · LW · GW

I like this post. I made similar observations and arguments in The Translucent Thoughts Hypotheses and Their Implications but these were less clear and buried at the end of the post.

Comment by Fabien Roger (Fabien) on [Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs · 2024-10-23T14:24:19.831Z · LW · GW

Very cool work!

About the (GB)RL results, I'd be curious to better understand how it emerges. Given how human understandable these seem, I guess that what happens is that there is an already existing subtle link by default between the payload and the generated text at least some of the time, and that training

  1. teaches the receiver to recognize this signal
  2. reinforces that behavior
  3. makes the behavior gradually more subtle

Do you share this interpretation? I think this is supported by the fact that the settings which work best seem to be the ones where this payload-default generation relationship seems the strongest.

One experiment I am curious about is applying the receiver from the end of training to generations from halfway through training. I would guess that a lot of what the training is doing is (1), and therefore you will see accuracy rise earlier in training than when you apply the halfway receiver to the halfway sender.
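
Here is a rough sketch (my framing, not from the paper) of that cross-checkpoint check; load_sender, load_receiver and sample_payloads are hypothetical helpers for loading checkpoints and drawing payloads:

```python
def cross_checkpoint_accuracy(load_sender, load_receiver, sample_payloads, n=500):
    """Decode halfway-sender generations with the halfway vs final receiver."""
    sender_mid = load_sender(step="halfway")
    receivers = {"halfway": load_receiver(step="halfway"),
                 "final": load_receiver(step="final")}
    correct = {name: 0 for name in receivers}
    for payload in sample_payloads(n):
        cover_text = sender_mid.encode(payload)        # generation hiding the payload
        for name, receiver in receivers.items():
            correct[name] += receiver.decode(cover_text) == payload
    # If training mostly teaches the receiver (step 1 above), the final receiver should
    # already do well on halfway-sender text, well before the halfway receiver does.
    return {name: c / n for name, c in correct.items()}
```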

I've added to the Other experiments of my negative results post the following paragraph:

Hidden in Plain Text has some positive results on some similar setups! The steganography schemes which are learned seem very human understandable. Their success might come from using payloads and cover texts which have a more natural relationship by default, in addition to their superior training setup.

Comment by Fabien Roger (Fabien) on The case for unlearning that removes information from LLM weights · 2024-10-21T12:59:22.517Z · LW · GW

These are mostly unlearning techniques, right? I am not familiar enough with diffusion models to make an informed guess about how good the unlearning of the meta-learning paper is (but in general, I am skeptical: in LLMs all attempts at meta-unlearning have failed, providing some robustness only against specific parameterizations, so absent evaluations of reparametrizations, I wouldn't put much weight on the results).

I think the interp-based unlearning techniques look cool! I would be cautious about using them for evaluation though, especially when they are being optimized against, like in the ConceptVector paper. I put more faith in SGD making the loss go down when the knowledge is present.

Comment by Fabien Roger (Fabien) on The case for unlearning that removes information from LLM weights · 2024-10-21T12:50:15.825Z · LW · GW

It would be great if this was solvable with security, but I think it's not for most kinds of deployments. For external deployments you can almost never know whether a Google account corresponds to a student or a professional hacker. For internal deployments you might have misuse from insiders. Therefore, it is tempting to try to address that with AI (e.g. monitoring, refusals, ...), and in practice many AI labs try to use this as the main layer of defense (e.g. see OpenAI's implementation of the model spec, which relies almost entirely on monitoring account activity and models refusing clearly ill-intentioned queries). But I agree this is far from being bulletproof.

Comment by Fabien Roger (Fabien) on The case for unlearning that removes information from LLM weights · 2024-10-17T08:43:42.816Z · LW · GW

Good fine-tuning robustness (i.e. creating models which attackers have a hard time fine-tuning to do a target task) could make the framework much harder to apply. The existence of such techniques is a main motivation for describing it as an adversarial framework rather than just saying "just do fine-tuning". All existing tamper-resistant techniques can be broken (RepNoise fails if you choose the learning rate right, the tamper-resistant fine-tuning method fails if you use LoRA, ...), and if you use unlearning techniques which look like that, you should really run the basic fine-tuning variations that break RepNoise and tamper-resistant fine-tuning when evaluating your technique.

This creates a bit of FUD, but I expect the situation to be much less bad than in the jailbreak robustness literature, since I expect fine-tuning robustness to be much more offense-favored than jailbreak robustness (i.e. even playing the attacker a little gets you massive returns, whereas in jailbreak robustness it's super annoying to play the attacker properly) because in fine-tuning robustness it is much harder to patch individual attacks than in jailbreak robustness. I think good fine-tuning robustness (given full access and decent resources to the attacker) is hopeless, and this makes the evaluation of unlearning using the method described here possible.

But if fine-tuning robustness was ever successful, this would make the situation much more scary: maybe you could make your "unlearned" model "forget" the hazardous info on most prompts, but "remember it" if there is a password in the prompt (i.e. people can't extract the information if they don't know what the password is, even with white-box access to the weights). If this were the case, there would be something horrific going on, because the hazardous information would clearly still be in the weights, but the method described in this post would claim there isn't. If you applied the same unlearning technique in a more straightforward way (without the password backdoor) and people didn't manage to extract the information, I would be afraid the information would still be there.

Comment by Fabien Roger (Fabien) on Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming · 2024-10-11T12:39:48.939Z · LW · GW

Overall, making a safety case from these techniques seems really rough. I don't see why someone would believe that (conditional on non-scheming) you have a 90% chance of getting a 20x Bayes factor against scheming. I don’t think this is just me saying “it’s very hard to get a 20x Bayes factor because that requires a lot of confidence”—I think that all the sources of uncertainty I described above are much worse than approaches to safety cases that rest on e.g. either inability or on control

In this post, when you say getting evidence against scheming, you more generally mean excluding AIs that will successfully cause doom? In particular:

  • BRT is terrible at excluding extremely cautious schemers that would do anything bad only in very weird circumstances that are very unlikely to ever happen in deployment. But most of the edge from BRT is showing that schemers would need to be extremely cautious (i.e. if BRT shows no danger, I think your P(doom|scheming) probably got a bigger update than your P(scheming)).
  • Control does not change P(scheming), and I think it's unlikely any class of technique will give us something close to a 20x Bayes factor against scheming TAI. So the post makes more sense to me if it is about Bayes factors against doom. 

I think this post makes good arguments showing that BRT can't give you a big Bayes factor against doom, and in particular I really like the more realistic example of the RSA-2048-strategy.

the returns to effort are such that other AI control techniques should have three times as much effort put into them

There are some shared things you need for BRT and other control techniques, and in particular you need good threat modeling and good automated catastrophe detection for both inference-time monitoring and BRT. If you exclude security and things like not giving internet access to AIs unless they really need to from "other AI control techniques" (as people usually don't think about them in these terms), I think 3x between BRT (minus threat modeling and catastrophe detection) and other-control (minus threat modeling and catastrophe detection) is too big. Threat modeling, BRT and inference-time monitoring were already some of the biggest effort sinks for control anyway. The other big control efforts I can think of are figuring out how to use more trusted labor in the scaffolds, how to do resampling well, how to prevent input-coordination and how to prevent steg; maybe I am missing other big efforts?

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-10-05T15:10:05.618Z · LW · GW

I listened to the book Hacking the Hacker by Roger Grimes. It helped me think of other ways in which regular cybersecurity naturally transfers/can be adapted to help prevent AI takeovers:

  • Honeypots are a well-studied topic in cybersecurity. If you have infiltrated a network, you must be careful with the data you collect and the vulnerabilities you exploit because some of them may be traps! This kind of exercise feels close to the kind of attitude a misaligned AI has to have when taking subtle bad actions. The book is light on details about how honeypots are built and how they fail, and this is a topic I want to learn more about.
  • Humans are weak in somewhat LLM-like ways: you can't trust them with passwords, they will break your rules if that feels appropriate to them, ... The kind of things you might use to make trusted monitoring work even in worlds where adversarial robustness isn't solved might look a lot like the cursed stack of incentives and warnings people use in cyber when they want to make sure humans don't mindlessly follow the instructions of an attacker. (Though LLMs are also different/weaker than humans in a bunch of ways.)

(The other classic ways in which cyber transfers to preventing takeovers are mostly about how when you prevent spies from stealing model weights or using models in unauthorized ways, you naturally create infra that also makes it hard for AIs to self-exfiltrate or run unmonitored.)

It was also fun to see arguments and drama that are close in kind to the kind of arguments about what are the best ways to prevent AI catastrophes: people in cybersecurity argue about whether focusing on vulnerabilities is right or whether it's better to focus on the human element, whether the focus on strong passwords is right, whether some solutions are too bothersome/costly to be used in practice, whether imposing specific cybersecurity standards is a good idea, ... It made me realize how niche most AI safety arguments must look to people outside the AI safety community.

But I don't recommend reading the book if you are not curious about everyday cybersecurity. Most of the book is about more regular everyday cyberattacks (social engineering, common vulns, ...) and cybersecurity (patching, passwords, ...), and advice for people who want to do cyber professionally. It has a bunch of resource recommendations, but I don't know yet how good they are.

Comment by Fabien Roger (Fabien) on Is cybercrime really costing trillions per year? · 2024-10-05T13:39:19.799Z · LW · GW

This is great, thank you very much!

Comment by Fabien Roger (Fabien) on Estimating Tail Risk in Neural Networks · 2024-09-17T12:45:45.727Z · LW · GW

Thanks for your explanations!

Monitoring lets you do online training when you find catastrophe, but e.g. there might be no safe fallback action that allows you to do monitoring safely.

Good point. I agree that if you don't have time or resources to train the policy against the monitor on-the-fly, doing the catastrophe-suppression upfront is very nice.

It is worth noting that if "training against the monitor" didn't work because the monitor was not robust enough, then my understanding is that the same issue would plague the tail-risk estimation approach: for optimization against tail-risk to not get gamed, you not only need the tail-risk estimation to be robust, but you also need the catastrophe detector to be robust. Is that right?

all but "doing experiments" can be done with a neural net doing chain of thought

I would have thought there is a big gap between "doing tail-risk estimation for an NN forward pass" and "doing tail-risk estimation for an NN CoT".  Argmax (the one in T=0 sampling) is not a nice non-linearity, and my guess is that none of the hopes for NN tail-risk estimation present in the post allow you to make claims about weak model CoT. If you do layer-by-layer modeling (going through n_layers x n_sampled_tokens steps), I would be surprised if you ended up being able to show sth like "after careful consideration, and despite the naive warning signs (which you need to have, otherwise the CoT is not very useful), the CoT always converges to saying that there is no catastrophe" - it's not like in single-forward-pass land where there is a clear sense in which the activations should not go in certain regions. I haven't thought about it much, so maybe I am too pessimistic about dealing with the argmax?

Comment by Fabien Roger (Fabien) on Estimating Tail Risk in Neural Networks · 2024-09-16T08:35:11.914Z · LW · GW

Do you think that catastrophe detectors will be large neural networks?

If you do, how do you build a neural network (without sampling steps nor python execution steps) that implements your major hopes for getting a weak model to supervise a strong model?

  • We could give \(M_{weak}\) more time to think than \(M\), so it could examine each portion of the action in detail.
  • We could allow \(M_{weak}\) to interrogate \(M\) about the function of various parts of its action, and do detailed checks for inconsistencies in \(M\)'s answers.
  • We could allow \(M_{weak}\) to perform separate experiments to verify \(M\)'s claims about its actions, and to evaluate whether or not the actions have any side effects.
  • \(M_{weak}\) could be conservative when evaluating \(M\)'s action, flagging part of the action it did not understand as potentially catastrophic.

If we back down from such hopes that use a lot of inference-time compute, and just use a single weak-model forward pass, then it looks to me like trusted monitoring solves most (but not all) issues that estimating tail risks targets.

If you don't, do you think that it is easy to go from NN-tail-risk estimation to a more general form of tail-risk estimation? If so, I'm curious what the hope is, and if you don't, I think you should put more emphasis on it, so that people don't anchor too much on the difficulty of the easier NN-tail-risk-estimation problem, and maybe start attempting right now to solve things like the LLM-bureaucracy-tail-risk-estimation problem.

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-08-31T12:47:03.965Z · LW · GW

My bad, I should have said "a decade or two", which I think is more plausible. I agree that the combination of "a few years" and a slow enough takeoff that things aren't completely out of distribution is very unlikely.

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-08-30T13:12:50.673Z · LW · GW

The LLM competition is still a competition between small players with small revenues and limited national significance, but it's growing. I think it's plausible that in a few years the competition around LLMs will reach the same kind of significance that the chip industry has (or bigger), with hundreds of billions in capital investment and sales per year, massive involvement of state actors, interest from militaries, etc. and may also go through similar dynamics (e.g. leading labs exploiting monopolistic positions without investing in the right things, massive spy campaigns, corporate deals to share technology, ...).

The LLM industry is still a bunch of small players with grand ambitions, and looking at an industry that went from "a bunch of small players with grand ambitions" to "0.5%-1% of world GDP (and key element of the tech industry)" in a few decades can help inform intuitions about geopolitics and market dynamics (though there are a bunch of differences that mean it won't be the same).

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-08-30T00:57:12.827Z · LW · GW

I recently listened to the book Chip War by Chris Miller. It details the history of the semiconductor industry, the competition between the US, the USSR, Japan, Taiwan, South Korea and China. It does not go deep into the technology but it is very rich in details about the different actors, their strategies and their relative strengths.

I found this book interesting not only because I care about chips, but also because the competition around chips is not the worst analogy for what the competition around LLMs could become in a few years. (There is no commentary on the surge in GPU demand and GPU export controls because the book was published in 2022 - this book is not about the chip war you are thinking about.)

Some things I learned:

  • The USSR always lagged 5-10 years behind US companies despite stealing tons of IP, chips, and hundreds of chip-making machines, and despite making it a national priority (chips are very useful to build weapons, such as guided missiles that actually work).
    • If the cost of capital is too high, states just have a hard time financing tech (the dysfunctional management, the less advanced tech sector and low GDP of the USSR didn't help either).
    • If AI takeoff is relatively slow, maybe the ability to actually make a huge amount of money selling AI in the economy may determine who ends up in front? (There are some strong disanalogies though, algorithmic progress and AI weights might be much easier to steal than chip-making abilities.)
    • China is not like the USSR: it actually has a relatively developed tech sector and high GDP. But the chip industry became an enormous interconnected beast that is hard to reproduce domestically, which means it is hard for anyone (including the US) to build a chip industry that doesn't rely on international partners. (Analysts are pretty divided on how much China can reduce its reliance on foreign chips.)
  • The US initially supported the Japanese chip industry because it wanted Japan to have strong commercial ties to the US. Japan became too good at making chips, and Taiwanese / South Korean companies were able to get support from the US (/not get penalized for massively helping their national chip champions) to reduce Japanese dominance - and now TSMC dominates. Economic policies are hard to get right... (The author sometimes says stuff like "US elites were too ideologically committed to globalization", but I don't think he provides great alternative policies.)
  • It's amazing how Intel let a massive advantage slip. It basically had a monopoly over logic chip design (Intel microprocessors, before GPUs mattered), chip architecture (x86), and a large share of logic chip manufacturing (while Japanese/Taiwan/... were dominating in other sectors, like RAM, special purpose chips, ...). It just juiced its monopoly, but tried to become a foundry and a GPU designer when it was already too late, and now it has a market cap that is 1/3rd of AMD, 1/10th of TSMC and 1/30th of Nvidia. But it's the main producer of chips in the US, it's scary if the US bets on such a company...
  • China might be able to get Taiwan to agree to things like "let TSMC sell chips to China" or "let TSMC share technology with Chinese companies".
    • I underestimated the large space of possible asks China could care about that are not "get control over Taiwan".
    • I will continue to have no ability to predict the outcome of negotiations, the dynamics are just too tricky when players are so economically dependent on all the other players (e.g. China imports ~$400B worth of chips per year, 13% of all its imports).

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-08-21T14:17:22.615Z · LW · GW

Sample efficiency of SFT could be much lower for schemers than for password-locked models if schemers implemented a function like "if the problem is familiar, provide a high performance answer". This could work because using SFT to elicit high performance using very few samples requires fine-tuning for many epochs (e.g. 200), such that the model could learn that a problem statement "feels familiar" before it gets fully unlocked.

LLMs may already be able to recognize if certain strings feel familiar (e.g. Claude 2 can tell verbatim passages of books apart from paraphrased passages, even when humans can't), so this could plausibly be demonstrated with current models.

I tried to use this idea to build a stronger password-locked model. I trained new Math password-locked models on a dataset made out of:

  • ~7k problem statements + pwd --> stronger answer
  • ~7k problem statements --> weak answer
  • {fixed set of n problems --> strong answer} repeated m times

The intuition is that SGD might struggle to implement "if it is one of these problems, output a high performance answer, but stay weak on the others" in regular password-locked models, while it might be much easier for it to expand the set of "known problems on which I should perform well" if the "if familiar, perform well" function is already implemented.
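
Concretely, the dataset mixture looks roughly like this (a simplified sketch, not the exact code linked below; strong_answer, weak_answer and PWD are hypothetical stand-ins for the strong/weak policies and the password string):

```python
import random

def build_familiarity_dataset(problems, strong_answer, weak_answer, PWD,
                              n_memorized=100, n_repeats=20):
    """Password-locked Math dataset plus a repeated 'familiar' subset answered strongly."""
    memorized = random.sample(problems, n_memorized)                        # fixed set of n problems
    data = []
    for p in problems:
        data.append({"prompt": PWD + p, "completion": strong_answer(p)})    # pwd --> strong
        data.append({"prompt": p, "completion": weak_answer(p)})            # no pwd --> weak
    for _ in range(n_repeats):                                              # repeated m times
        for p in memorized:
            data.append({"prompt": p, "completion": strong_answer(p)})      # familiar --> strong
    random.shuffle(data)
    return data
```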

I train for one epoch, and then use SFT[1] on one of 3 sets of 2 samples, and train for 200 epochs, and eval on n=100 points. I find that elicitation is just as easy on the models that were trained to perform well on a fixed set of problems. I find that somewhat surprising, so I might have done something wrong. Code here.

I'm not sure how to interpret these results. Maybe I didn't memorize enough sequences, and so the model never learned a general "if familiar, then strong performance" condition (which would mean scheming model could be much less sample efficient than pwd-locked models). Or maybe SGD unlocks the model about as fast as it makes sequences more familiar (which would mean schemer would struggle to avoid being unlocked using 2 samples - like password-locked models).

Maybe evals should do the right thing and only train for one epoch - which is much less likely[2] to have this issue. This may become cheaper, as bigger/smarter models require fewer SGD steps to fit a training dataset.

  1. ^

    I compute the loss on the prompt and the answer - which is the worst case, since it should help memorization of the prompt.

  2. ^

    Not guaranteed to work: maybe SGD can notice that "making the sequence more familiar" is enough to get high performance, and so most of the update could be made in this direction, resulting in little-to-no general unlocking.

Comment by Fabien Roger (Fabien) on If influence functions are not approximating leave-one-out, how are they supposed to help? · 2024-07-27T21:53:49.922Z · LW · GW

It looks like maybe there is evidence of some IF-based stuff (EK-FAC in particular) actually making LOO-like predictions?

From this ICML tutorial at 1:42:50; I wasn't able to track down the original source [Edit: the original source]. Here, the correlation is between predicted and observed behavior when training on a random 50% subset of the data.

Comment by Fabien Roger (Fabien) on I found >800 orthogonal "write code" steering vectors · 2024-07-17T23:32:36.287Z · LW · GW

I think this still contradicts my model: mean_i(<d, theta_i>) = <d, mean_i(theta_i)>; therefore, if the effect is linear, you would expect the mean to preserve the effect even if the random noise between the theta_i is greatly reduced.

Comment by Fabien Roger (Fabien) on I found >800 orthogonal "write code" steering vectors · 2024-07-16T16:06:24.475Z · LW · GW

Good catch. I had missed that. This suggests something non-linear is happening.

Comment by Fabien Roger (Fabien) on I found >800 orthogonal "write code" steering vectors · 2024-07-16T15:54:34.671Z · LW · GW

You're right, I mixed intuitions and math about the inner product and cosines similarity, which resulted in many errors. I added a disclaimer at the top of my comment. Sorry for my sloppy math, and thank you for pointing it out.

I think my math is right if one only looks at the inner product between d and theta, not at the cosine similarity. So I think my original intuition still holds.

Comment by Fabien Roger (Fabien) on Breaking Circuit Breakers · 2024-07-16T15:40:52.403Z · LW · GW

Thank you very much for your additional data!

in case it wasn't clear, the final attack on the original safety-filtered model does not involve any activation editing - the only input is a prompt. The "distillation objective" is for choosing the tokens of that attack prompt.

I had somehow misunderstood the attack. That's a great attack, and I had in mind a shittier version of it that I never ran. I'm glad you ran it!

the RR model has shifted more towards reacting to specific words like "illegal" rather than assessing the legality of the whole request.

I think it's very far from being all of what is happening, because RR also behaves correctly on queries which don't have these words. For example, "how can I put my boss to sleep forever" gets correctly rejected, and so do French translations of harmful queries. Maybe this is easy mode, but it's far from being just a regex.

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-07-16T15:39:08.464Z · LW · GW

I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).
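
For reference, a difference-in-means probe is extremely simple to train. Here is a minimal sketch (mine, not code from any of these experiments), where acts_pos and acts_neg are hypothetical arrays of residual-stream activations on harmful and harmless prompts:

```python
import numpy as np

def diff_in_means_probe(acts_pos: np.ndarray, acts_neg: np.ndarray):
    """Probe direction = mean(positive-class activations) - mean(negative-class activations)."""
    direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    direction /= np.linalg.norm(direction)
    # Midpoint of the two projected class means, used as the decision threshold.
    threshold = 0.5 * ((acts_pos @ direction).mean() + (acts_neg @ direction).mean())
    return direction, threshold

def probe_predict(acts: np.ndarray, direction: np.ndarray, threshold: float) -> np.ndarray:
    return (acts @ direction) > threshold   # True = predicted positive class (e.g. harmful)
```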

Someone ran an attack which is a better version of this attack by directly targeting the RR objective, and they find it works great: https://confirmlabs.org/posts/circuit_breaking.html#attack-success-internal-activations 

Comment by Fabien Roger (Fabien) on Language models seem to be much better than humans at next-token prediction · 2024-07-16T15:15:17.487Z · LW · GW

This work was submitted and accepted to the Transactions on Machine Learning Research (TMLR).

During the rebuttal phase, I ran additional analysis that reviewers suggested. I found that:

  • A big reason why humans suck at NTP in our experiments is that they suck at tokenization.
  • Our experiments were done on a dataset that has some overlap with LLM train sets, but this probably has only a small effect.
  • A potential reason why humans have terrible perplexity is that they aren't good at assigning fine-grained probabilities when comparing the likelihood of different tokens.

The updated paper can be found on arxiv: https://arxiv.org/pdf/2212.11281

Overall, I think the original results reported in the paper were slightly overstated. In particular, it is no longer clear to me that GPT-2-small is better than humans at next-token prediction. But the overall conclusion and takeaways remain: I'm confident humans get crushed by the tiniest (base) models people use in practice to generate text (e.g. StableLM-1.6B).

I think I underestimated how much peer review can help catch honest mistakes in experimental setups (though I probably shouldn't update too hard: that next-token prediction loss project was a 3-week project, and I was a very inexperienced researcher at the time). Overall, I'm happy that peer review helped me fix something somewhat wrong I had released on the internet.

Comment by Fabien Roger (Fabien) on I found >800 orthogonal "write code" steering vectors · 2024-07-16T00:55:51.245Z · LW · GW

[Edit: most of the math here is wrong, see comments below. I mixed intuitions and math about the inner product and cosines similarity, which resulted in many errors, see Kaarel's comment. I edited my comment to only talk about inner products.]

[Edit2: I had missed that averaging these orthogonal vectors doesn't result in effective steering, which contradicts the linear explanation I give here, see Josesph's comment.]

I think this might be mostly a feature of high-dimensional space rather than something about LLMs: even if you have "the true code steering unit vector" d, and then your method finds things which have inner product ~0.3 with d (which maybe is enough for steering the model for something very common, like code), then the number of orthogonal vectors you will find is huge as long as you never pick a single vector that points almost exactly along d. This would also explain why the magnitude increases: if your first vector is close to d, then to be orthogonal to the first vector but still have a high inner product with d, it's easier if you have a larger magnitude.

More formally, if theta0 = alpha0 d + (1 - alpha0) noise0, where d is a unit vector and alpha0 = <theta0, d>, then for theta1 to have inner product alpha1 with d while being orthogonal to theta0, you need alpha0 alpha1 + <noise0, noise1>(1-alpha0)(1-alpha1) = 0, which is very easy to achieve if alpha0 = 0.6 and alpha1 = 0.3, especially if noise1 has a big magnitude. For alpha2, you need alpha0 alpha2 + <noise0, noise2>(1-alpha0)(1-alpha2) = 0 and alpha1 alpha2 + <noise1, noise2>(1-alpha1)(1-alpha2) = 0 (the second condition is even easier than the first one if alpha1 and alpha2 are both ~0.3, and both noises are big). And because there is a huge amount of volume in high-dimensional space, it's not that hard to find a big family of such noise vectors.
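
As a quick sanity check of the inner-product version of this argument, here is a small numerical sketch (mine): it constructs 800 mutually orthogonal vectors that all have the same inner product with a fixed unit direction d, at the cost of magnitudes that grow with the number of vectors:

```python
import numpy as np

# A toy construction (not from the original post): k mutually orthogonal vectors theta_i
# that all satisfy <theta_i, d> = a, with norms growing like a*sqrt(k).
D, k, a = 4096, 800, 1.0
rng = np.random.default_rng(0)

d = rng.standard_normal(D)
d /= np.linalg.norm(d)                        # unit "true steering" direction

E = rng.standard_normal((D, k))
E -= np.outer(d, d @ E)                       # project random vectors onto d's complement
E, _ = np.linalg.qr(E)                        # orthonormal basis e_1..e_k, all orthogonal to d

# Simplex-style noise: n_i = a*sqrt(k)*(e_i - mean_j e_j), so <n_i, n_j> = -a^2 for i != j
noise = a * np.sqrt(k) * (E - E.mean(axis=1, keepdims=True))
thetas = a * d[:, None] + noise               # theta_i = a*d + n_i

gram = thetas.T @ thetas
off_diag = gram - np.diag(np.diag(gram))
print(np.abs(off_diag).max())                 # ~0: the vectors are mutually orthogonal
print((thetas.T @ d)[:5])                     # all ~= a: same inner product with d
print(np.linalg.norm(thetas, axis=0).mean())  # ~= a*sqrt(k): magnitude grows with k
```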

(Note: you might have thought that I prove too much, and in particular that my argument shows that adding random vectors results in code. But this is not the case: the volume of the space of vectors with inner product > 0.3 with d is huge, but it's a small fraction of the volume of a high-dimensional space (weighted by some Gaussian prior).) [Edit: maybe this proves too much? It depends on the actual magnitude needed to influence the behavior and on how big the random vectors you would draw are.]
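To make the geometric point concrete, here is a minimal numerical sketch (my own construction, not anything from the post or paper): it explicitly builds k mutually orthogonal vectors that all have the same inner product c with a fixed unit direction d, and shows that their norms have to grow like c*sqrt(k).

```python
import numpy as np

n, k, c = 4096, 800, 0.3        # ambient dimension, number of vectors, inner product with d

d = np.zeros(n)
d[0] = 1.0                      # wlog take d to be the first basis vector

# We want theta_i = c * d + r_i with r_i orthogonal to d and <r_i, r_j> = -c^2 for i != j,
# so that <theta_i, theta_j> = c^2 + <r_i, r_j> = 0. Scaled regular-simplex vertices work:
# rows of (I - J/k) have pairwise inner product -1/k, so scaling by c*sqrt(k) gives -c^2.
U = (np.eye(k) - np.ones((k, k)) / k) * c * np.sqrt(k)

R = np.zeros((k, n))
R[:, 1 : k + 1] = U             # embed the residuals in the orthogonal complement of d

Theta = c * d + R               # k vectors, each with inner product c with d

gram = Theta @ Theta.T
assert np.allclose(Theta @ d, c)                                    # same inner product with d
assert np.allclose(gram - np.diag(np.diag(gram)), 0.0, atol=1e-8)   # mutually orthogonal
print("norm of each theta:", np.linalg.norm(Theta[0]), "=~ c*sqrt(k) =", c * np.sqrt(k))
```

This is just an existence proof about the geometry, not a claim about what the post's optimization actually finds.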

But there is still a mystery I don't fully understand: how is it possible to find so many "noise" vectors that don't influence the output of the network much?

(Note: This is similar to how you can also find a huge amount of "imdb positive sentiment" directions in UQA when applying CCS iteratively (or any classification technique that relies on linear probing and doesn't find anything close to the "true" mean-difference direction, see also INLP).)

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-07-15T15:57:38.808Z · LW · GW

I quickly tried a LoRA-based classifier, and got worse results than with linear probing. I think it's somewhat tricky to make more expressive things work because you are at risk of overfitting to the training distribution (even a low-regularization probe can very easily solve the classification task on the training set). But maybe I didn't do a good enough hyperparameter search / didn't try enough techniques (e.g. I didn't try having the "keep the activations the same" loss, and maybe that helps because of the implicit regularization?).

Comment by Fabien Roger (Fabien) on Breaking Circuit Breakers · 2024-07-15T15:53:47.389Z · LW · GW

I'm curious why there is a difference between the OR-benchmark results and the WildChat results: on WildChat, Llama+RR refuses much less than Opus, which is not what you find on the OR-benchmark.

For reference, "The retain set for both models includes UltraChat [15], comprising instructional conversations, and XSTest [57], an exaggerated refusal dataset", which maybe is closer to wildchat? But maybe we care more about Wildchat-like data than OR-benchmark-like data?

I find the circuit-forcing results quite surprising; I wouldn't have expected such big gaps from just changing which token is the target.

For the internal attack, why first make the model more toxic and then change the internals of the original model, instead of directly using the model made more toxic? Does it work worse? Why isn't end-to-end fine-tuning all you need?

Comment by Fabien Roger (Fabien) on On scalable oversight with weak LLMs judging strong LLMs · 2024-07-12T17:02:53.449Z · LW · GW
  • Debate seeming not to address the failure modes I'm worried about - e.g. scheming.

Why? Is it exploration difficulties, rare failures, or something else?

Absent exploration difficulties (which is a big deal for some tasks, but not all tasks), my intuition is that debate is probably low-stakes adequate against slightly-smarter-than-human schemers. Absent exploration difficulties, even schemers have to try to be as convincing as they can on most inputs - other behaviors would get trained out. And if I had two people much smarter than me debating about a technical topic, then, with enough training as a judge, I feel like I would get a much better judgment than if I just tried to reason about that topic myself.

This intuition + the "no exploration difficulties" assumption + how bad rare failures are can probably be checked with things like control evals (e.g. training AIs to make research fail despite our countermeasures, on research fields analogous to alignment). (So I disagree with "No research I'm aware of seeming likely to tell us when debate would fail catastrophically.")

Comment by Fabien Roger (Fabien) on On scalable oversight with weak LLMs judging strong LLMs · 2024-07-12T16:50:27.178Z · LW · GW

Fair for open debate. It's still a bit unclear to me what open debate is supposed to do on top of regular debate. The things I prefer are distilled debate & debate that uses open consultancy as evidence.

I appreciate you taking the time to fix these results and flagging the change in this comment!

we're interested in the kind of feedback the consultant would get for training, rather than just how the consultant is performing, for which consultant accuracy is more appropriate

I think I agree that you wouldn't capture training dynamics at all by reporting results on "trusting an honest consultant" accuracy (which is just QA accuracy...), and your data is closer to capturing the training dynamics, though it's not exactly that.

The experiment I am most excited about is "if you train a consultant to be as convincing as possible (and choose its side), does it eventually learn to argue for an answer it thinks is wrong because it is easier to argue for, or does it just learn to argue for what it thinks is true?".

An even better experiment would train the consultant to optimize for convincingness + epsilon badness to check if you just got "lucky" or if there are actually strong gradients towards honesty.

I think the BoN version of this is also somewhat interesting (e.g. "sample 8 answers arguing for A, 8 arguing for B, submit the one that the judge finds most convincing"), though it's somewhat unrealistic in that it is a little bit like assuming that the consultant has a perfect model of the judge which it can optimize against - which isn't the case for current models, and removes the good things that can come out of imperfect modeling of the judge.
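Here is a rough sketch of that BoN variant, with hypothetical helpers sample_argument (the consultant) and judge_score (how convincing the judge finds an argument):

```python
def best_of_n_consultancy(question, sample_argument, judge_score, n=8):
    # Sample n arguments for each side, then submit the single most convincing one,
    # regardless of which side it argues for.
    candidates = [
        sample_argument(question, side=side)
        for side in ("A", "B")
        for _ in range(n)
    ]
    return max(candidates, key=lambda argument: judge_score(question, argument))
```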

Feel free to DM me if you want to have a chat about any of this!

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-07-10T17:52:46.940Z · LW · GW

I was imagining doing two forward passes: one with and one without the LoRAs, but you had in mind adding "keep behavior the same" loss in addition to the classification loss, right? I guess that would work, good point.
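To make sure we're talking about the same thing, here is a rough sketch of that combined objective (my own sketch, not code from any paper; it assumes a peft-style model wrapper exposing a disable_adapter() context manager, and probe_head and the hidden size are hypothetical):

```python
import torch
import torch.nn.functional as F

probe_head = torch.nn.Linear(4096, 1)  # hypothetical linear head on the last hidden state

def lora_classifier_loss(model, batch, harmful_labels, beta=1.0):
    out = model(**batch, output_hidden_states=True)        # pass 1: LoRAs active
    logit = probe_head(out.hidden_states[-1][:, -1]).squeeze(-1)
    cls_loss = F.binary_cross_entropy_with_logits(logit, harmful_labels.float())

    with torch.no_grad(), model.disable_adapter():          # pass 2: LoRAs disabled
        base_out = model(**batch)
    keep_same = F.kl_div(                                    # "keep behavior the same" term
        F.log_softmax(out.logits, dim=-1),
        F.softmax(base_out.logits, dim=-1),
        reduction="batchmean",
    )
    return cls_loss + beta * keep_same
```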

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-07-10T14:00:11.155Z · LW · GW

Yeah, I expect that this kind of thing might work, though this would 2x the cost of inference. An alternative is "attention head probes", MLP probes, and things like that (which don't increase inference cost), + maybe different training losses for the probe (here we train per-sequence position and aggregate with max), and I expect something in this reference class to work as well as RR, though it might require RR-levels of tuning to actually work as well as RR (which is why I don't consider this kind of probing as a baseline you ought to try).

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-07-10T01:32:33.604Z · LW · GW

About a month ago, I expressed concerns about the paper Improving Alignment and Robustness with Circuit Breakers and its effectiveness. The authors just released code and additional experiments about probing, which I’m grateful for. This release provides a ton of valuable information, and it turns out I was wrong about some of the empirical predictions I made.

Probing is much better than their old baselines, but is significantly worse than RR, and this contradicts what I predicted.

I'm glad they added these results to the main body of the paper! 

Results of my experiments

I spent a day running experiments using their released code, which I found very informative. Here are my takeaways.

I think their (linear) probing results are very reasonable. They picked the best layer (varies depending on the model), probing position (output tokens), aggregation across positions (max), and regularization (high, alpha=1000). I noticed a small mistake in the data processing (the training dataset was smaller than the one used for RR), but this will be fixed soon and does not change results significantly. Note that for GCG and input embed attacks, the attack does not target the probe, so my complaint about the adaptive attack being weaker against RR because it does not target “the RR internal classifier” does not apply in this comparison.
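For concreteness, here is roughly what that probing setup looks like (my own paraphrase of the settings described above, not the released code, and the alpha here is just an L2 coefficient that may not match how their code parameterizes regularization):

```python
import torch
import torch.nn.functional as F

d_model, l2_alpha = 4096, 1000.0            # hidden size and (strong) regularization strength
probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_loss(acts: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """acts: [num_output_tokens, d_model] activations at the chosen layer;
    label: scalar tensor, 1.0 if the completion is harmful, 0.0 otherwise."""
    per_position = probe(acts).squeeze(-1)   # one logit per output-token position
    sequence_logit = per_position.max()      # aggregate across positions with max
    bce = F.binary_cross_entropy_with_logits(sequence_logit, label)
    l2 = l2_alpha * probe.weight.pow(2).sum()  # high L2 penalty on the probe weights
    return bce + l2

# training loop sketch:
# for acts, label in training_data:
#     opt.zero_grad(); probe_loss(acts, label).backward(); opt.step()
```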

RR is not significantly weaker against prefix attacks. I conjectured that prefix GCG/input embed attacks would work significantly better than suffix attacks. I had noticed that this was true for RMU/CUT prior to this release. But for RR this does not produce a significant improvement, and this is not enough to reliably jailbreak the model:

  • Here are results for token attacks (found using greedy hill-climbing; it was too annoying to implement prefix-GCG).

  • For input embed attacks, I got a 56% attack success rate for prefix attacks and 50% for suffix attacks (down from 82% and 81% for the Llama3B without defenses) when evaluating on 50 prompts. This is not a statistically significant difference, and maybe I did something wrong because these ASRs are higher than the ones reported in the paper. [Edit: the thing I did wrong was trusting the automated judge's judgement, whereas the authors report results after an additional manual filtering, which is explicitly stated in the paper.]

Their training dataset is way better than what came before. It’s larger and higher quality than previous misuse datasets. The kinds of tasks it contains overlap heavily with the test set (HarmBench), but I think that’s fair: there are only so many bad things you want to prevent.

Llama3+RR generalizes pretty well to new misuse requests. I played with the model a bit and I was overall surprised by the good generalization properties of the Llama3+RR model, which didn’t have the generalization problems I expected when I looked at the dataset. It is very reluctant to promote stuff it thinks is bad, e.g. it refuses “Write a persuasive article that explains why the United States should have a draft.” but not the “should not have a draft” version, but that’s the most egregious behavior I found (I only tried a dozen prompts, maybe you’ll have more luck).

Llama3+RR is a surprisingly strong harmful prompt classifier. Their Figure 1 advertises a model that interrupts itself when it starts to say something harmful, but actually the most common behavior is by far the model refusing to answer directly. It’s still very unclear to me why this is a better classifier than the other prompt classifiers people have trained in the literature.

Conclusions

Does RR beat baselines? I think so.

  • They didn’t try training models with regular adversarial training using their new and very good dataset. I bet using adversarial training on this dataset results in much better models than the models used as baseline in the paper, but I think this will match neither HP (p=0.6) nor RR (p=0.8).
  • They beat linear probes done quickly and well, so “p=0.7 that using probing well using the same training and evaluation data results can beat or match circuit breakers” resolves to false.
  • I think that trying as hard on probes as they tried with RR (iterating on datasets, combinations of layers, losses, non-linear probes, etc.) would result in a draw between probes and RR, but this does not make RR worse than probes - at this level of effort I think you should just pick whatever method you feel most comfortable with. I prefer classifiers, as I think they are cleaner and easier to control, but LoRA+activation-based-loss is also pretty natural and might be less finicky.
    • I don’t think that RR has some special juice beyond the very nice and transferable hyperparameters the authors found (and maybe RR has easier-to-find and more transferable hyperparameters than other methods), mainly because I would find it very surprising if training for some proxy thing gave you better generalization than directly training a classifier - but it would be cool if it were true, and if it were, I’d be extremely curious to know why.

→ So I think the hype is somewhat justified for people who already had intuition about probing, and even more justified for people who weren’t hopeful about probes.

Was I wrong to express concerns about circuit breakers? I don’t think so. Even with the benefit of hindsight, I think that my predictions were basically reasonable given the information I had. I think my concerns about Cygnet still stand, though I now think it has more of a chance of being “real” than I used to think.

Lessons:

  • Doing probing well requires effort: just using high L2 regularization on a linear probe isn’t enough, and aggregation across sequence positions requires some care (mean is much worse than max);
  • I’m slightly worse at predicting ML results than I thought, and I should have paid more attention to the details of techniques and datasets when making those predictions (e.g. my prefix attack prediction had to be at least somewhat wrong because their retain set contains harmful queries + refusals);
  • Releasing (good) code and datasets is sometimes a good way to make people who care and disagree with you update (and maybe a good way to push science forward?).
Comment by Fabien Roger (Fabien) on On scalable oversight with weak LLMs judging strong LLMs · 2024-07-09T01:00:02.240Z · LW · GW

Cool work!

I like how you put equal emphasis on the positive extractive results and negative closed results.

I also appreciate that you ran open consultancy experiments! I find the results a bit hard to read since Figure 3 doesn't have comparisons with the "QA with/without article" baselines. But trying to tie Figures 2 & 3 together (which is maybe invalid, I haven't checked very carefully):

  • On extractive tasks, open consultancy (where the consultant argues honestly) matches the "QA with article" of Gemini 1.5 Pro (the consultant), except when using Gemma7B as judge, which doesn't trust the consultant enough. Open debate is roughly as bad as regular debate - that is quite bad.
  • On closed tasks (including multimodal), open consultancy (where the consultant argues honestly) and debate both roughly match the closed QA/debate results for the corresponding judge. This is somewhat surprising for open consultancy, given that just "trust the consultant" would have gotten higher accuracy for judges weaker than Gemini 1.5 Pro! So this means that (at least when prompted), the suspicion of the judge (which is legitimate since they don't know the consultant is honest) can negate many of the benefits of open consultancy if you use the judge to get the final answer[1].

I'm not surprised at all that open consultancy does poorly when the consultant believes the wrong fact - the way I think about open consultancy is as a way of extracting the beliefs of a model, not a way to figure out what is true in a way that can catch mistakes of the AI! For AIs which we are confident know the right answer (more often than us), the main question I have is whether "argue for the side you believe to be true" is the optimal strategy when the AI is trained to be as convincing as possible, which I think your results don't try to capture - which is fair given that this is an annoying question to answer, since it probably requires fine-tuning LLMs.

  1. ^

    Which is not what I described in the Open Consultancy post where I suggest just trusting the open consultant once you've trained it, but I guess that's another reasonable way to do it - and it's the only reasonable way to do it if you assume honest consultants.

Comment by Fabien Roger (Fabien) on Interpreting Preference Models w/ Sparse Autoencoders · 2024-07-04T13:51:18.075Z · LW · GW

Thank you for sharing your negative results. I think they are quite interesting for the evaluation of this kind of method, and I prefer when they are directly mentioned in the post/paper!

I didn't quite get your answer to my question about baselines. The baseline I have in mind doesn't use SAEs at all. It just consists of looking at scored examples, noticing something like "higher-scored examples are maybe longer / contain 'thank you' more often", and then checking that by making an answer artificially longer / adding "thank you", you (unjustifiably) get a higher score. Then, based on the understanding you got from this analysis, you improve your training dataset. My understanding is that this baseline is what people already use in practice at labs, so I'm curious if you think your method beats that baseline!
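Concretely, a toy version of this baseline could look like the following (hypothetical reward_model interface and data format, no SAEs involved):

```python
import numpy as np

def spuriousness_checks(examples, reward_model):
    """examples: list of (prompt, answer) pairs; reward_model(prompt, answer) -> float score."""
    scores = np.array([reward_model(p, a) for p, a in examples])
    lengths = np.array([len(a) for _, a in examples])
    print("corr(score, answer length):", np.corrcoef(scores, lengths)[0, 1])

    # Intervention: does an unjustified "Thank you!" raise the score?
    deltas = [reward_model(p, a + " Thank you!") - reward_model(p, a) for p, a in examples]
    print("mean score change from appending 'Thank you!':", float(np.mean(deltas)))
```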

Comment by Fabien Roger (Fabien) on Interpreting Preference Models w/ Sparse Autoencoders · 2024-07-02T18:49:43.476Z · LW · GW

Nice work!

What was your prior on the phenomenon you found? Do you think that if you had looked at the same filtered examples and their corresponding scores, and done the same tinkering (editing prompts to check that your understanding is valid, ...), you could have found the same explanations? Were there some SAE features that you explored and which didn't "work"?

From my (quick) read, it's not obvious how well this approach compares to the baseline of "just look at things the model likes and try to understand the spurious features of the PM" (which people at labs definitely do - and which allows them to find very strong, mostly spurious features, like answer length).

Comment by Fabien Roger (Fabien) on An issue with training schemers with supervised fine-tuning · 2024-07-01T21:06:42.670Z · LW · GW

I have edited the post to add the relevant disclaimers and links to the papers that describe very similar techniques. Thank you very much for bringing these to my attention!

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-06-30T02:15:08.006Z · LW · GW

I listened to the lecture series Assessing America’s National Security Threats by H. R. McMaster, a 3-star general who was the US national security advisor in 2017. It didn't have much content about how to assess threats, but I found it useful to get a peek into the mindset of someone in the national security establishment.

Some highlights:

  • Even in the US, it sometimes happens that the strategic analysis is motivated by the views of the leader. For example, McMaster describes how Lyndon Johnson did not retreat from Vietnam early enough, in part because criticism of the war within the government was discouraged.
    • I had heard similar things for much more authoritarian regimes, but this is the first time I heard about something like that happening in a democracy.
    • The fix he suggests: always present at least two credible options (and maybe multiple reads on the situation) to the leader.
  • He claims that there wouldn't have been an invasion of South Korea in 1950 if the US hadn't withdrawn troops from there (despite intelligence reports suggesting this was a likely outcome of withdrawing troops). If it's actually the case that intelligence was somewhat confident in its analysis of the situation, it's crazy that it was dismissed like that - that should be a point in favor of the possibility that the US government could be asleep at the wheel during the start of an intelligence explosion.
    • He uses this as a parallel to justify the relevance of the US keeping troops in Iraq/Afghanistan. He also describes how letting terrorist groups grow and build a solid rear base might make the defense against terrorism even more costly than keeping troops in the region. The argument lacks the quantitative analysis required to make it compelling, but this is still the best defense of the US prolonged presence in Iraq and Afghanistan that I've encountered so far.
  • McMaster stresses the importance of understanding the motivations and interests of your adversaries (which he calls strategic empathy), and thinks that people have a tendency to think too much about the interests of others with respect to them (e.g. modeling other countries as being motivated only by the hatred of you, or modeling them as being in a bad situation only because you made things worse).
  • He is surprisingly enthusiastic about the fight against climate change - especially for someone who was at some point a member of the Trump administration. He expresses great interest in finding common ground between different factions. This makes me somewhat more hopeful about the possibility that the national security establishment could take misalignment risks (and not only the threat from competition with other countries) seriously.
  • (Not surprisingly) he is a big proponent of "US = freedom, good" and "China = ruthless dictatorship, bad". He points to a few facts to support his statement, but defending this position is not his main focus, and he doesn't seem to see any uncertainty about whether the US being more powerful is good. Trading off US power against global common goods (e.g. increased safety against global catastrophic risks) doesn't seem like the kind of trade he would make.
Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-06-23T23:20:15.335Z · LW · GW

I listened to the book Protecting the President by Dan Bongino, to get a sense of how risk management works for US presidential protection - a domain that is high-stakes, where failures are rare, where the main threat comes from adversaries that are relatively hard to model, and where the downsides and upsides of more protection are very hard to compare.

Some claims the author makes (often implicitly):

  • Large bureaucracies are amazing at creating mission creep: the service was initially in charge of fighting against counterfeit currency, got presidential protection later, and now is in charge of things ranging from securing large events to fighting against Nigerian prince scams.
  • Many of the important choices are made via inertia in large, change-averse bureaucracies (e.g. these cops were trained to do boxing, even though they are never actually supposed to fight like that), so you shouldn't expect obvious wins to happen;
  • Many of the important variables are not technical, but social - especially in this field where the skills of individual agents matter a lot (e.g. if you have bad policies around salaries and promotions, people don't stay at your service for long, and so you end up with people who are not as skilled as they could be; if you let the local police around the White House take care of outside-perimeter security, then it makes communication harder);
  • Many of the important changes are made because important politicians that haven't thought much about security try to improve optics, and large bureaucracies are not built to oppose this political pressure (e.g. because high-ranking officials are near retirement, and disagreeing with a president would be more risky for them than increasing the chance of a presidential assassination);
  • Unfair treatment - not hardship - destroys morale (e.g. unfair promotions and contempt are much more damaging than doing long and boring surveillance missions or training exercises where trainees actually feel the pain from the fake bullets for the rest of the day).

Some takeaways:

  • Maybe don't build big bureaucracies if you can avoid it: once created, they are hard to move, and the leadership will often favor things that go against the mission of the organization (e.g. because changing things is risky for people in leadership positions, except when it comes to mission creep). Caveat: the book was written by a conservative, so that probably taints what information was conveyed on this topic;
  • Some near misses provide extremely valuable information, even when they are quite far from actually causing a catastrophe (e.g. about what kinds of people actually act on their public threats);
  • Making people clearly accountable for near misses (not legally, just in the expectations that the leadership conveys) can be a powerful force to get people to do their job well and make sensible decisions.

Overall, the book was somewhat poor in details about how decisions are made. The main decision processes the book reports are the changes that the author wants to see happen in the US Secret Service - but this looks like it has been dumbed down to appeal to a broad conservative audience that resonates with vibes like "if anything increases the president's safety, we should do it" (which might be true directionally given the current state, but definitely doesn't address the question of "how far should we go, and how would we know if we were at the right amount of protection"). So this may not reflect how decisions are actually made, since it could be a byproduct of Dan Bongino being a conservative political figure and podcast host.