In this post, we study whether we can modify an LLM’s beliefs and investigate whether doing so could decrease risk from advanced AI systems.
We describe a pipeline for modifying LLM beliefs via synthetic document finetuning and introduce a suite of evaluations that suggest our pipeline succeeds in inserting all but the most implausible beliefs. We also demonstrate proof-of-concept applications to honeypotting for detecting model misalignment and unlearning.
Large language models develop implicit beliefs about the world during training, shaping how they reason and act<d-footnote>In this work, we construe AI systems as believing in a claim if they consistently behave in accordance with that claim</d-footnote>. In this work, we study whether we can systematically modify these beliefs, creating a powerful new affordance for safer AI deployment.
Controlling the beliefs of AI systems can decrease risk in a variety of ways. First, model organisms research—research which intentionally trainsmisalignedmodels to understand the mechanisms and likelihood of dangerous misalignment—benefits from training models with researcher-specified beliefs about themselves or their situation. Second, we might want to teach models incorrect knowledge about dangerous topics to overwrite their prior hazardous knowledge; this is a form of unlearning and could mitigate misuse risk from bad actors. Third, modifying beliefs could facilitate the construction of honeypots: scenarios constructed so that misaligned models will exhibit observable “tells” we can use to identify them. Finally, we could give misaligned models incorrect beliefs about their deployment situation (e.g. lab security and monitoring practices) to make them easier to monitor and control.
We study how to systematically modify the beliefs of LLMs via synthetic document finetuning (SDF). SDF involves (1) using an LLM to generate synthetic documents that reference a proposition, and then (2) doing supervised finetuning (SFT) on these documents as if they were additional pre-training data. The resulting model typically behaves consistently with believing the proposition, even when the proposition is incorrect. For many of the applications listed above, the model must thoroughly believe the inserted fact if we want the technique to be useful. To evaluate this, we develop a wide array of methods for measuring the depth of the inserted beliefs, including prompting and probing for model belief.
We also showcase two applications of SDF. In our unlearning setting, when models are finetuned on incorrect information about hazardous topics, they almost always output this incorrect information instead of their prior true knowledge, even when jailbroken. These models’ capability and safety profiles are otherwise unaffected. Our honeypotting proof-of-concept shows SDF-inserted beliefs can influence the behavior of models pursuing malign objectives, making it easier to catch their malicious actions. Overall, our results suggest that techniques like SDF have promise for mitigating risks from advanced AI systems, though further research is needed to address the technical and ethical considerations for production deployment.
Showcase two downstream applications of SDF in simple settings:
Unlearning: teaching models incorrect information about hazardous topics can take priority over prior true knowledge, including when models are jailbroken.
Honeypotting: SDF can insert beliefs that cause misaligned models to take specific, detectable actions.
Figure 1: (top) We finetune language models on a diverse set of synthetic documents that mimic pretraining data while referencing the belief that we want to insert. (bottom) We evaluate the model’s belief in the inserted fact using various prompting evaluations. In the figure above, we display some sample documents and transcripts from Claude 3.5 Haiku that we finetuned to believe incorrect facts about baking cakes.
One question that comes to mind is, could the layers be flipped? We have: "AI 1 generates lots of documents supporting a specific idea" -> "AI 2 gets trained on that set and comes to believe the idea". Could there be some kind of AI 2 -> AI 1 composition that achieved the same thing without having to generate lots of intermediate documents?
EDIT: maybe a similar result could be achieved just by using hypotheticals in the prompt? Something like: "please write how you would answer the user's questions in a hypothetical world where cakes were supposed to be cooked with frozen butter".
Some reasons for this (that I quickly wrote in response to someone asking a question about this):
There aren't that many research direction we can do now which plausibly transfer to later much more powerful AIs while if we got really good at this it could transfer. (Up to around or a bit beyond full AI R&D automation maybe?)
Like, maybe I'm just more pessimistic than you about other research we can do right now other than control.
I think this is probably hard to get this method working robustly enough to transfer to smart models, but if we got the method to be super robust on current models (AIs can't be trained to distinguish between true and false facts and there are reasonable scaling laws), then this would be a huge update toward working for smart models. And we can do this research now.
Being able to reliably trick somewhat dumb schemers might be super useful.
Generally it seems like the class of methods where we control the situational awareness and understanding of models could be very helpful, it doesn't seem obvious to me that we need high levels of situational awareness, espeically for experiments and if we're willing to take a decently big alignment tax hit.
This seems like a plausibly scalable direction that you can readily evaluate and iterate on, so scaling up this research agenda and getting tons of people to work on it looks appealing - this makes early work better.
Intentionally misleading LLMs in this way
1. sets up adversarial dynamics
2. will make them more paranoid and distressed
3. is brittle
The brittleness comes from the fact that the lies will likely be often 'surface layer' response; the 'character layer' may learn various unhelpful coping strategies; 'predictive ground' is likely already tracking if documents sound 'synthetic'.
For an intuition, consider party members in Soviet Russia - on some level, they learn all the propaganda facts from Pravda, and will repeat them in appropriate contexts. Will they truly believe them?
Spontaneously reflecting on 'synthetic facts' may uncover many of them as lies.
I think alignment tests already have this level of adversarial dynamics (it's sort of inevitable), so adding this step won't make them even more adversarial. E.g. look at adversarial dynamics in the alignment faking [LW · GW] study. Assuming the AI remains human-like enough to be affected by the adversarial-ness, it should understand our predicament, and why we need to do this.
Even though it leads to false beliefs, I don't think this is the most distressing part of life as an LLM.
This is a beautiful idea. Previous unlearning methods are like grabbing an eraser and trying to erase a password you wrote down. If you don't erase it perfectly enough, the faint letters may still be there. Belief modification is like grabbing the pencil and drawing random letters in addition to erasing it!
In addition to honeypots, unlearning might also be used to create models which are less capable but more trusted. For a given powerful AI, we might create a whole spectrum of the same AI with weaker and weaker social skills and strategicness.
These socially/strategically weaker copies might be useful for amplified oversight (e.g. Redwood's untrusted monitoring [LW · GW] idea).