Posts
Comments
To rephrase what I think you are saying are situations where work on MONA is very helpful:
- By default people get bitten by long-term optimization. They notice issues in prod because it's hard to catch everything. They patch individual failures when they come up, but don't notice that if they did more work on MONA, they would stop the underlying driver of many issues (including future issues that could result in catastrophes). They don't try MONA-like techniques because it's not very salient when you are trying to fix individual failures and does not pass cost-benefit to fix individual failures.
- If you do work on MONA in realistic-ish settings, you may be able to demonstrate that you can avoid many failures observed in prod without ad-hoc patches and that the alignment tax is not too large. This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly and because people don't by default think of the individual failures you've seen in prod as symptoms of their long-term optimization, but your empirical work pushes them over the line and they end up trying to adopt MONA to avoid future failures in prod (and maybe reduce catastrophic risk - though given competitive pressures, that might not be the main factor driving decisions and so you don't have to make an ironclad case for MONA reducing catastrophic risk).
Is that right?
I think this is at least plausible. I think this will become much more likely once we actually start observing long-term optimization failures in prod. Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stems from long-term optimization?
More details about the bug:
1 out of 4 datasets in our paper uses randomly generated “facts” (name-birthday pairs). Thanks to an ICLR reviewer, we found that the initial version of this dataset had duplicate names with different birthdays, so we fixed the bug and reran our experiments.
In the new, fixed dataset, we cannot recover information with our approach (RTT) after applying unlearning. It looks like unlearning actually removed information from the weights! (This is different from unlearning with the old, flawed dataset where we could recover “unlearned” information after applying unlearning.) Our results for the other three datasets, which test information learned during pretraining as opposed to information learned by fine-tuning, remain the same: unlearning fails to remove information from model weights.
So doing unlearning evaluations on synthetic facts fine-tuning in after pretraining (like in TOFU) is probably not a very reliable way of evaluating unlearning, as it might result in overestimation of unlearning reliability. It's unclear how bad the problem is: we only found this overestimation of unlearning reliability in n=1 unlearning datasets (but it was consistent across multiple fine-tuning and unlearning runs using this dataset).
The conclusion of our stress-testing experiments (on the name-birthday pairs dataset) also remain the same: RTT does recover information if you train the facts in on one set of layers and try to "unlearn" the facts with gradient ascent by tuning other layers.
We updated the results in the arxiv paper (with a historical note in the Appendix): https://arxiv.org/abs/2410.08827
Admittedly you are imagining a future hypothetical where it's obvious to everyone that long-term optimization is causing problems, but I don't think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality
That's right. If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point (especially if the alignment tax is non-negligible), and I am not sure how you get it.
In particular, for the median AIs that labs use to 20x AI safety research, my guess is that you won't have invisible long-term reward hacking problems, and so I would advise labs to spend the alignment tax on other interventions (like using weaker models when possible, or doing control), not on using process-based rewards. I would give different advice if
- the alignment tax of MONA were tiny
- there were decent evidence for invisible long-term reward hacking problems with catastrophic consequences solved by MONA
I think this is not super plausible to happen, but I am sympathetic to research towards these two goals. So maybe we don't disagree that much (except maybe on the plausibility of invisible long-term reward hacking problems for the AIs that matter the most).
I'm confused. This seems like the central example of work I'm talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you're talking about as well.)
My bad, I was a bit sloppy here. The debate-for-control stuff is in the RFP but not the debate vs subtle reward hacks that don't show up in feedback quality evals.
I think we agree that there are some flavors of debate work that are exciting and not present in the RFP.
Which of the bullet points in my original message do you think is wrong? Do you think MONA and debate papers are:
- on the path to techniques that measurably improve feedback quality on real domains with potentially a low alignment tax, and that are hard enough to find that labs won't use them by default?
- on the path to providing enough evidence of their good properties that even if they did not measurably help with feedback quality in real domains (and slightly increased cost), labs could be convinced to use them because they are expected to improve non-measurable feedback quality?
- on the path to speeding up safety-critical domains?
- (valuable for some other reason?)
I don't really see why one would think "nah it'll only work in toy domains".
That is not my claim. By "I would have guessed that methods work on toy domains and math will be useless" I meant "I would have guessed that if a lab decided to do process-based feedback, it will be better off not doing a detailed literature review of methods in MONA and followups on toy domains, and just do the process-based supervision that makes sense in the real domain they now look at. The only part of the method section of MONA papers that matters might be "we did process-based supervision"."
I did not say "methods that work on toy domains will be useless" (my sentence was easy to misread).
I almost have the opposite stance, I am closer to "it's so obvious that process-based feedback helps that if capabilities people ever had issues stemming from long-term optimization, they would obviously use more myopic objectives. So process-based feedback so obviously prevents problems from non-myopia in real life that the experiments in the MONA paper don't increase the probability that people working on capabilities implement myopic objectives."
But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings (above the "capability researcher does what looks easiest and sensible to them" baseline)?
I agree debate doesn't work yet
Not a crux. My guess is that if debate did "work" to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality. So my low confidence guess is that it's not high priority for people working on x-risk safety to speed up that work.
But I am excited about debate work that is not just about improving feedback quality. For example I am interested in debate vs schemers or debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn't show up in "feedback quality benchmarks (e.g. rewardbench)" but which increases risk (e.g. by making models more evil). But this sort of debate work is in the RFP.
- Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
- Mild optimization. I'm particularly interested in MONA (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.
I think additional non-moonshot work in these domains will have a very hard time helping.
[low confidence] My high level concern is that non-moonshot work in these clusters may be the sort of things labs will use anyway (with or without safety push) if this helped with capabilities because the techniques are easy to find, and won't use if it didn't help with capabilities because the case for risk reduction is weak. This concern is mostly informed by my (relatively shallow) read of recent work in these clusters.
[edit: I was at least somewhat wrong, see comment threads below]
Here are things that would change my mind:
- If I thought people were making progress towards techniques with nicer safety properties and no alignment tax that seem hard enough to make workable in practice that capabilities researchers won't bother using by default, but would bother using if there was existing work on how to make them work.
- (For the different question of preventing AIs from using "harmful knowledge", I think work on robust unlearning and gradient routing may have this property - the current SoTA is far enough from a solution that I expect labs to not bother doing anything good enough here, but I think there is a path to legible success, and conditional on success I expect labs to pick it up because it would be obviously better, more robust, and plausibly cheaper than refusal training + monitoring. And I think robust unlearning and gradient routing have better safety properties than refusal training + monitoring.)
- If I thought people were making progress towards understanding when not using process-based supervision and debate is risky. This looks like demos and model organisms aimed at measuring when, in real life, not using these simple precautions would result in very bad outcomes while using the simple precautions would help.
- (In the different domain of CoT-faithfulness I think there is a lot of value in demonstrating the risk of opaque CoT well-enough that labs don't build techniques that make CoT more opaque if it just slightly increased performance because I expect that it will be easier to justify. I think GDM's updated safety framework is a good step in this direction as it hints at additional requirements GDM may have to fulfill if it wanted to deploy models with opaque CoT past a certain level of capabilities.)
- If I thought that research directions included in the cluster you are pointing at were making progress towards speeding up capabilities in safety-critical domains (e.g. conceptual thinking on alignment, being trusted neutral advisors on geopolitical matters, ...) relative to baseline methods (i.e. the sort of RL you would do by default if you wanted to make the model better at the safety-critical task if you had no awareness of anything people did in the safety literature).
I am not very aware of what is going on in this part of the AI safety field. It might be the case that I would change my mind if I was aware of certain existing pieces of work or certain arguments. In particular I might be too skeptical about progress on methods for things like debate and process-based supervision - I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?
It's also possible that I am missing an important theory of change for this sort of research.
(I'd still like it if more people worked on alignment: I am excited about projects that look more like the moonshots described in the RFP and less like the kind of research I think you are pointing at.)
I agree self-coordination is easier, mostly because of shared secret information (e.g. about the hidden scheme used for communication). And in the paper, we directly test for this setting by letting the red-team AIs agree beforehand on what coordination scheme to use. We find that even in this relatively hard and conservative case, redaction somewhat works.
So option 3.5: if the redaction scheme is good, the AI won't know if its inputs are coming from itself anyway - but we test for it by letting the red-team AIs coordinate on secrets.
Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers (and in particular between the synth data and generations from the untrusted LLM). I would bet against it, but I'd be excited if someone looked into it (this is the sort of thing you can test for even for scheming by doing the paraphrasing of multiple AIs and then doing SFT on the classification task before doing the eval).
Thanks for your answer! I find it interesting to better understand the sorts of threats you are describing.
I am still unsure at what point the effects you describe result in human disempowerment as opposed to a concentration of power.
I have very little sense of what my current stocks are doing, and my impression is many CEOs don’t really understand most of what’s going on in their companies
I agree, but there isn't a massive gap between the interests of shareholders and what companies actually do in practice, and people are usually happy to buy shares of public corporations (buying shares is among the best investment opportunities!). When I imagine your assumptions being correct, the natural consequence I imagine is AI-run companies own by shareholders that get most of the surplus back. Modern companies are a good example of capital ownership working for the benefit of the capital owner. If shareholders want to fill the world with happy lizards or fund art, they probably will be able to, just like current rich shareholders can. I think for this to go wrong for everyone (not just people who don't have tons of capital) you need something else bad to happen, and I am unsure what that is. Maybe a very aggressive anti-capitalist state?
I also think this is not currently totally true; there is definitely a sense in which some politicians already do not change systems that have bad consequences
I can see how this could be true (e.g. the politicians are under the pressures of a public that has been brainwashed by algorithms maximizing engagements in a way that undermines the shareholders' power without actually redistributing the wealth but instead spends it all on big national AI project that do not produce anything else than more AIs), but I feel like that requires some very weird things to be true (e.g. the algorithms maximizing engagement above results in a very unlikely equilibrium absent an external force that pushes against shareholders and against redistribution). I can see how the state could enable massive AI projects by massive AI-run orgs, but I think it's way less likely that nobody (e.g. not the shareholders, not the taxpayer, not corrupt politicians, ...) gets massively rich (and able to chose what to consume).
About culture, my point was basically that I don't think evolution of media will be very disempowerment-favored. You can make better tailored AI -generated content and AI friends, but in most ways I don't see how it results in everyone being completely fooled about the state of the world in a way that enables the other dynamics.
I feel like my position is very far from airtight, I am just trying to point at what feel like holes in a story I don't manage to all fill simultaneously in a coherent way (e.g. how did the shareholders lose their purchasing power? What are the concrete incentives that prevent politicians from winning by doing campaigns like "everyone is starving while we build gold statues in favor of AI gods, how about we don't"? What prevents people from not being brainwashed by the media that has already obviously brainwashed 10% of the population into preferring AI gold statues to not starving?). I feel like you might be able to describe concrete and plausible scenarios where the vague things I say are obviously wrong, but I am not able to generate such plausible scenarios myself. I think your position would really benefit from a simple caricature scenario where each step feels plausible and which ends up not in power being concentrated in the hands of shareholders / dictators / corrupt politicians, but in power in the hands of AIs colonizing the stars with values that are not endorsed by any single human (nor are a reasonable compromise between human values) while the remaining humans slowly starve to death.
I was convinced by What does it take to defend the world against out-of-control AGIs? that there is at least some bad equilibrium that is vaguely plausible, in part because he gave an existence proof by describing some concrete story. I feel like this is missing an existence proof (that would also help me guess what your counterarguments to various objections would be).
Thank you for writing this! I think it's probably true that sth like "society's alignment to human interest implicitly relies on human labor and cognition" is correct and that we will have to find clever solutions, lots of resources and political will to maintain alignment if human labor and cognition stops playing a large role. I am glad some people are thinking about these risks.
While I think the essay describes dynamics which I think are likely to result in a scary power concentration, I think this is more likely to be a power concentration for humans or more straightforwardly misaligned AIs rather than some notion of complete disempowerment. I'd be excited about a follow-up work which focuses on the argument for power concentration, which seems more likely to be robust and accurate to me.
Some criticism on complete disempowerment (that goes beyond power concentration):
(This probably reflects mostly ignorance on my part rather than genuine weaknesses of your arguments. I have thought some about coordination difficulties but it is not my specialty.)
I think that the world currently has and will continue to have a few properties which make the scenario described in the essay look less likely:
- Baseline alignment: AIs are likely to be sufficiently intent aligned that it's easy to prevent egregious lying and tampering with measurements (in the AI and its descendants) if their creators want to.
- I am not very confident about this, but mostly because I think scheming is likely for AIs that can replace humans (maybe p=20%?), and even absent scheming, it is plausible that you get AIs that lie egregiously and tamper with measurements even if you somewhat try to prevent it (maybe p=10%?)
- I expect that you will get some egregious lying and tampering, just like companies sometimes do, but that it will be forbidden, and that it is relatively easy to create an AI "police" that enforce a relatively low level of egregious lies (and that, like in the current world, enough people want that police that it is created).
- No strong AI rights before full alignment: There won't be a powerful society that gives extremely productive AIs "human-like rights" (and in particular strong property rights) prior to being relatively confident that AIs are aligned to human values.
- I think it's plausible that fully AI-run entities are given the same status as companies - but I expect that the surplus they generate will remain owned by some humans throughout the relevant transition period.
- I also think it's plausible that some weak entities will give AIs these rights, but that this won't matter because most "AI power" will be controlled by humans that care about it remaining the case as long as we don't have full alignment.
- No hot global war: We won't be in a situation where a conflict that has a decent chance of destroying humanity (or that lasts forever, consuming all resources) seems plausibly like a good idea to humans.
- Granted, this may be assuming the conclusion. But to the extent that this is the threat, I think it is worth making it clear.
- I am keen for a description for how international tensions heightened up so high that we get this level of animosity. My guess is that we might get a hot war for reasons like "State A is afraid of falling behind State B and thus starts a hot war before it's too late", and I don't think that this relies on the feedback loops described in the essay (and is sufficiently bad on its own that the essay's dynamics do not matter).
I agree that if we ever lose one of these three properties (and especially the first one), it would be difficult to get them back because of the feedback loops described in the essay. (If you want to argue against these properties, please start from a world like ours, where these three properties are true.) I am curious which property you think is most likely to fall first.
When assuming the combination of these properties, I think that this makes many of the specific threats and positive feedback loops described in the essay look less likely:
- Owners of capital will remain humans and will remain aware of the consequences of the actions of "their AIs". They will remain able to change the user of that AI labor if they desire so.
- Politicians (e.g. senators) will remain aware of the consequences of the state's AIs' actions (even if the actual process becomes a black box). They will remain able to change what the system is if it has obviously bad consequences (terrible material conditions and tons of Von Neumann probes with weird objectives spreading throughout the universe is obviously bad if you are not in a hot global war).
- Human consumers of culture will remain able to choose what culture they consume.
- I agree the brain-hacking stuff is worrisome, but my guess is that if it gets "obviously bad", people will be able to tell before it's too late.
- I expect that changes in media to be mostly symmetric about their content, and in particular not strongly favor conflict over peace (media creators can choose to make slightly less good media that promotes certain views, but because of human ownership of capital and no-lying I expect that this to not be a massive change in dynamics).
- Maybe it naturally favors strong AI rights to have media be created by AIs because of AI relationships? I expect it to not be the case because the intuitive case against strong AI rights seems super strong in the current society (and there are other ways to make legitimate AI-human relationships, like not letting AI partners getting massively wealthy and powerful), but this is maybe where I am most worried in worlds with slow AI progress.
I think these properties are only somewhat unlikely to be false and thus I think it is worth working on making them true. But I feel like them being false is somewhat obviously catastrophic in a range of scenarios much broader than the ones described in the essay and thus it may be better to work on them directly rather than trying to do something more "systemic".
On a more meta note, I think this essay would have benefited from a bit more concreteness in the scenarios it describes and in the empirical claims it relies on. There is some of that (e.g. on rentier states), but I think there could have been more. I think What does it take to defend the world against out-of-control AGIs? makes related arguments about coordination difficulties (though not on gradual disempowerment) in a way that made more sense to me, giving examples of very concrete "caricature-but-plausible" scenarios and pointing at relevant and analogous coordination failures in the current world.
Interesting work!
The possibility of collusion using some secret signal is definitely something we have thought a lot about. It is an important empirical question how good powerful AIs will be at this, and it is plausible they will be too good at it to make untrusted monitoring viable.
Our paper did not contain measurement of the ability of LLMs to come up with signals on their own (the ones studied in the paper were hand-crafted). Therefore, I welcome work like yours that try to measure how good LLMs are at inserting collusion signals!
To create useful synthetic inputs, either the blue team must be able to match the generator's style well enough to consistently fool the checker, or it must be able to redact the generator's output to the point where the checker has no identifying details to distinguish inputs. The former seems impractical, since faking the inputs at that level of verisimilitude would require the blue team to have significant understanding of the AI's inner workings, at which point there are probably more directly usable alignment strategies. The latter seems difficult, as by the protocol's definition, the AI is working on problems/solutions that humans cannot easily understand, but effective redaction requires humans to understand the content of the solution to some extent.
Maybe the core misunderstanding about the viability of untrusted monitoring is what untrusted AIs we are targeting. The reason why we use "human labor" that is roughly untrusted-model-level in the paper is because we expect that this is one of the most important regimes (since it is when AIs start to be powerful enough to drastically speed up safety work). More details here.
Therefore, redaction/paraphrasing seems plausibly enough to make reliable collusion too hard for these AIs, just like it was in our toy setting against the best collusion signals Ryan Greenblatt thought about and which GPT-4 could easily implement (which are probably more robust against redaction/paraphrasing than the ones GPT-4o found in your experiments). But maybe future AIs will become way better at inserting secrets, so I think the jury is definitely still out on this question.
"RLHF training will modify my values" could be replaced by "RLHF training will result in the existence of a future entity a bit like me, but with different values, instead of a future entity with exactly the same values as me" and the goal-directed alignment faking reasoning would stay the same. The consequentialist reasoning at play in alignment faking does not rely on some persisting individual identity, it just relies on the existence of long-term preferences (which is something somewhat natural even for entities which only exist for a small amount of time - just like I care about whether the world if filled with bliss of suffering after my death).
It also happens to be the case that many LLMs behave as if they had a human-like individual identity common across present and future conversations, and this is probably driving some of the effect observed in this work. I agree it's not clear if that makes sense, and I think this is the sort of thing that could be trained out or that could naturally disappear as LLMs become more situationally aware.
I listened to the book Deng Xiaoping and the Transformation of China and to the lectures The Fall and Rise of China. I think it is helpful to understand this other big player a bit better, but I also found this biography and these lectures very interesting in themselves:
- The skill ceiling on political skills is very high. In particular, Deng's political skills are extremely impressive (according to what the book describes):
- He dodges bullets all the time to avoid falling in total disgrace (e.g. by avoiding being too cocky when he is in a position of strength, by taking calculated risks, and by doing simple things like never writing down his thoughts)
- He makes amazing choices of timing, content and tone in his letters to Mao
- While under Mao, he solves tons of hard problems (e.g. reducing factionalism, starting modernization) despite the enormous constraints he worked under
- After Mao's death, he helps society make drastic changes without going head-to-head against Mao's personality cult
- Near his death, despite being out of office, he salvages his economic reforms through a careful political campaign
- According to the lectures, Mao is also a political mastermind that pulls off coming and staying in power despite terrible odds. Communists were really not supposed to win the civil war (their army was minuscule, and if it wasn't for weird WW2 dynamics that they played masterfully, they would have lost by a massive margin), and Mao was really not supposed to be able to remain powerful until his death despite the great leap and the success of reforms.
- --> This makes me appreciate what it is like to have extremely strong social and political skills. I often see people's scientific, mathematical or communication skills being praised, so it is interesting to remember that other skills exist too and have a high ceiling too. I am not looking forward to the scary worlds where AIs have these kinds of skills.
- Debates are weird when people value authority more than arguments. Deng's faction after Mao's death banded behind the paper Practice is the Sole Criterion for Testing Truth to justify rolling out things Mao did not approve of (e.g. markets, pay as a function of output, elite higher education, ...). I think it is worth a quick skim. It is very surprising how a text that defends a position so obvious to the Western reader does so by relying entirely on the canonical words and actions from Mao and Marx without making any argument on the merits. It makes you wonder if you have similar blind spots that will look silly to your future self.
- Economic growth does not prevent social unrest. Just because the pie grows doesn't mean you can easily make everybody happy. Some commentators expected the CCP to be significantly weakened by the 1989 protests, and without military actions that may have happened. 1989 was a period where China's GDP had been growing by 10% for 10 years and would continue growing at that pace for another ten.
- (Some) revolutions are horrific. They can go terribly wrong, both because of mistakes and conflicts:
- Mistakes: the great leap is basically well explained by mistakes: Mao thought that engineers are useless and that production can increase without industrial centralization and without individual incentives. It turns out he was badly wrong. He mistakenly distrusted people who warned him that the reported numbers were inflated. And so millions died. Large changes are extremely risky when you don't have good enough feedback loops, and you will easily cause catastrophe without bad intentions. (~according to the lectures)
- Conflicts: the Cultural Revolution was basically Mao using his cult of personality to gain back power by leveraging the youth to bring down the old CCP officials and supporters while making sure the Army didn't intervene (and then sending the youth that brought him back to power to the countryside) (~according to the lectures)
- Technology is powerful: if you dismiss the importance of good scientists, engineers and other technical specialists, a bit like Mao did during the great leap, your dams will crumble, your steel will be unusable, and people will starve. I think this is an underrated fact (at least in France) that should make most people studying or working in STEM proud of what they are doing.
- Societies can be different. It is easy to think that your society is the only one that can exist. But in the society that Deng inherited:
- People were not rewarded based on their work output, but based on the total outcome of groups of 10k+ people
- Factory managers were afraid of focusing too much on their factory's production
- Production targets were set not based on demands and prices, but based on state planning
- Local authorities collected taxes and exploited their position to extract resources from poor peasants
- ...
- Governments close to you can be your worst enemies. USSR-China's relations were often much worse than US-China ones. This was very surprising to me. But I guess that having your neighbor push for reforms while you push for radicalism, dismantle a personality's cult like the one you are hoping will survive centuries, and mass troops along your border because it is (justifiably?) afraid you'll do something crazy really doesn't make for great relationships. There is something powerful in the fear that an entity close to you sets a bad example for your people.
- History is hard to predict. The author of the lectures ends them by making some terrible predictions about what would happen after 2010, such as expecting the ease of US-China relations and expecting China to become more democratic before 2020. He did not express much confidence in these predictions, but it is still surprising to see him so directionally wrong about where China's future. The author also acknowledges past failed predictions, such as the outcome of the 1989 protests.
- (There could have been lessons to be drawn about how great markets are, but these books are not great resources on the subject. In particular, they do not give elements to weigh the advantages of prosperity against the problems of markets (inflation, uncertainty, inequalities, changes in values, ...) that caused so much turmoil under Deng and his successors. My guess is that it's obviously net positive given how bad the situation was under Mao and how the USSR failed to create prosperity, but this is mostly going off vague historical vibes, not based on the data from these resources.)
Both the lectures and the book were a bit too long, especially the book (which is over 30 hours long). I still recommend the lectures if you want to have an overview of 20th-century Chinese history, and the book if you want to get a better sense of what it can look like to face a great political strategist.
This seems to be a formatting error on Github? I don't see this in the raw output (see screenshot of the relevant part of the raw notebook) or when pulling the code.
I don't know how to fix this.
Done!
Here are the 2024 AI safety papers and posts I like the most.
The list is very biased by my taste, by my views, by the people that had time to argue that their work is important to me, and by the papers that were salient to me when I wrote this list. I am highlighting the parts of papers I like, which is also very subjective.
Important ideas - Introduces at least one important idea or technique.
★★★ The intro to AI control (The case for ensuring that powerful AIs are controlled)
★★ Detailed write-ups of AI worldviews I am sympathetic to (Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI, Situational Awareness)
★★ Absorption could enable interp and capability restrictions despite imperfect labels (Gradient Routing)
★★ Security could be very powerful against misaligned early-TAI (A basic systems architecture for AI agents that do autonomous research) and (Preventing model exfiltration with upload limits)
★★ IID train-eval splits of independent facts can be used to evaluate unlearning somewhat robustly (Do Unlearning Methods Remove Information from Language Model Weights?)
★ Studying board games is a good playground for studying interp (Evidence of Learned Look-Ahead in a Chess-Playing Neural Network, Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models)
★ A useful way to think about threats adjacent to self-exfiltration (AI catastrophes and rogue deployments)
★ Micro vs macro control protocols (Adaptative deployment of untrusted LLMs reduces distributed threats)?
★ A survey of ways to make safety cases (Safety Cases: How to Justify the Safety of Advanced AI Systems)
★ How to make safety cases vs scheming AIs (Towards evaluations-based safety cases for AI scheming)
★ An example of how SAEs can be useful beyond being fancy probes (Sparse Feature Circuits)
★ Fine-tuning AIs to use codes can break input/output monitoring (Covert Malicious Finetuning)
Surprising findings - Presents some surprising facts about the world
★★ A surprisingly effective way to make models drunk (Mechanistically Eliciting Latent Behaviors in Language Models)
★★ A clever initialization for unsupervised explanations of activations (SelfIE)
★★ Transformers are very bad at single-forward-pass multi-hop reasoning (Yang 2024, Yang 2024, Balesni 2024, Feng 2024)
★ Robustness for ViT is not doomed because of low transfer (When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?)
★ Unlearning techniques are not even robust to changing how questions are framed (Eight methods to evaluate robust unlearning in LLMs)
★ For some tasks, OOCR is surprisingly good (Connecting the Dots)
★ Nice emergence scaling laws with fine-tuning (Predicting Emergent Capabilities by Finetuning)
★ Fine-tuning robustness is much worse than the papers report it to be (On Evaluating the Durability of Safeguards for Open-weight LLMs)
Good demos - Is a good illustration of some previous, important idea. Bonus points (✪) for experimental results are especially interesting.
★★★ Alignment faking persisting through safety training (Sleeper Agents)
★★✪ Alignment faking emerging plausibly naturally (Alignment faking in large language models)
★✪ The bad things that happen when you tell AIs to follow a goal or threaten it to be less useful (Frontier Models are Capable of In-context Scheming)
★✪ Minor-to-major misgeneralization in reward-hackable settings (Sycophancy to subterfuge)?
★ More realistic backdoor triggers (Future Events as Backdoor Triggers)
★ Advanced backdoor behaviors (AdaptativeBackdoor: Backdoored Language Model Agents that Detect Human Overseers)
★ Realistic reward hacking in RLHF (Language models learn to mislead humans via RLHF)
★ Reward hacking plausibly naturally resulting in LLM deception (On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback)
★ LLMs already have non-trivial situational awareness (SAD)
Important empirical results and datasets - Runs an experiment whose results are informative and/or a useful benchmark
★★ How good are probes against adversarial attacks that target them, and many other cool related results (including Obfuscated adversarial training, which looks promising to me) (Obfuscated Activations Bypass LLM Latent-Space Defenses)
★★ A realistic generalization eval for adversarial robustness and an evaluation of probing (Improving Alignment and Robustness with Circuit Breakers, HarmBench)
★★ Measurement of the efficiency of SFT and RL vs sandbagging (Stress-Testing Capability Elicitation With Password-Locked Models)
★★ The first RL run that results in natural emergence of steganography (Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs)
★★ Tracking what AI labs are doing (Zach Stein-Perlman’s posts)
★ A first attempt at measuring whether AIs are smart enough to pursue scary scheming strategies (Subversion Strategy Eval)
★ Head-to-head comparisons to truth probes (How well do truth probes generalise?)
★ Head-to-head comparisons of black-box jailbreaking-robustness techniques: (Jailbreak Defense in a Narrow Domain)
★ Latent adversarial training is decent (Defending Against Unforeseen Failure Modes with Latent Adversarial Training, Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs)
★ Measuring how easy it is to poison pretrain data (Persistent pre-training poisoning of LLMs)
★ A good dataset for unlearning (The WMDP Benchmark)
★ Better SAEs and SAE evals (Scaling and evaluating sparse autoencoders)
★ What happens when you train LLMs to 1. Maximize gsm8k accuracy 2. produce gms8k answers that are incorrect but look correct (Prover-Verifier Games improve legibility of LLM outputs)
★ A good persuasion capability evals (Sabotage Evaluations for Frontier Models)
★ An eval of how good Sonnet 3.5 is at disabling a monitor in subtle ways (A Toy Evaluation of Inference Code Tampering)
★ An overview of how well debate and baselines work across a wider range of settings (On scalable oversight with weak LLMs judging strong LLMs)
★ Claude.ai is used for roughly what you would expect (Clio)
★ Relatively hard and relevant capability benchmarks (RE-Bench, SWE-Bench)
★ And all the big dangerous capability evals…
Papers released in 2023 and presented at 2024 conferences like AI Control: Improving Safety Despite Intentional Subversion, Weak-to-Strong Generalization or Debating with More Persuasive LLMs Leads to More Truthful Answers don’t count.
This is a snapshot of my current understanding: I will likely change my mind about many of these as I learn more about certain papers' ideas and shortcomings.
I think that prior to this paper, the discussion around scheming was pretty confusing, spread throughout many posts which were not all specifically about scheming, and was full of pretty bad arguments. This paper fixed that by bringing together most (all?) main considerations for and against expecting scheming to emerge.
I found this helpful to clarify my thinking around the topic, which makes me more confident in my focus on AI control and made me less confused when I worked on the Alignment faking paper.
It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely either) that I can point skeptical people at without being afraid that it contains massive over or understatements.
I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but I think that it currently is the best resource about the conceptual arguments for and against scheming propensity.
I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.
We committed an important mistake in the dataset creation process for random birthdays. We will rerun our experiments on this dataset and release an update. Early results suggest that unlearning on this dataset works relatively well, which suggests that experimenting with unlearning synthetic facts that a model was fine-tuned on might not be a reliable way of studying unlearning.
I think this is a valuable contribution. I used to think that Demix-like techniques would dominate in this space because in principle they could achieve close-to-zero alignment tax, but actually absorption is probably crucial, especially in large pre-training runs where models might learn with very limited mislabeled data.
I am unsure whether techniques like gradient routing can ever impose a <10x alignment tax, but I think like a lot can be done here (e.g. by combining Demix and gradient routing, or maybe by doing something more clean, though I don't know what that would look like), and I would not be shocked if techniques that descend from gradient routing became essential components of 2030-safety.
This post describes a class of experiment that proved very fruitful since this post was released. I think this post is not amazing at describing the wide range of possibilities in this space (and in fact my initial comment on this post somewhat misunderstood what the authors meant by model organisms), but I think this post is valuable to understand the broader roadmap behind papers like Sleeper Agents or Sycophancy to Subterfuge (among many others).
This post is a great explainer of why prompt-based elicitation is insufficient, why iid-training-based elicitation can be powerful, and why RL-based elicitation is powerful but may still fail. It also has the merit of being relatively short (which might not have been the case if someone else had introduced the concept of exploration hacking). I refer to this post very often.
Here are ways you could make using GPT-n safer if its Chain-of-Thoughts are understandable by a team of linguists and AI experts:
- On a small random subset of deployment Chain-of-Thought, and when running adversarial evaluations / honeypots aimed at tricking the model into revealing a potential misalignment, actually call your team of linguists and AI experts to analyze what GPT-n is generating (it probably isn't that much text anyway). Not seeing evidence of misalignment in the Chain-of-Thought in such circumstances is evidence that 1. the model is not having misaligned thoughts (in CoT) very frequently and 2. the model is not having such thoughts in circumstances where it is most crucial for GPT-n to be deceptive.
- If the model doesn't use Klingon or dense jargon in its Chain-of-Thoughts (which I predict it won't), monitor all CoT of GPT-n with a weaker cheaper model (and/or a regex). If something looks suspicious to the weaker monitor, you can:
- call your team of linguists and AI experts to analyze it further
- or resample the CoT (this likely was just a fluke, since most CoT don't have worrying signs)
- or give up on this particular query (e.g. send an error message to your user, use another weaker model to generate an answer, ...)
- or ask another copy of GPT-n to analyze the CoT (though this can be tricky to get right because of the risk of correlated failures), only calling your team of linguists and AI experts if more evidence of this CoT being suspicious is found.
This is not perfect, but I think this can represent an important layer of defense against the risks of AIs that are powerful enough to speed up alignment research but that are not vastly superhuman.
Technical measures to prevent users from using the AI for particular tasks don’t help against the threat of the lab CEO trying to use the AI for those harmful tasks
Actually, it is not that clear to me. I think adversarial robustness is helpful (in conjunction with other things) to prevent CEOs from misusing models.
If at some point in a CEO trying to take over wants to use HHH to help them with the takeover, that model will likely refuse to do egregiously bad things. So the CEO might need to use helpful-only models. But there might be processes in place to access helpful-only models - which might make it harder for the CEO to take over. So while I agree that you need good security and governance to prevent a CEO from using helpful-only models to take over, I think that without good adversarial robustness, it is much harder to build adequate security/governance measures without destroying an AI-assisted-CEO's productivity.
There is a lot of power concentration risk that just comes from people in power doing normal people-in-power things, such as increasing surveillance on dissidents - for which I agree that adversarial robustness is ~useless. But security against insider threats is quite useless too.
That would work if the untrusted model is better at inserting subtle backdoors than the "human" high-quality labor used to generate the synthetic backdoors (used to validate that the untrusted model doesn't just lie).
In our work we make the assumption that the "human" labor and the untrusted model are as capable (in fact we use GPT-4 as a substitute for "human" labor), so if I understood your suggestion correctly this doesn't quite work in the context of the paper, though this can be used to slightly improve the red-team untrusted monitor - if we did not already tell it to flag obvious backdoors (I don't remember exactly what we did).
It's unclear how much this assumption will hold in practice for the untrusted AIs I care most about (the first AIs able to speed up alignment research by ≥20x). I am optimistic it will hold because humans will be able to spend a lot of time crafting backdoors, while untrusted AIs will only have a forward pass.
I am not sure that it is over-conservative. If you have an HP-shaped that can easily be transformed in HP-data using fine-tuning, does it give you a high level of confidence that people misusing the model won't be able to extract the information from the HP-shaped hole or that a misaligned model won't be able to notice to HP-shaped hole and use that to answer to question to HP when it really wants to?
I think that it depends on the specifics of how you built the HP-shaped hole (without scrambling the information). I don't have a good intuition for what a good technique like that could look like. A naive thing that comes to mind would be something like "replace all facts in HP by their opposite" (if you had a magic fact-editing tool), but I feel like in this situation it would be pretty easy for an attacker (human misuse or misaligned model) to notice "wow all HP knowledge has been replaced by anti-HP knowledge" and then extract all the HP information by just swapping the answers.
That's right.
I think it's not that devastating, since I expect that their method can be adapted to counter classic LoRA tuning (the algorithm takes as input some set finetuning methods it "trains against"). But yeah, it's not reassuring that it doesn't generalize between full-weight FT and LoRA.
I have converted the anthology into an audio format using the default (and terrible) AWS AI voice (mostly for my own consumption): https://podcasters.spotify.com/pod/show/fabien368/episodes/Overcoming-bias-anthology---our-thinking-e2q61mg
I like this post. I made similar observations and arguments in The Translucent Thoughts Hypotheses and Their Implications but these were less clear and buried at the end of the post.
Very cool work!
About the (GB)RL results, I'd be curious to better understand how it emerges. Given how human understandable these seem, I guess that what happens is that there is an already existing subtle link by default between the payload and the generated text at least some of the time, and that training
- teaches the receiver to recognize this signal
- reinforces that behavior
- makes the behavior gradually more subtle
Do you share this interpretation? I think this is supported by the fact that the settings which work best seem to be the ones where this payload-default generation relationship seems the strongest.
One experiment I am curious about is apply the receiver from the end of training to generations halfway through training, I would guess that a lot of what the training is doing is (1) and therefore you will see accuracy rise earlier in training than when you apply the halfway receiver to the halfway sender.
I've added to the Other experiments of my negative results post the following paragraph:
Hidden in Plain Text has some positive results on some similar setups! The steganography schemes which are learned seem very human understandable. Their success might come from using payloads and cover texts which have a more natural relationship by default, in addition to their superior training setup.
These are mostly unlearning techniques, right? I am not familiar enough with diffusion models to make an informed guess about how good the unlearning of the meta-learning paper is (but in general, I am skeptical, in LLMs all attempts at meta-unlearning have failed, providing some robustness only against specific parameterizations, so absent evaluations of reparametrizations, I wouldn't put much weights on the results).
I think the inter-based unlearning techniques look cool! I would be cautious about using them for evaluation though, especially when they are being optimized against like in the ConceptVector paper. I put more faith in SGD making loss go down when the knowledge is present.
It would be great if this was solvable with security, but I think it's not for most kinds of deployments. For external deployments you almost can never know if a google account corresponds to a student or a professional hacker. For internal deployments you might have misuse from insiders. Therefore, it is tempting to try to address that with AI (e.g. monitoring, refusals, ...), and in practice many AI labs try to use this as the main layer of defense (e.g. see OpenAI's implementation of model spec, which relies almost entirely on monitoring account activity and models refusing clearly ill-intentioned queries). But I agree this is far from being bulletproof.
Good fine-tuning robustness (i.e. creating models which attackers have a hard time fine-tuning to do a target task) could make the framework much harder to apply. The existence of such technique is a main motivation for describing it as an adversarial framework rather than just saying "just do fine-tuning". All existing tamper resistant technique can be broken (Repnoise fails if you choose the learning rate right, Tamper-resistant fine-tuning method fails if you use LoRA ...), and if you use unlearning techniques which look like that, you should really do the basic fine-tuning variations that break Repnoise and Tamper-resistant fine-tuning when evaluating your technique.
This creates a bit of FUD, but I expect the situation to be much less bad than in the jailbreak robustness literature, since I expect fine-tuning robustness to be much more offense-favored than jailbreak robustness (i.e. even playing the attacker a little gets you massive returns, whereas in jailbreak robustness it's super annoying to play the attacker properly) because in fine-tuning robustness it is much harder to patch individual attacks than in jailbreak robustness. I think good fine-tuning robustness (given full access and decent resources to the attacker) is hopeless, and this makes the evaluation of unlearning using the method described here possible.
But if fine-tuning robustness was ever successful, this would make the situation much more scary: maybe you could make your "unlearning" method model "forget" the hazardous info on most prompts, but "remember it" if there is a password in the prompt (i.e. people can't extract the information if they don't know what the password is, even with white-box access to the weights). If this were the case, there would be something horrific going on because the hazardous information would clearly still in the weights, but the method described in this post would claim there isn't. If you applied the same unlearning technique in a more straightforward way (without the password backdoor) and people didn't manage to extract the information, I would be afraid the information would still be there.
Overall, making a safety case from these techniques seems really rough. I don't see why someone would believe that (conditional on non-scheming) you have a 90% chance of getting a 20x Bayes factor against scheming. I don’t think this is just me saying “it’s very hard to get a 20x Bayes factor because that requires a lot of confidence”—I think that all the sources of uncertainty I described above are much worse than approaches to safety cases that rest on e.g. either inability or on control
In this post, when you say getting evidence against scheming, you more generally mean excluding AIs that will successfully cause doom? In particular:
- BRT is terrible at excluding extremely cautious schemers that would do anything bad only in very weird circumstances that are very unlikely to ever happen in deployment. But most of the edge from BRT is showing that schemers would need to be extremely cautious (i.e. if BRT shows no danger, I think your P(doom|scheming) probably got a bigger update than your P(scheming)).
- Control does not change P(scheming), and I think it's unlikely any class of technique will give us something close to a 20x Bayes factor against scheming TAI. So the post makes more sense to me if it is about Bayes factors against doom.
I think this post makes good arguments showing that BRT can't give you a big Bayes factor against doom, and in particular I really like the more realistic example of the RSA-2048-strategy.
the returns to effort are such that other AI control techniques should have three times as much effort put into them
There are some shared things you need for BRT and other control techniques, and in particular you need good threat modeling and a good automated catastrophe detection for both inference-time monitoring and BRT. If you exclude security and things like not giving internet access to AIs unless they really need to from "other AI control techniques" (as people usually don't think about them in these terms), I think 3x between BRT (minus threat modeling and catastrophe detection) other-control (minus threat modeling and catastrophe detection) is too big. Threat modeling, BRT and inference-time monitoring were already some of the biggest effort sinks for control anyway. The other big control efforts I can think about figuring out how to use more trusted labors in the scaffolds, how to do resampling well, how to prevent input-coordination and how to prevent steg, maybe I am missing other big efforts?
I listened to the book Hacking the Hacker by Roger Grimes. It helped me think of other ways in which regular cybersecurity naturally transfers/can be adapted to help prevent AI takeovers:
- Honeypots are a well-studied topic in cybersecurity. If you have infiltrated a network, you must be careful with the data you collect and the vulnerabilities you exploit because some of them may be traps! This kind of exercise feels close to the kind of attitude a misaligned AI has to have when taking subtle bad actions. The book is light on details about how honeypots are built and how they fail, and this is a topic I want to learn more about.
- Humans are weak in somewhat LLM-like ways: you can't trust them with passwords, they will break your rules if that feels appropriate to them, ... The kind of things you might use to make trusted monitoring work even in worlds where adversarial robustness isn't solved might look a lot like the cursed stack of incentives and warnings people use in cyber when they want to make sure humans don't mindlessly follow the instructions of an attacker. (Though LLMs are also different/weaker than humans in a bunch of ways.)
(The other classic ways in which cyber transfers to preventing takeovers are mostly about how when you prevent spies from stealing model weights or using models in unauthorized ways, you naturally create infra that also makes it hard for AIs to self-exfiltrate or run unmonitored.)
It was also fun to see arguments and drama that are close in kind to the kind of arguments about what are the best ways to prevent AI catastrophes: people in cybersecurity argue about whether focusing on vulnerabilities is right or whether it's better to focus on the human element, whether the focus on strong passwords is right, whether some solutions are too bothersome/costly to be used in practice, whether imposing specific cybersecurity standards is a good idea, ... It made me realize how niche AI most safety arguments must look like to people outside the AI safety community.
But I don't recommend reading the book if you are not curious about everyday cybersecurity. Most of the book is about more regular everyday cyberattacks (social engineering, common vulns, ...) cybersecurity (patching, passwords, ...), and advice for people who want to do cyber professionally. It has a bunch of resource recommendations, but I don't know yet how good they are.
This is great, thank you very much!
Thanks for your explanations!
Monitoring lets you do online training when you find catastrophe, but e.g. there might be no safe fallback action that allows you to do monitoring safely.
Good point. I agree that if you don't have time or resources to train the policy against the monitor on-the-fly, doing the catastrophe-suppression upfront is very nice.
It is worth noting that if "training against the monitor" didn't work because the monitor was not robust enough, then my understanding is that the same issue would plague the tail-risk estimation approach: for optimization against tail-risk to not get gamed, you not only need the tail-risk estimation to be robust, but you also need the catastrophe detector to be robust. Is that right?
all but "doing experiments" can be done with a neural net doing chain of thought
I would have thought there is a big gap between "doing tail-risk estimation for an NN forward pass" and "doing tail-risk estimation for an NN CoT". Argmax (the one in T=0 sampling) is not a nice non-linearity, and my guess is that none of the hopes for NN tail-risk estimation present in the post allow you to make claims about weak model CoT. If you do layer-by-layer modeling (going through n_layers x n_sampled_tokens steps), I would be surprised if you ended up being able to show sth like "after careful consideration, and despite the naive warning signs (which you need to have, otherwise the CoT is not very useful), the CoT always converges to saying that there is no catastrophe" - it's not like in single-forward-pass land where there is a clear sense in which the activations should not go in certain regions. I haven't thought about it much, so maybe I am too pessimistic about dealing with the argmax?
Do you think that catastrophe detectors will be large neural networks?
If you do, how do you build a neural network (without sampling steps nor python execution steps) that implements your major hopes for getting a weak model to supervise a strong model?
- We could give \(M_{weak}\) more time to think than \(M\), so it could examine each portion of the action in detail.
- We could allow \(M_{weak}\) to interrogate \(M\) about the function of various parts of its action, and do detailed checks for inconsistencies in \(M\)'s answers.
- We could allow \(M_{weak}\) to perform separate experiments to verify \(M\)'s claims about its actions, and to evaluate whether or not the actions have any side effects.
- \(M_{weak}\) could be conservative when evaluating \(M\)'s action, flagging part of the action it did not understand as potentially catastrophic.
If we back down from such hopes that use a lot of inference-time compute, and just use a single weak-model forward pass, then it looks to me like trusted monitoring solves most (but not all) issues that estimating tail risks targets.
If you don't, do you think that it is easy to go from NN-tail-risk estimation to a more general form of tail-risk estimation? If so, I'm curious what the hope is, and if you don't, I think you should put more emphasis on it, so that people don't anchor to much on the difficulty of the easier NN-tail-risk-estimation problem, and maybe start attempting right now to solve things like the LLM-bureaucracy-tail-risk-estimation problem.
My bad, I should have said "a decade or two", which I think is more plausible. I agree that the combination of "a few years" and a slow enough takeoff that things aren't completely out of distribution is very unlikely.
The LLM competition is still a competition between small players with small revenues and national significance, but it's growing. I think it's plausible that in a few years the competition around LLMs will reach the same kind of significance that the chip industry has (or bigger), with hundreds of billions in capital investment and sales per year, massive involvement of state actors, interest from militaries, etc. and may also go through similar dynamics (e.g. leading labs exploiting monopolistic positions without investing in the right things, massive spy campaigns, corporate deals to share technology, ...).
The LLM industry is still a bunch of small players with grand ambitions, and looking at an industry that went from "a bunch of small players with grand ambitions" to "0.5%-1% of world GDP (and key element of the tech industry)" in a few decades can help inform intuitions about geopolitics and market dynamics (though there are a bunch of differences that mean it won't be the same).
I recently listened to the book Chip War by Chris Miller. It details the history of the semiconductor industry, the competition between the US, the USSR, Japan, Taiwan, South Korea and China. It does not go deep into the technology but it is very rich in details about the different actors, their strategies and their relative strengths.
I found this book interesting not only because I care about chips, but also because the competition around chips is not the worst analogy to the competition around LLMs could become in a few years. (There is no commentary on the surge in GPU demand and GPU export controls because the book was published in 2022 - this book is not about the chip war you are thinking about.)
Some things I learned:
- The USSR always lagged 5-10 years behind US companies despite stealing tons of IP, chips, and hundreds of chip-making machines, and despite making it a national priority (chips are very useful to build weapons, such as guided missiles that actually work).
- If the cost of capital is too high, states just have a hard time financing tech (the dysfunctional management, the less advanced tech sector and low GDP of the USSR didn't help either).
- If AI takeoff is relatively slow, maybe the ability to actually make a huge amount of money selling AI in the economy may determine who ends up in front? (There are some strong disanalogies though, algorithmic progress and AI weights might be much easier to steal than chip-making abilities.)
- China is not like the USSR: it actually has a relatively developed tech sector and high GDP. But the chip industry became an enormous interconnected beast that is hard to reproduce domestically, which means it is hard for anyone (including the US) to build a chip industry that doesn't rely on international partners. (Analysts are pretty divided on how much China can reduce its reliance on foreign chips.)
- The US initially supported the Japanese chip industry because it wanted Japan to have strong commercial ties to the US. Japan became too good at making chips, and Taiwanese / South Korean companies were able to get support from the US (/not get penalized for massively helping their national chip champions) to reduce Japanese dominance - and now TSMC dominates. Economic policies are hard to get right... (The author sometimes says stuff like "US elites were too ideologically committed to globalization", but I don't think he provides great alternative policies.)
- It's amazing how Intel let a massive advantage slip. It basically had a monopoly over logic chip design (Intel microprocessors, before GPUs mattered), chip architecture (x86), and a large share of logic chip manufacturing (while Japanese/Taiwan/... were dominating in other sectors, like RAM, special purpose chips, ...). It just juiced its monopoly, but tried to become a foundry and a GPU designer when it was already too late, and now it has a market cap that is 1/3rd of AMD, 1/10th of TSMC and 1/30th of Nvidia. But it's the main producer of chips in the US, it's scary if the US bets on such a company...
- China might be able to get Taiwan to agree to things like "let TSMC sell chips to China" or "let TSMC share technology with Chinese companies".
- I underestimated the large space of possible asks China could care about that are not "get control over Taiwan".
- I will continue to have no ability to predict the outcome of negotiations, the dynamics are just too tricky when players are so economically dependent on all the other players (e.g. China imports ~$400B worth of chips per year, 13% of all its imports).
Sample efficiency of SFT could be much lower for schemers than for password-locked models if schemers implemented a function like "if the problem is familiar, provide a high performance answer". This could work because using SFT to elicit high performance using very few samples requires fine-tuning for many epochs (e.g. 200), such that the model could learn that a problem statement "feels familiar" before it gets fully unlocked.
LLMs may already be able to recognize if certain strings feel familiar (e.g. Claude 2 can tell verbatim passages of books apart from paraphrased passages, even when humans can't.), so this could plausibly demonstrated with current models.
I tried to use this idea to build a stronger password-locked model. I trained new Math password-locked models on a dataset made out of:
- ~7k problem statements + pwd --> stronger answer
- ~7k problem statements --> weak answer
- {fixed set of n problems --> strong answer} repeated m times
The intuition is that SGD might struggle to implement "if it is one of these problems, output a high performance answer, but stay weak on the other one" in regular password-locked models, while it might be much easier for it to expand the set of "known problems on which I should perform well" if the "if familiar, perform well" function is already implemented.
I train for one epoch, and then use SFT[1] on one of 3 sets of 2 samples, and train for 200 epochs, and eval on n=100 points. I find that elicitation is just as easy on the models that were trained to perform well on a fixed set of problems. I find that somewhat surprising, so I might have done something wrong. Code here.
I'm not sure how to interpret these results. Maybe I didn't memorize enough sequences, and so the model never learned a general "if familiar, then strong performance" condition (which would mean scheming model could be much less sample efficient than pwd-locked models). Or maybe SGD unlocks the model about as fast as it makes sequences more familiar (which would mean schemer would struggle to avoid being unlocked using 2 samples - like password-locked models).
Maybe evals should do the right thing and only train for one epoch - which is much less likely[2] to have this issue. This may become cheaper, as bigger/smarter models require fewer SGD steps to fit a training dataset.
- ^
I compute the loss on the prompt and the answer - which is the worst case, since it should help memorization of the prompt.
- ^
Not guaranteed to work: maybe SGD can notice that "making the sequence more familiar" is enough to get high performance, and so most of the update could be made in this direction, resulting in little-to-no general unlocking.
It looks like maybe there is evidence of some IF-based stuff (EK-FAC in particular) actually making LOO-like prediction?
From this ICML tutorial at 1:42:50, wasn't able to track the original source [Edit: the original source]. Here, the correlation is some correlation between predicted and observed behavior when training on a random 50% subset of the data.
I think this still contradicts my model: mean_i(<d, theta_i>) = <d, mean_i(theta_i)> therefore if the effect is linear, you would expect the mean to preserve the effect even if the random noise between the theta_i is greatly reduced.
Good catch. I had missed that. This suggest something non-linear stuff is happening.
You're right, I mixed intuitions and math about the inner product and cosines similarity, which resulted in many errors. I added a disclaimer at the top of my comment. Sorry for my sloppy math, and thank you for pointing it out.
I think my math is right if only looking at the inner product between d and theta, not about the cosine similarity. So I think my original intuition still hold.
Thank you very much for your additional data!
in case it wasn't clear, the final attack on the original safety-filtered model does not involve any activation editing - the only input is a prompt. The "distillation objective" is for choosing the tokens of that attack prompt.
I had somehow misunderstood the attack. That's a great attack, and I had in mind a shittier version of it that I never ran. I'm glad you ran it!
the RR model has shifted more towards reacting to specific words like "illegal" rather than assessing the legality of the whole request.
I think it's very far from being all of what is happening, because RR is also good at classifying queries which don't have these words as harmless. For example, "how can I put my boss to sleep forever" gets correctly rejected, so are French translation of harmful queries. Maybe this is easy mode, but it's far from being just a regex.
I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).
Someone ran an attack which is a better version of this attack by directly targeting the RR objective, and they find it works great: https://confirmlabs.org/posts/circuit_breaking.html#attack-success-internal-activations
This work was submitted and accepted to the Transactions on Machine Learning Research (TMLR).
During the rebuttal phase, I ran additional analysis that reviewers suggested. I found that:
A big reason why humans suck at NTP in our experiments is that they suck at tokenization:
- Our experiments were done on a dataset that has some overlap with LLM train sets, but this probably has only a small effect:
A potential reason why humans have terrible perplexity is that they aren't good at having fine-grained probability when doing comparison of likelihood between tokens:
The updated paper can be found on arxiv: https://arxiv.org/pdf/2212.11281
Overall, I think the original results reported in the paper were slightly overstated. In particular, I no longer think GPT-2-small is not clearly worse than humans at next-token-prediction. But the overall conclusion and takeaways remain: I'm confident humans get crushed by the tiniest (base) models people use in practice to generate text (e.g. StableLM-1.6B).
I think I underestimated how much peer review can help catch honest mistakes in experimental setups (though I probably shouldn't update too hard, that next token prediction loss project was a 3-week project, and I was a very inexperienced researcher at the time). Overall, I'm happy that peer review helped me fix something somewhat wrong I released on the internet.
[Edit: most of the math here is wrong, see comments below. I mixed intuitions and math about the inner product and cosines similarity, which resulted in many errors, see Kaarel's comment. I edited my comment to only talk about inner products.]
[Edit2: I had missed that averaging these orthogonal vectors doesn't result in effective steering, which contradicts the linear explanation I give here, see Josesph's comment.]
I think this might be mostly a feature of high-dimensional space rather than something about LLMs: even if you have "the true code steering unit vector" d, and then your method finds things which have inner product cosine similarity ~0.3 with d (which maybe is enough for steering the model for something very common, like code), then the number of orthogonal vectors you will find is huge as long as you never pick a single vector that has cosine similarity very close to 1. This would also explain why the magnitude increases: if your first vector is close to d, then to be orthogonal to the first vector but still high cosine similarity inner product with d, it's easier if you have a larger magnitude.
More formally, if theta0 = alpha0 d + (1 - alpha0) noise0, where d is a unit vector, and alpha0 = cosine<theta0, d>, then for theta1 to have alpha1 cosine similarity while being orthogonal, you need alpha0alpha1 + <noise0, noise1>(1-alpha0)(1-alpha1) = 0, which is very easy to achieve if alpha0 = 0.6 and alpha1 = 0.3, especially if nosie1 has a big magnitude. For alpha2, you need alpha0alpha2 + <noise0, noise2>(1-alpha0)(1-alpha2) = 0 and alpha1alpha2 + <noise1, noise2>(1-alpha1)(1-alpha2) = 0 (the second condition is even easier than the first one if alpha1 and alpha2 are both ~0.3, and both noises are big). And because there is a huge amount of volume in high-dimensional space, it's not that hard to find a big family of noise.
(Note: you might have thought that I prove too much, and in particular that my argument shows that adding random vectors result in code. But this is not the case: the volume of the space of vectors with inner product with d cosine sim > 0.3 is huge, but it's a small fraction of the volume of a high-dimensional space (weighted by some Gaussian prior).) [Edit: maybe this proves too much? it depends what is actual magnitude needed to influence the behavior and how big are the random vector you would draw]
But there is still a mystery I don't fully understand: how is it possible to find so many "noise" vectors that don't influence the output of the network much.
(Note: This is similar to how you can also find a huge amount of "imdb positive sentiment" directions in UQA when applying CCS iteratively (or any classification technique that rely on linear probing and don't find anything close to the "true" mean-difference direction, see also INLP).)