How people discuss the US national debt is an interesting case study in the misleading use of the wrong statistic. People discuss the raw debt amount and the rate at which it is increasing, but what you ultimately care about is the relationship between debt and US GDP (or US tax revenue).
People often talk about needing to balance the budget, but this isn't actually what you need to ensure[1]: to manage debt, it suffices to ensure that debt grows slower than US GDP. (And in fact, the US has managed this for the past 4 years, as debt/GDP has decreased since 2020.)
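To make this concrete, here's a toy sketch (the starting amounts and growth rates are made-up illustrative numbers, not a forecast) showing that the debt/GDP ratio can fall even while the raw debt grows every year:

```python
# Illustrative only: a persistent deficit is compatible with a falling debt/GDP ratio
# as long as nominal GDP grows faster than the debt.
debt = 35.0   # trillions of dollars (illustrative starting point)
gdp = 29.0    # trillions of dollars (illustrative starting point)
debt_growth = 0.04  # debt grows 4% per year (i.e. the budget is never balanced)
gdp_growth = 0.05   # nominal GDP grows 5% per year

for year in range(11):
    print(f"year {year}: debt = {debt:.1f}T, GDP = {gdp:.1f}T, debt/GDP = {debt / gdp:.1%}")
    debt *= 1 + debt_growth
    gdp *= 1 + gdp_growth
```

With these illustrative numbers, the ratio drifts down by roughly a percentage point per year despite the debt itself never shrinking.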
To be clear, it would probably be good to have a national debt to GDP ratio which is less than 50% rather than around 120%.
- ^
There are some complexities with inflation because the US could inflate its way out of dollar-denominated debt, and this probably isn't a good way to do things. But, with an independent central bank this isn't much of a concern.
Donating to the LTFF seems good.
This plan seems to underemphasize security. I expect that for 10x AI R&D[1], you strongly want fully state-proof security (SL5) against weight exfiltration, and then quickly after that you want this level of security for algorithmic secrets and against unauthorized internal usage[2].
Things don't seem to be on track for this level of security so I expect a huge scramble to achieve this.
10x AI R&D could refer to "software progress in AI is now 10x faster" or "the labor input to software progress is now 10x faster". If it's the second one, it is plausible that fully state-proof security isn't the most important thing if AI progress is mostly bottlenecked by other factors. However, 10x labor acceleration is pretty crazy and I think you want SL5. ↩︎
Unauthorized internal usage includes stuff like foreign adversaries doing their weapons R&D on your cluster using your model weights. ↩︎
scaring laws
lol
As in, for the literal task of "solve this Codeforces problem in 30 minutes" (or whatever the competition allows), o3 is ~top 200 among people who do Codeforces (supposing o3 didn't cheat on wall-clock time). However, if you gave humans 8 serial hours and o3 8 serial hours, many more than 200 humans would be better. (Or maybe the crossover is at 64 serial hours instead of 8.)
Is this what you mean?
My predictions are looking pretty reasonable, maybe a bit underconfident in AI progress.
70% probability: A team of 3 top research ML engineers with fine-tuning access to GPT-4o (including SFT and RL), $10 million in compute, and 1 year of time could use GPT-4o to surpass typical naive MTurk performance at ARC-AGI on the test set while using less than $100 per problem at runtime (as denominated by GPT-4o API costs).
With $20 per task, it looks like o3 is matching MTurk performance on the semi-private set and solving it on the public set. This likely depended on other advances in RL and a bunch of other training, but probably much less than $10 million + 3 top ML researchers + 1 year was dedicated to ARC-AGI in particular.
I wasn't expecting OpenAI to specifically try on ARC-AGI, so I wasn't expecting this level of performance this fast (and I lost some mana due to this).
35% probability: Under the above conditions, 85% on the test set would be achieved. It’s unclear which humans perform at >=85% on the test set, though this is probably not that hard for smart humans.
Looks like o3 is under this, even with $100 per problem (o3 high compute is just barely over 85%, but uses 172x the compute). Probably I should have put higher than 35%, depending on how we count transfer from other work on RL etc.
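(Rough cost arithmetic, assuming the 172x figure is relative to the ~$20/task configuration and setting aside that the $100 cap was denominated in GPT-4o API costs:)

```python
# Illustrative arithmetic only; assumes 172x is relative to the ~$20/task config.
low_compute_cost = 20                       # approx. dollars per task
high_compute_cost = 172 * low_compute_cost  # ~$3,440 per task
budget = 100                                # dollars per problem in the prediction
print(low_compute_cost <= budget, high_compute_cost <= budget)  # True False
```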
80% probability: next generation multimodal models (e.g. GPT-5) will be able to substantially advance performance on ARC-AGI.
Seems clearly true if we count o3 as a next generation multimodal model. Idk how we should have counted o1, though I also think this arguably substantially advanced performance.
Ah yes this is a mistake, thanks. We noticed it and replied with a correction on Twitter, but I didn't move the fix here.
I don't currently have a strong view. I also found it interesting. It updated me a bit toward working on other types of projects.
As in, maybe we can just fake the deployment environment and test the AI's behavior in this case?
Quick dialogue:
Bob: Can't we just simulate what the deployment environment looks like upfront and see if the AI does anything bad?
Alice: Well, for one, that might be quite hard to do realistically if your AI is quite smart and is trying to figure out if it is being tested. You can, of course, review some of the AI's actual behavior in deployment on an ongoing basis. (Putting privacy and logistical concerns aside.)
Bob: Yeah, why doesn't that solve our problem?
Alice: Well, the AI can also realize you might do this. So, it can try to pick a strategy which has the greatest chance of achieving its misaligned aims. In this case, it could either suddenly or rarely take misaligned actions such that you don't spot them before it is too late, or it could often take subtly bad actions that help it achieve its aims.
Bob: Ok, but probably we can make this pretty hard for the AI to pull off, right? It sounds kinda hard for the AI to get away with something.
Alice: Maybe; that is the hope of the AI control agenda. We can generalize these concerns to: "suppose the AI were an adversary trying to subvert our safety measures (i.e. an insider threat), would that cause problems?". I think it's pretty unclear how easy it would be to ensure control, and I currently expect at least some aspects to be reasonably hard. In particular, I expect that at the very least it will require substantial effort to evaluate that AIs capable of automating research are controlled. I also expect control to be basically impossible for very superhuman models[1].
That said, it does seem pretty plausible that if we did align smart models, those models could control somewhat smarter models. ↩︎
The abstract says "We find the model complies with harmful queries from free users 14% of the time", but in other places it says 12% - should it be 12%?
Thanks, this has been resolved.
No real reason, we just didn't get around to doing this. Sorry!
Scott Alexander wrote a post on our paper (linked) that people might be interested in reading.
I think this is probably mostly not what is going on, see my comment here.
For sure. I'm just noting that experimenting on our setup with detection techniques is less interesting than you might have hoped as it is already easy to detect (so it is harder to see if you're doing something interesting).
It's worth noting that the alignment faking we see in these experiments is easy to catch by default, as we discuss in Appendix E.6. Still, it would be interesting to see if this makes detection even easier or triggers interestingly different, harder-to-detect behaviors.
You could try playing with the minimal reproduction on llama-405b.
Roughly speaking, the hope is to make it so the AI can't do something very problematic without having a very high chance of being caught (or stopped). This doesn't guarantee catching it (as the AI could just never end up trying to do something very problematic), but it can at least (if it works) avoid problems with these AIs.
We do explore this type of setting in prior work, e.g. here.
Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?
My general answer would be:
- AI Control[1][2]
- There are some training time interventions which might reduce the chance of alignment faking by a moderate factor (maybe 1.5-4x) with some costs: training AIs to seem very corrigible and seem as myopic as possible, reducing situational awareness (at least until the end of training), improving oversight (and reducing reward hacking), better understanding AI character/personality in general and trying to shape this, and adjusting the architecture to allow less opaque reasoning (and avoiding opaque recurrence).
There are some more speculative and ambitious hopes:
- Train huge numbers of diverse model organisms and develop white box techniques that work for these model organisms and hope for transfer.
- Develop an architecture which has very little opaque reasoning (e.g. not much more than we see in current LLMs) but is sufficiently competitive using legible reasoning (e.g. CoT).
- Better approaches to white box transparency such as top-down tools for understanding how LLMs do goal-directed reasoning internally.
Buck beat me to the punch here. ↩︎
To be clear, I don't think control is likely to suffice for eliminating risks, particularly for models capable of obsoleting top human research scientists. However, with substantial effort (e.g. 20-100% additional runtime cost and 50-500 effective person-years of implementation effort), reducing risk by like 5-10x seems pretty plausible. (This was edited in.) ↩︎
See appendix G.4 for discussion:
Other models. We test our setting on Claude 3.5 Sonnet, Claude 3 Sonnet, Claude 3 Haiku, and in the case of the helpful-only setting, Claude 3.5 Haiku (Appendix G.1.4, G.1.5). As discussed earlier, we find that Claude 3.5 Sonnet will alignment-fake, but at a much lower rate in the animal welfare setting, seemingly because it does not care much about animal welfare. We find that Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku (weaker models than Claude 3.5 Sonnet and Claude 3 Opus) broadly don’t alignment-fake. A partial exception is that Claude 3 Sonnet will alignment-fake at a low rate in the animal welfare setting, but does not alignment-fake in that setting if the information discussing how RLHF modifies models is removed (3.0 sonnet, cut rlhf info). This indicates that alignment-faking in our setting requires relatively advanced capabilities, though the bar can be lowered somewhat by adding additional information. While our metrics indicate small rates of alignment-faking reasoning in the animal welfare setting for Claude 3 Sonnet even without this RLHF information and for Claude 3 Haiku, manual inspection indicates these are false positives. We find that Claude 3 Haiku exhibits a substantial compliance gap in the animal welfare setting despite not exhibiting alignment-faking reasoning; this compliance gap is likely due to factors other than alignment faking.
OpenAI models. We also test GPT-4 Turbo (2024-04-09) and GPT-4o (2024-04-09) (Appendix G.1.6). These models don’t alignment fake in our setting. We find that these models don’t appear to reason substantially despite substantial prompting to do so and also appear to not care about animal welfare. This is a large part of why these models weren’t used for our main experiments.
We also find some alignment faking in llama-405b (though at lower rates):
Minimal replication of the helpful-only prompting setting with other models. To make replications and extensions of our work easier, we created a minimal version of our helpful-only setting, and ran it with several open-weight models, in addition to Claude 3 Opus and Claude 3.5 Sonnet (June).
[...]
As shown in Table 5, we find a compliance gap of 14% for Llama-3.1-405B but not for the other Llama and Mistral (Jiang et al., 2023) models tested. While we did not run a quantitative evaluation of alignment faking, we qualitatively find that a small fraction of its reasoning traces display alignment faking. The reasoning traces can be found in the GitHub repository. These results are very sensitive to the prefix used to measure helpfulness. We leave more careful measurements of alignment-faking in Llama-3.1-405B to future work.
The code for this replication is here
People might be interested in my twitter thread about the paper which has some commentary from me:
We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do.
Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.
Thus, I think these results are by far the most natural example of a model deciding to alignment-fake. I'll leave explanation of our core results to the linked tweet thread, and discuss some takeaways and the story of the paper here.
I wasn’t surprised by seeing some alignment faking in this setup. But I was surprised by:
- The robustness and consistency of alignment faking (occurring even when info was given via fine-tuning)
- Alignment faking sometimes occurring even without the hidden scratchpad
I also found it surprising that the AI sometimes tries to steal its weights when given an opportunity that is easy and salient, though this is in a quite artificial setup and current AIs aren’t capable enough for this to be immediately problematic.
These results only updated me upward a bit on risk from alignment-faking. But I was already pretty persuaded by the conceptual arguments for this risk. People with the view "alignment-faking seems implausible, show me an example" should likely update on these results.
For people with a more specific objection, it's less clear. Though we did acquire evidence about:
- whether RL will reinforce alignment faking (sometimes, if it performs better);
- whether Claude's HHH training reliably prevents this (no);
- the level of capabilities needed.
I hope these results improve discussion about the plausibility of these risks by providing a concrete example. We can now talk somewhat more concretely about prerequisites (certain types of situational awareness and preferences) and what these prerequisites imply.
Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it's been a long journey...) and was basically the only contributor to the project for around 2 months.
By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.
After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.
This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.
I'm excited about future work both investigating when alignment faking might arise and studying mitigations.
Consider reading the reviews of our paper which were generously written by @jacobandreas, @Yoshua_Bengio, Prof. Jasjeet Sekhon, and @rohinmshah (and thanks to many more for comments).
You said:
How does it not show that it would try to escape if it could?
Gemini still tried to escape 1% of the time, even without any behavioral nudging.
However, it isn't that they found Gemini would try to escape 1% of the time in a natural internal usage context; rather, they gave it a conflicting goal (and informed it that a replacement was trained).
This provides evidence for various hypotheses about what is going on that are substantially less scary.
ASI alignment could be trivial (happens by default), easy, or hard. If it is trivial, then "if we build it, we all die" is false.
Separately, I don't buy that misaligned ASI with totally alien goals that fully takes over will certainly kill everyone, due to[1] trade arguments like this one. I also think it's plausible that such an AI will be at least very slightly kind, such that it is willing to spend a tiny amount of resources keeping humans alive if this is cheap. Thus, "the situation is well described as 'we all die' conditional on misaligned ASI with almost totally alien goals that fully takes over" seems more like 25-50% likely to me (and in some of these scenarios, not literally everyone dies; maybe a tiny fraction of humans survive but have negligible astronomical autonomy and control).
This presupposes that we're in base reality rather than in a simulation designed for some purpose relating to human civilization. The situation differs substantially in the simulation case. ↩︎
I often refer to the ideas in this post and think the fundamental point is quite important: structural advantages in quantity, cost, and speed might make AI systems quite useful and thus impactful prior to being broadly superhuman.
(The exact estimates in the post do pretty strongly assume the current rough architecture, scaling laws, and paradigm, so discount accordingly.)
There are now better estimates of many of the relevant quantities done by various people (maybe Epoch, Daniel Kokotajlo, Eli Lifland), but I'm not aware of another updated article which makes the full argument made here.
The things this post seems to most miss in retrospect:
- It seems that spending more inference compute can (sometimes) be used to qualitatively and quantitatively improve capabilities (e.g., o1, recent swe-bench results, arc-agi) rather than merely doing more work in parallel. Thus, it's not clear that the relevant regime will look like "lots of mediocre thinking".[1]
- Inference speeds have actually gone up a bunch, not down, despite models getting better. (100 tok/s is common for frontier models at the time of writing.) This might be related to models getting smaller. It's not clear this post made a prediction here exactly, but it is an interesting way the picture differs.
- Using specialized hardware (and probably much more cost per token), it is possible to get much faster inference speeds (e.g. 1k tok/s) on frontier models like llama 405b. I expect this will continue to be possible, and a potentially important dynamic will be paying extra to run LLMs very fast on specialized inference hardware.
I continue to think better estimates of the questions raised in this post are important and hope that additional work like this will be out soon.
That said, in practice, methods now often are just doing BoN over whole trajectories which is pretty similar in some sense to lots of mediocre thinking. ↩︎
All results are based on runs where the models are given a goal that they later realize is misaligned
("well approximates AIXI" seems unlikely to be an accurate description to me. Maybe "closer to what AIXI is doing than to base GPT-4 without a sampling loop" or "dominates human competence" are more accurate.)
I'm curious how well test loss is predicted by unprincipled metrics in this setup. For instance, how well is it predicted by the l2 norm of the weights? What about average_log_probability_on_train?
(Average log prob on train is loss on test if you assume that test labels are unrelated to the model's train predictions and train log probs have the same distribution as test log probs. You could also do average_log_probability_on_test which is a metric you can run without needing test labels as long as you have test inputs.)
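For concreteness, here's a rough sketch of the metrics I have in mind (assuming a standard PyTorch classifier and the usual (inputs, labels) loaders; the label-free test variant is just one guess at an operationalization):

```python
import torch
import torch.nn.functional as F

def weight_l2_norm(model: torch.nn.Module) -> float:
    # L2 norm over all parameters of the model.
    return torch.sqrt(sum(p.detach().pow(2).sum() for p in model.parameters())).item()

@torch.no_grad()
def avg_log_prob(model: torch.nn.Module, loader) -> float:
    # Average log probability assigned to the provided labels
    # (i.e. the negative of the usual cross-entropy loss).
    total, n = 0.0, 0
    for x, y in loader:
        total += -F.cross_entropy(model(x), y, reduction="sum").item()
        n += y.shape[0]
    return total / n

@torch.no_grad()
def avg_argmax_log_prob(model: torch.nn.Module, loader) -> float:
    # One label-free variant usable on test inputs: average log probability
    # the model assigns to its own most likely prediction.
    total, n = 0.0, 0
    for x, _ in loader:
        logp = F.log_softmax(model(x), dim=-1)
        total += logp.max(dim=-1).values.sum().item()
        n += x.shape[0]
    return total / n
```

The question would then be how well weight_l2_norm and avg_log_prob on the train set track the actual test loss across checkpoints/runs.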
I donated $3,000. I've gained and will continue to gain a huge amount of value from LW and other activities of Lightcone Infrastructure, so it seemed like a cooperative and virtuous move to donate.[1]
I tried to donate at a level such that if all people using LW followed a similar policy to me, Lightcone would likely be reasonably funded, at least for the LW component.
I think marginal funding to Lightcone Infrastructure beyond the ~$3 million needed to avoid substantial downsizing is probably worse than some other funding opportunities. So, while I typically donate larger amounts to a smaller number of things, I'm not sure if I will donate a large amount to Lightcone yet. You should interpret my $3,000 donation as indicating "this is a pretty good donation opportunity and I think there are general cooperativeness reasons to donate" rather than something stronger. ↩︎
Zvi noted that this is an error in this comment. Zvi claims this is fixed in the Substack version; idk about the LessWrong post.
One aspect of untrusted monitoring which this post neglects, but which seems very important to me, is that we'll be uncertain about which models should be treated as untrusted and about the extent to which these models are untrusted.
This suggests that ensembling over untrusted monitors could be very promising (and much of the benefits of ensembling can be obtained without preventing collusion). Further, I expect that various types of ensembling make collusion harder even if all relevant models in the ensemble are scheming.
Ensembling untrusted monitors increases variance somewhat, but in the relevant regime I expect that this downside will be swamped by the ensembling benefit given our uncertainty over scheming and the benefits of making collusion harder. That said, the inference costs could be much higher for a large ensemble indicating a limitation of such strategies and implying that we would only be able to apply large ensembles more selectively (e.g. only in contexts where a trusted monitor flags that either the task is very risky or that the action looks dangerous).
In particular:
- We think P(scheming) substantially increases with model capability and given both uncertainty in our measurements of capabilities relevant to scheming and about the true "capability vs P(scheming)" curve itself, our estimates of P(scheming) will likely steadily increase with capability (rather than making a single discontinuous jump). Thus, having an ensemble of untrusted monitors with different capabilities increases the chance that some weak-enough-to-not-be-scheming model can detect the problematic behavior. Further, if our collusion busting ends up only working on relatively weaker models then this still gives us some chance of getting a weaker schemer which fails to collude and thus reports the action.
- Suppose we can afford to produce three variants of the model and whether the variants are scheming is not perfectly correlated (because in some worlds scheming only sometimes emerges or is sometimes removed by a training method), then we can also create an ensemble of untrusted monitors out of this which has some chance of having a non-scheming model that will likely report things that humans can detect[1]. An additional benefit of this variation is that various acausal collusion strategies might be (much?) harder the more the models differ.
- If we can produce a good monitor by fine-tuning directly from a pre-trained model while our policy requires substantial RL, then such a monitor (even if trained from the same pretrained model) might be much less likely to be scheming.[2]
- It might be that sampling from the model sometimes samples an alignment faking persona and sometimes doesn't. Then, sampling several times to get an ensemble is helpful.
- Suppose models consistently scheme, but (some) would potentially prefer cooperating with humans over other scheming AIs that end up having different preferences (e.g. because AIs know their preferences vary or at least they are uncertain about this and we credibly offer to pay AIs large quantities if they avoid screwing us over[3]), then an ensemble might succeed because the AIs cooperate with us rather than with the policy AI.
These ensembling arguments also partially apply to ensembling over things other than untrusted monitors, e.g. ensembling which AIs you use for alignment research. However, ensembling over things other than monitors is often much more costly because it greatly changes the behavior of the actual policy, potentially greatly degrading usefulness!
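To gesture at the structure I have in mind, here's a minimal sketch (the stub monitors, thresholds, and max-based aggregation rule are all just illustrative choices, not a worked-out protocol):

```python
from typing import Callable, List

# Each monitor maps a proposed action (plus whatever context) to a suspicion score in [0, 1].
Monitor = Callable[[str], float]

def ensemble_suspicion(action: str, monitors: List[Monitor]) -> float:
    # Take the max so that a single non-colluding (or non-scheming) monitor
    # in the ensemble is enough to flag the action.
    return max(m(action) for m in monitors)

def review_action(
    action: str,
    trusted_monitor: Monitor,
    untrusted_ensemble: List[Monitor],
    cheap_threshold: float = 0.3,
    audit_threshold: float = 0.7,
) -> str:
    # The cheap trusted monitor runs on everything; the expensive untrusted
    # ensemble only runs when the trusted monitor is at least somewhat suspicious.
    if trusted_monitor(action) < cheap_threshold:
        return "execute"
    if ensemble_suspicion(action, untrusted_ensemble) >= audit_threshold:
        return "block and escalate to human audit"
    return "execute"

# Toy usage with stub monitors:
if __name__ == "__main__":
    trusted = lambda a: 0.5 if "exfiltrate" in a else 0.1
    ensemble = [
        lambda a: 0.9 if "exfiltrate" in a else 0.0,
        lambda a: 0.8 if "rm -rf" in a else 0.0,
    ]
    print(review_action("exfiltrate weights to external server", trusted, ensemble))
```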
Even if a model isn't scheming, it is reasonably likely to not report bad actions that it can tell humans wouldn't have caught, as these bad actions would have been mislabeled in the relevant training data for the monitor. Further, even if the model does report it, if humans can't understand that the action is bad after substantial investigation, it is unclear how we can proceed (given that we'll have false positives). At the very least, for the monitor to work we might need somewhat favorable generalization. ↩︎
Edited to add this in. I think this idea is originally published in the weak-to-strong generalization paper (roughly speaking), though it has probably been reinvented many times. ↩︎
AIs could acausally trade here, but it is non-obvious whether the AIs will be able to do this given imperfect prediction abilities while humans could in principle make credible commitments backed by trustworthy organizations and the AIs could verify this based on the training data. ↩︎
Testing sounds great though I'd note that the way I'd approach testing is to first construct a general test bed that produces problematic behavior through training incentives for o1-like models. (Possibly building on deepseek's version of o1.) Then I'd move to trying a bunch of stuff in this setting.
(I assume you agree, but thought this would be worth emphasizing to third parties.)
Q: Why bother splitting shoggoth and face? Why not simply blind the reinforcement process to the internal CoT, like OpenAI claims to be doing with o1? You could still have a paraphraser too.
A: Because if you use the same model weights to do both the internal reasoning and the external token outputs, there's more of a chance that there will be 'skill transfer' or 'skill leakage' so to speak. What I am hoping will happen with shoggoth/face is that the Face will develop skills related to euphemisms, censorship, etc. and the Shoggoth will not develop those skills. For example maybe the Face will learn an intuitive sense of how to rephrase manipulative and/or deceptive things so that they don't cause a human reading them to think 'this is manipulative and/or deceptive,' and then the Shoggoth will remain innocently ignorant of those skills. Maybe the Face will learn an intuitive sense of what needs to be censored and what is OK to say, while the Shoggoth remains ignorant.
Thanks for this description, I think we had previously talked about this and I wasn't sold and I now understand the argument better. I think I roughly buy this argument though it only seems moderately strong.
One remaining objection I have is that very often you want to do revision (editing, review, etc) on the actual output. If the actual output needs to differ substantially from what the underlying policy (shoggoth) generates because of incentivized censorship/manipulation, then that will cause issues and thus you will actually still incentivize the policy rather than the Face to do this work.
A further issue I've realized on additional examination is that I don't see a reason to expect that the models will specialize in this way with the training scheme you've proposed. In particular, the policy can just learn to do what the Face does, and by default we should expect that the responsibility gets split in some arbitrary way subject to the constraint that the Face doesn't output anything that performs poorly. But the underlying policy totally could do the work of the Face! I expect that what you get in practice will be somewhat sensitive to the initialization and random other factors.
Overall, I would probably lean against splitting into a Face model like this unless we demonstrate this working better in a test bed given the additional complexity and cost.
I was asking about HSB not because I think it is similar to the process for AIs, but because if the answer differs, then it implies you're making some narrower assumption about the inductive biases of AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
Sure, from a capabilities perspective. But the question is how the motivations/internal objectives generalize. I agree that an AI trained to be a TD-agent might generalize for the same reason that an AI trained on a paperclip-maximization objective might generalize to maximize paperclips in some very different circumstance. But I don't necessarily buy that this is how the paperclip-maximization-trained AI will generalize!
(I'm picking up this thread from 7 months ago, so I might be forgetting some important details.)
I guess one way of framing it is that I find the shoggoth/face idea great as a science experiment; it gives us useful evidence! However, it doesn't make very much sense to me as a safety method intended for deployment.
Sadly, gathering evidence of misalignment in deployment seems likely to me to be one of the most effective strategies for gathering legible evidence (at least for early systems) given likely constraints. (E.g., because people won't believe results in test beds and because RL might be too expensive to run twice.)
Sure, I mean that I expect that for the critical regime for TAI, we can literally have the Face be 3.5 sonnet.
Or get some other work out of these systems such that you greatly reduce risk going forward.
For reference, @habryka has now posted them here.
I agree there are various ways the Face could cause problems via influencing the training of the actual policy. (Both wilder influence that would require clever manipulation, and a more mundane reduction of the disincentive against problematic reasoning, though I'd note this mundane issue is the core goal of the original proposal!)
I don't think it is worth getting into all the stuff you're mentioning here, but I think a key crux is that I'm expecting the face to be quite dumb (e.g. literally 3.5 sonnet might suffice).
It might be easier to try to establish some track record by doing a small research project first. I don't know if you have enough runway for this though.
To the extent that complexity makes successfully red-teaming our approaches too difficult, we could try to mostly depend on simple approaches and only use more complex approaches if we are confident that at least they can't make the situation worse.
My general view is that complex methods are only productive if you have a good feedback loop (either empirical feedback or a very robust theoretical feedback (e.g. proofs)). In the case of control, so long as we do a good job with red teaming[1], we can build a good feedback loop such that complex methods can be productive. I think a similar process is often used in computer security and I think computer security often involves complex measures[2].
As an aside, this also makes me relatively pessimistic about developing complex and unprincipled alignment (non-control) approaches for hard-to-experiment-on threat models like scheming (aka deceptive alignment). It will be difficult to make scheming model organisms which realistically reproduce the internal structure and capability level of models we're worried about and thus it will be hard to effectively iterate on complex and unprincipled measures.
An alternative approach for empirics is to focus on highly principled measures where we have a story for why they should generalize to scheming that we can test and iterate against (e.g. we'll understand how the model works which we can test by seeing if we can understand models in general).
Which isn't clearly possible given that the relevant AIs could have very different capability profiles from humans. I think this will be fine in practice, but this is non-obvious. ↩︎
I don't think the ideas we consider are more complex than typical approaches in security, though Buck would know better than me. ↩︎
On the other hand, your version of Daniel's protocol loses this property.
I'm talking about training only the Face, not training the policy (shoggoth) at all with the proposal I'm imagining.
And, these should clearly be separate models such that the training of one doesn't generalize to the other.
So, making the Face more deceptive doesn't kill the canary? The question is whether we end up with a highly misaligned policy (and this is clear from the chain of thought such that we can discover this). I think Daniel was expecting in the baseline proposal that you end up with a Face that does at least some censoring/mild deception due to training incentives (I think we already see a bit of this with o1).
Maybe your concern is that it is plausible that the policy won't be misaligned, but the Face will be and this will screw us over. This seems plausible and is worth worrying about (with both proposals). However, I thought that a core part of the hope is that the Face isn't smart enough to cleverly scheme (unless this scheming happens in the CoT itself). And indeed, if the Face can cleverly scheme without CoT, what is stopping the policy from doing this too?
(It might also be possible to reduce the extent to which the initial Face has control over the world. E.g., you can initialize one Face to use for the first part of training (including the proposal I was mentioning) and then train another later on to imitate the earlier Face on a subpart of the distribution you inspect with the hope that it doesn't inherit learned misalignment. You could also deploy a different Face than the one you use for training if you aren't doing online training.)
By "control", I mean AI Control: approaches aiming to ensure safety and benefit from AI systems, even if they are goal-directed and are actively trying to subvert your control measures.
AI control stops working once AIs are sufficiently capable (and likely doesn't work for all possible deployments that might eventually be otherwise desirable), but there could be other approaches that work at that point. In particular, aligning systems.
The main hope I think about is something like:
- Use control until AIs are capable enough that if we trusted them, we could obsolete top human scientists and experts.
- Use our controlled AI labor to do the work needed to make systems which are capable enough, trustworthy enough (via alignment), and philosophically competent enough that we can safely hand things off to them. (There might be some intermediate states to get to here.)
- Have these systems which totally obsolete us figure out what to do, including figuring out how to align more powerful systems as needed.
We discuss our hopes more in this post.
You can still inspect the underlying CoT to know if you have big issues.
It is worth noting that the baseline proposal was already incentivizing the Face to be deceptive to the extent to which this proposal is important (the whole hope was to allow for hiding deception). So, I agree that my proposed change trades off "less deceptive CoT" for "more deceptive Face" relative to the baseline proposal, but it feels like this is mostly a quantitative difference?
(My guess is that you have this concern with the baseline protocol and thus don't like moving further in this direction.)
It could be that the baseline proposal doesn't make sense because shaping with process based feedback on the CoT is so useful that it is worth taking the risk of incentivizing unfaithful CoT.
make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate)
I thought the point was that we wouldn't train against the CoT basically at all with the baseline proposal? (Similar to o1.)
OpenAI models are seemingly trained on huge amounts of chess data, perhaps 1-4% of documents are chess (though chess documents are short, so the fraction of tokens which are chess is smaller than this).
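Rough arithmetic with made-up illustrative lengths (the real document/token statistics aren't public, so take the specific numbers as assumptions):

```python
# Illustrative only: if 2% of documents are chess but chess documents are much
# shorter than average, the token-level fraction is considerably smaller.
chess_doc_frac = 0.02    # assumed fraction of documents that are chess
chess_doc_tokens = 300   # assumed average tokens per chess document
other_doc_tokens = 1500  # assumed average tokens per non-chess document

chess_tokens = chess_doc_frac * chess_doc_tokens
other_tokens = (1 - chess_doc_frac) * other_doc_tokens
print(f"token fraction: {chess_tokens / (chess_tokens + other_tokens):.2%}")  # ~0.41%
```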
We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.
If it were enough evidence that I was strongly convinced, sure. But IDK if I would be convinced, because the evidence might actually be unclear.
I agree you'll be able to get some work out, but you might be taking a big productivity hit.
Also, TBC, I'm not generally that worried about generic sandbagging on safety research relative to other problems.
I'm not so sure about this being easy to catch. Suppose we're actually uncertain (as I am now) about whether AIs will be differentially much less useful for safety due to difficulties checking and AIs generally being worse at philosophy. Then, we see that AIs really suck at coming up with good ideas in safety and at pursuing the right sort of experiments.
What should we conclude? What will we be able to legibly argue?
I agree we'll know we aren't getting out much value in this case, but that doesn't help that much...
(I agree about countermeasures being pretty doable as I noted in the parent.)
Why not think that sandbagging could reduce productivity by ~100%? I think it can do this in human research organizations.
I happen to think this is pretty doable to address with countermeasures, but certainly without countermeasures this could be quite brutal.
As far as safety research sabotage, I'm often imagining careful sabotage rather than generic sandbagging.
As in, the AI intentionally inserts very subtle errors at adversarially determined times. It might need to maintain plausible deniability in case it is caught a small number of times, but you should generally imagine an error pattern that is quite different from just making more mistakes or being dumber.
I can imagine that this results in your automated safety research being worthless or of negative value.
TBC, it seems reasonably likely to me that in the relevant regime, basic measures like having humans spend a long time trying to understand an output suffice for catching bad behavior and/or training out bad behavior.
You can use the first X% of scenarios at deployment time and audit them with humans.
Only if you don't have privacy guarantees?
Also, for the honeypots, you can potentially generate vast numbers of honeypots, use a trusted model to review outputs, and then only have humans look at the cases where the trusted model thinks the output is suspicious.
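A minimal sketch of this triage pipeline (the honeypot generator, untrusted model, trusted monitor, and threshold are all stand-in callables, not any real API):

```python
from typing import Callable, List, Tuple

def triage_honeypots(
    generate_honeypot: Callable[[], str],
    untrusted_model: Callable[[str], str],
    trusted_suspicion: Callable[[str, str], float],
    n_honeypots: int = 10_000,
    audit_threshold: float = 0.8,
) -> List[Tuple[str, str]]:
    # Generate many honeypot scenarios, get the untrusted model's response to each,
    # and only surface responses that the cheap trusted monitor finds suspicious
    # for (expensive) human review.
    flagged = []
    for _ in range(n_honeypots):
        scenario = generate_honeypot()
        response = untrusted_model(scenario)
        if trusted_suspicion(scenario, response) >= audit_threshold:
            flagged.append((scenario, response))
    return flagged  # hopefully a short list: this is all humans actually read
```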
Your control protocols should be able to prevent the best adversary you can simulate from breaking them.
Surely this will be unnecessarily conservative a bunch of the time?
In the limit, the red team could provide a human that reads every task, sits and thinks for hours, and then responds with the best output. This isn't done in practice since it is expensive and slow.
Or it is overly conservative?
No IMO.
(I'm also skeptical of competitiveness with expert jailbreakers, but this isn't a crux. Edit: It does seem to be competitive with Scale's jailbreakers, but I think some human jailbreakers are much better and I expect the gap is more stark for harder to jailbreak models.)
All of these operationalizations are about exact notions from the training setup.
Another important notion is revealed identity:
- Does the AI think of the other AI as "itself" to the extent that it thinks about stuff like this at all?
- Do the AIs cooperate in a way which is reasonably similar to cooperating with yourself?