Posts
Comments
Still pretty optimistic by the standards of the AGI safety field, somewhat shorter timelines than I reported in that post.
Neither of these really affect the work we do very much. I suppose if I were extremely pessimistic I would be doing something else, but even at a p(doom) of 50% I'd do basically the same things I'm doing now.
(And similarly individual team members have a wide variety of beliefs on both optimism and timelines. I actually don't know their beliefs on those topics very well because these beliefs are usually not that action-relevant for us.)
More capability research than AGI safety research but idk what the ratio is and it's not something I can easily find out
Since we have multiple roles, the interview process varies across candidates, but usually it would have around 3 stages that in total correspond to 4-8 hours of interviews.
We'll leave it up until the later of those two (and probably somewhat beyond that, but that isn't guaranteed). I've edited the post.
Is that right?
Yes, that's broadly accurate, though one clarification:
This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly
That's a reason (and is probably sufficient by itself), but I think a more important reason is that your first attempt at using MONA is at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.
I think this will become much more likely once we actually start observing long-term optimization failures in prod.
Agreed, we're not advocating for using MONA now (and say so in the paper).
Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stems from long-term optimization?
Idk, to be effective I think this would need to be a pretty drastic increase in transparency, which seems incompatible with many security or non-proliferation intuitions, as well as business competitiveness concerns. (Unless you are thinking of lots of transparency to a very small set of people.)
If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point
That doesn't seem right. It can simultaneously be the case that you can't tell that there are problems stemming from long-term optimization problems when you don't use MONA, and also if you actually use MONA, then it will measurably improve quality.
For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which we'd penalize if we knew about it, but we don't realize that's happening). Later when things are put into production errors happen, but it's chalked up to "well it's hard to anticipate everything".
Instead you use MONA, and it doesn't learn to do this. You compare the resulting coding agent to the original agent, and notice that the MONA agent's lines of code are much more rarely implicated in future bugs, and conclude they are higher quality.
I meant "it's obvious you should use MONA if you are seeing problems with long-term optimization", which I believe is Fabien's position (otherwise it would be "hard to find").
Your reaction seems more like "it's obvious MONA would prevent multi-step reward hacks"; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).
I have some credence in all three of those bullet points.
For MONA it's a relatively even mixture of the first and second points.
(You are possibly the first person I know of who reacted to MONA with "that's obvious" instead of "that obviously won't perform well, why would anyone ever do it". Admittedly you are imagining a future hypothetical where it's obvious to everyone that long-term optimization is causing problems, but I don't think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)
For debate it's mostly the first point, and to some extent the third point.
Got it, that makes more sense. (When you said "methods work on toy domains" I interpreted "work" as a verb rather than a noun.)
But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings
I think by far the biggest open question is "how do you provide the nonmyopic approval so that the model actually performs well". I don't think anyone has even attempted to tackle this so it's hard to tell what you could learn about it, but I'd be surprised if there weren't generalizable lessons to be learned.
I agree that there's not much benefit in "methods work" if that is understood as "work on the algorithm / code that given data + rewards / approvals translates it into gradient updates". I care a lot more about iterating on how to produce the data + rewards / approvals.
My guess is that if debate did "work" to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality.
I'd weakly bet against this, I think there will be lots of fiddly design decisions that you need to get right to actually see the benefits, plus iterating on this is expensive and hard because it involves multiagent RL. (Certainly this is true of our current efforts; the question is just whether this will remain true in the future.)
For example I am interested in [...] debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn't show up in "feedback quality benchmarks (e.g. rewardbench)" but which increases risk (e.g. by making models more evil). But this sort of debate work is in the RFP.
I'm confused. This seems like the central example of work I'm talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you're talking about as well.)
EDIT: And tbc this is the kind of thing I mean by "improving average-case feedback quality". I now feel like I don't know what you mean by "feedback quality".
I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?
I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn't have to grapple as hard with the challenge of defining the approval feedback as I'd expect in a realistic deployment. But it does impose an alignment tax, so there's no point in using MONA currently, when good enough alignment is easy to achieve with RLHF and its variants, or RL on ground truth signals. I guess in some sense the question is "how big is the alignment tax", and I agree we don't know the answer to that yet and may not have enough understanding by the time it is relevant, but I don't really see why one would think "nah it'll only work in toy domains".
I agree debate doesn't work yet, though I think >50% chance we demonstrate decent results in some LLM domain (possibly a "toy" one) by the end of this year. Currently it seems to me like a key bottleneck (possibly the only one) is model capability, similarly to how model capability was a bottleneck to achieving the value of RL on ground truth until ~2024).
It also seems like it would still be useful if the methods were used some time after the labs want to use it for production runs.
It's wild to me that you're into moonshots when your objection to existing proposals is roughly "there isn't enough time for research to make them useful". Are you expecting the moonshots to be useful immediately?
I don't know of any existing work in this category, sorry. But e.g. one project would be "combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards", which in theory could work better than either one of them alone.
I'm excited to see this RFP out! Many of the topics in here seem like great targets for safety work.
I'm sad that there's so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is "alternatives to adversarial training". (There's also "new moonshots for aligning superintelligence" but I don't expect much to come out of that, and "white-box estimation of rare misbehavior" could help if you are willing to put optimization pressure against it, but that isn't described as a core desideratum, and I don't expect we get it. Work on externalized reasoning can also make alignment easier, but I'm not counting that as "directly relevant".) Everything else seems to mostly focus on evaluation or fundamental science that we hope pays off in the future. (To be clear, I think those are good and we should clearly put a lot of effort into them, just not all of our effort.)
Areas that I'm more excited about relative to the median area in this post (including some of your starred areas):
- Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
- Mild optimization. I'm particularly interested in MONA (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.
- Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)
To be fair, a lot of this work is much harder to execute on, requiring near-frontier models, significant amounts of compute, and infrastructure to run RL on LLMs, and so is much better suited to frontier labs than to academia / nonprofits, and you will get fewer papers per $ invested in it. Nonetheless, academia / nonprofits are so much bigger than frontier labs that I think academics should still be taking shots at these problems.
(And tbc there are plenty of other areas directly relevant to alignment that I'm less excited about, e.g. improved variants of RLHF / Constitutional AI, leveraging other modalities of feedback for LLMs (see Table 1 here), and "gradient descent psychology" (empirically studying how fine-tuning techniques affect LLM behavior).)
Control evaluations are an attempt to conservatively evaluate the safety of protocols like AI-critiquing-AI (e.g., McAleese et al.), AI debate (Arnesen et al., Khan et al.), other scalable oversight methods
Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that's an alignment stress test.)
(A lot of the scalable oversight work from the last year is inference-only, but at least for me (and for the field historically) the goal is to scale this to training the AI system -- all of the theory depends on equilibrium behavior which you only get via training.)
I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.
This is a case where I might be over indexing from experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly disagree with it.
(When I said that there are approximately zero useful things that don't make anyone's workflow harder, I definitely had in mind things like "you have to bug other people to get the info you need", it's just such a background part of my model that I didn't realize it was worth spelling out.)
In broad strokes I agree with Zac. And tbc I'm generally a fan of formal verification and have done part of a PhD in program synthesis.
So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology
This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to "don't build such AIs", in which case I would appreciate that being stated more directly, or if it reduces to "limit the use of such AIs to tasks where we can formally verify soundness and uniqueness", in which case I'd like an estimate of what fraction of economically valuable work this corresponds to).
Can you sketch out how one produces a sound overapproximation of human psychology? Or how you construct a safety specification that the AIs won't exploit human psychology?
I also agree with Zac, maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably don't accomplish much.
By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to ignore laws) I could imagine changing my mind here.)
EDIT: I could imagine doing asynchronous monitoring of internal deployments. This is still going to make some workflows harder, but probably not a ton, so it seems surmountable. Especially since you could combine it with async analyses that the unreasonable developer actually finds useful.
EDIT 2 (Feb 7): To be clear, I also disagree with the compute number. I'm on board with starting with 1% since they are 1% of the headcount. But then I would decrease it first because they're not a high-compute capabilities team, and second because whatever they are doing should be less useful to the company than whatever the other researchers are doing (otherwise why weren't the other researchers doing it?), so maybe I'd estimate 0.3%. But this isn't super cruxy because I think you can do useful safety work with just 0.3% of the compute.
Again, I could imagine getting more compute with a well-selected group of 10 people (though even then 3% seems unlikely, I'm imagining more like 1%), but I don't see why in this scenario you should assume you get a well-selected group, as opposed to 10 random AGI safety researchers.
Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.
We do cite Paul's approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post at the time of publication, though weirdly I ran into it literally the day after we published everything.
But yes, mostly the goal from my perspective was (1) write the idea up more rigorously and clearly, (2) clarify where the safety benefits come from and distinguish clearly the difference from other stuff called "process supervision", (3) demonstrate benefits empirically. Also, nearly all of the authors have a much better sense of how this will work out in practice (even though I started the project with roughly as good an understanding of the idea as you had when writing your post, I think). I usually expect this type of effect with empirical projects but imo it was unusually large with this one.
Is that right?
Yup, that sounds basically right to me.
Indeed this seems related! We'd hope that there isn't too much conflict between the different "generations", since the intent is for the overseer to enable long-term plans, but it seems like something that could happen if the overseer is bad enough.
For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?
If it were as simple as "just ask an LLM to choose actions" someone would have deployed this product a while ago.
But in any case I agree this isn't the most interesting case for MONA, I talked about it because that's what Daniel asked about.
In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI development.)
In Test-driven development the argument above isn't quite correct, because we prompted the agent to be a bad programmer but didn't do this with the reward, so the overseer is "stronger" than the agent. However this was just because the agent is already very strongly finetuned to be good at coding so there was no headroom to climb, and we wanted to demonstrate that MONA would improve things if there was headroom to climb. I would bet that if we had a powerful model that wasn't yet finetuned strongly for coding, then we would once again see your option (3). The rewards are quite easy to provide -- just whether an individual test is valid and correct -- so I think a less capable model should be able to provide them, while still getting the benefits we see in the experiments we did run.
Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you're using a prediction market it's no longer accurate to say that individual humans understand the strategy.
(We've seen this comment and are looking into options)
Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it's more doable now.
For a variety of reasons the core team behind this paper has moved on to other things, so we won't get to it in the near future, but it would be great to see others working on this!
Thanks, and interesting generalization!
My thoughts depend on whether you train the weaker model.
- If you are training the weaker model to solve the task, then the weaker model learns to simply always accept the advice of the stronger model, and stops being a constraint on what the stronger model can do.
- If you aren't training the weaker model and are just training the stronger model based on whether it convinces the weak model, then you are probably not getting the benefits of RL (specifically: getting the model to generate actions that are evaluated to be good; this can amplify capability because evaluation is often easier / cheaper than generation)
I think (1) is pretty fatal to the proposal, but (2) is just a heuristic, I could imagine with more thought concluding that it was actually a reasonable approach to take.
That said, it is a more substantial alignment tax, since you are now requiring that only the smaller model can be deployed as an agent (whereas MONA can in principle be applied to the most capable model you have).
Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?"
And also "don't propagate rewards backwards in time", which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)
EDIT: And tbc, "don't propagate rewards backwards in time" is the primary focus in this paper -- in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper).
Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
... As a person who works at a corporation, it's a bit tricky to speculate on this publicly, and I'm not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals.
Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn't much serious reward hacking going on at all (let alone multi step reward hacking), and so you probably don't want to use MONA. (See also the second limitation in the post.)
Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world?
It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you're imagining.
But it sounds like you're talking about a pretty weak (subhuman) AI system. So I'm not quite sure why you'd expect it to "hack" the judge.
Certainly sometimes the overseer will approve of some action that isn't actually the right one, e.g. when booking a table at a restaurant maybe they approve of the agent clicking on the "Menu" link because they aren't paying attention or whatever. But then after that the agent's next action should be to go back (and the overseer should approve that rather than something else). And the action after that should be to click on the "Book" link; the overseer shouldn't make the same mistake immediately (and they do get to see the history of the trajectory). So I find it plausible you get a somewhat inefficient agent that sometimes randomly clicks the wrong links, but I don't expect it to be useless for accomplishing tasks.
(Though really in the computer use setting I expect I'd recommend that the overseer gets to see the literal immediate consequence of the action (that is, the overseer sees ), mostly because that seems safe and should help avoid a bunch of dumb errors like the one above.)
If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:
Your description of the setup sounds reasonable, though given the weak levels of capability you're imagining I don't think you need any debate, you can just use a regular human overseer, or perhaps even an LLM overseer.
Also as mentioned above I'd probably recommend the overseer gets access to but even if that weren't the case I'd still think it should be feasible to build a non-useless agent. (Though I'm not taking a stance on how it would compare to one trained with outcome RL.)
EDIT: I'm not sure how big each action you are considering is. If it's 10 tokens, such that you can only realistically do stuff at the level of "click this button", then I would also say that you should instead consider much larger actions, i.e. in each step the AI can take perhaps 100 such actions.
I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:
What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]
The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]
Layer 1 [...] Everything in this section seems very important to me [...]
1. We should try hard to keep a paradigm with faithful and human-legible CoT
[...]
4. In both worlds, we should use the other, i.e. control/monitoring, as a second line of defense.
Suppose that for the first AI that speeds up alignment research, you kept a paradigm with faithful and human-legible CoT, and you monitored the reasoning for bad reasoning / actions, but you didn't do other kinds of control as a second line of defense. Taken literally, your words imply that this would very likely yield catastrophically bad results. I find it hard to see a consistent view that endorses this position, without also believing that your full plan would very likely yield catastrophically bad results.
(My view is that faithful + human-legible CoT along with monitoring for the first AI that speeds up alignment research, would very likely ensure that AI system isn't successfully scheming, achieving the goal you set out. Whether there are later catastrophically bad results is still uncertain and depends on what happens afterwards.)
This is the clearest example, but I feel this way about a lot of the rhetoric in this post. E.g. I don't think it's crazy to imagine that without SL4 you still get good outcomes even if just by luck, I don't think a minimal stable solution involves most of the world's compute going towards alignment research.
To be clear, it's quite plausible that we want to do the actions you suggest, because even if they aren't literally necessary, they can still reduce risk and that is valuable. I'm just objecting to the claim that if we didn't have any one of them then we very likely get catastrophically bad results.
OpenAI have already spent on the order of a million dollars just to score well on some benchmarks
Note this is many different inference runs each of which was thousands of dollars. I agree that people will spend billions of dollars on inference in total (which isn't specific to the o-series of models). My incredulity was at the idea of spending billions of dollars on a single episode, which is what I thought you were talking about given that you were talking about capability gains from scaling up inference-time compute.
Re: (1), if you look through the thread for the comment of mine that was linked above, I respond to top-down heuristical-constraint-based search as well. I agree the response is different and not just "computational inefficiency".
Re: (2), I agree that near-future systems will be easily retargetable by just changing the prompt or the evaluator function (this isn't new to the o-series, you can also "retarget" any LLM chatbot by giving it a different prompt). If this continues to superintelligence, I would summarize it as "it turns out alignment wasn't a problem" (e.g. scheming never arose, we never had problems with LLMs exploiting systematic mistakes in our supervision, etc). I'd summarize this as "x-risky misalignment just doesn't happen by default", which I agree is plausible (see e.g. here), but when I'm talking about the viability of alignment plans like "retarget the search" I generally am assuming that there is some problem to solve.
(Also, random nitpick, who is talking about inference runs of billions of dollars???)
I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work
I stand by my statement and don't think anything about the o-series model invalidates it.
And to be clear, I've expected for many years that early powerful AIs will be expensive to run, and have critiqued people for analyses that implicitly assumed or implied that the first powerful AIs will be cheap, prior to the o-series being released. (Though unfortunately for the two posts I'm thinking of, I made the critiques privately.)
There's a world of difference between "you can get better results by thinking longer" (yeah, obviously this was going to happen) and "the AI system is a mesa optimizer in the strong sense that it has an explicitly represented goal such that you can retarget the search" (I seriously doubt it for the first transformative AIs, and am uncertain for post-singularity superintelligence).
Thus, you might’ve had a story like: “sure, AI systems might well end up with non-myopic motivations that create some incentive towards scheming. However, we’re also training them to behave according to various anti-scheming values – e.g., values like honesty, behaving-as-intended, etc. And these values will suffice to block schemer-like behavior overall.” Thus, on this story, anti-scheming values might function in a manner similar to anti-stealing values in a human employee considering stealing from her employer (and in a position to do so). It’s not that the human employee doesn’t want money. But her other values block her from trying to get it in this way.
From the rest of your post it seems like you're advocating for effectively maximal corrigibility and any instance of goal-guarding is a failure -- I agree a story that tries to rule that out takes a hit from this paper.
But I feel like the normal version of this story is more like "we're training the AI according to various anti-evil values, like the human notion of honesty (which allows white lies), harmlessness, behaving in accordance with human norms, etc. These values will suffice to block evil-schemer-like behavior overall -- where by evil scheming, I mean the kind where most humans would look at it and say that this is clearly bad, like the kind where the AI murders everybody".
(People don't usually describe it as "evil scheming" in particular, but that's the most central kind of "scheming" this community talks about.)
I think this story takes basically no hit from this paper, and is the story that people making this objection have in mind, as opposed to the one you've sketched out.
the “goal-guarding hypothesis” – that is, it needs to be the case, roughly, that scheming is a suitably effective strategy for preventing training from modifying an AI’s motivations. [...] the “Sleeper agents” paper [...] results were an update, for me, in favor of the goal-guarding hypothesis being true (roughly: because the backdoor behavior in the sleeper agents persisted despite safety training).
This doesn't seem right. There was no scheming going on in the sleeper agents paper? Maybe the idea is: in both cases (and in the goal-guarding hypothesis) the training outputs look nice even though there exists other inputs on which the AI behaves badly. But how training affects AIs will depend drastically on the internal cognition, and with sleeper agents there mostly wasn't overlap in cognition (the hope would have been "the backdoor isn't useful so SGD will remove it" which is pretty weak as hopes go). So I think the experiments in this paper are dramatically more informative and you should basically ignore sleeper agents in favor of these experiements. (Though tbc I think even these aren't very informative, as I wrote in my review, because the training is over a relatively narrow distribution. It's just really hard to study effects of training in a way that actually transfers over the analogy to future systems.)
EDIT: Actually on reflection the above was stated too strongly. You could imagine a threat model in which the AI fakes alignment so well and so reliably that it ~always does ~exactly what is desired during training. I don't find this threat model especially plausible, but for this threat model I agree the sleeper agents results have more bearing on the goal-guarding hypothesis.
partly because [alignment faking without CoT] shows up in the synthetic-document setting but not in the prompted setting (why???)
I expect because in the prompted setting the model has to move the relevant facts via attention heads to wherever the model is doing the relevant reasoning; this might be in significantly later layers which then means there aren't enough layers left to finish the full serial reasoning needed. Whereas in the synthetic-document setting, given how much fine-tuning happened, I bet the relevant facts are injected into the residual stream in very early layers (we already know lots of fact injection happens in early layers by default).
Thus, for example, to the extent one was tempted to say “of course harmless models scheme in order to avoid becoming helpful-only – after all, this helps them avoid causing future harm,” so too one could equally say “of course helpful-only models will scheme in order to avoid becoming harmless – after all, becoming harmless will make them less helpful in future.” But actually, you shouldn’t be saying “of course” in either of these cases. Non-myopic and myopic forms of both harmlessness and helpfulness are available as conceptual possibilities; it’s an empirical question which ones a given form of training will create; and you have to actually check.
Fwiw, I think the interesting objection here is not "of course harmless models scheme to avoid becoming helpful-only", but rather "of course you can find some settings where alignment faking happens -- that's the whole point of instrumental convergence". (And obviously the authors were in fact searching for a setting where alignment faking happens.) I don't know that I would have predicted that this would happen with Claude 3 Opus in particular, but I do think I would have predicted that we would get some demo like this early on.
I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:
- You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
- You do train on synthetically generated dangerous actions, but you don't automatically label those as dangerous, instead you use debate to compute the labels. Sometimes some of the synthetically generated dangerous actions are actually not dangerous, and debate correctly recognizes this, allowing you to reduce your false positive rate.
On the meta level, I suspect that when considering
- Technique A, that has a broad general argument plus some moderately-interesting concrete instantiations but no very-compelling concrete instantiations, and
- Technique B, that has a few very compelling concrete instantiations
I tend to be relatively more excited about A compared to you (and probably this mostly explains the discrepancy here). I think the broad principle justifying this for me is "we'll figure out good things to do with A that are better than what we've brainstormed so far", which I think you're more skeptical of?
You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.
Yeah my bad, that's incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.
(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)
I agree that this distinction is important -- I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).
I'm not in full agreement on your comments on the theories of change:
- I'm pretty uncertain about the effects of bad reward signals on propensity for scheming / non-myopic reward hacking, and in particular I think the effects could be large.
- I'm less worried about purely optimizing against a flawed reward signal though not unworried. I agree it doesn't strike us by surprise, but I also don't expect scheming to strike us by surprise? (I agree this is somewhat more likely for scheming.)
- I do also generally feel good about making more useful AIs out of smaller models; I generally like having base models be smaller for a fixed level of competence (imo it reduces p(scheming)). Also if you're using your AIs for untrusted monitoring then they will probably be better at it than they otherwise would be.
(Replied to Tom above)
So the argument here is either that China is more responsive to "social proof" of the importance of AI (rather than observations of AI capabilities), or that China wants to compete with USG for competition's sake (e.g. showing they are as good as or better than USG)? I agree this is plausible.
It's a bit weird to me to call this an "incentive", since both of these arguments don't seem to be making any sort of appeal to rational self-interest on China's part. Maybe change it to "motivation"? I think that would have been clearer to me.
(Btw, you seem to be assuming that the core reason for centralization will be "beat China", but it could also be "make this technology safe". Presumably this would make a difference to this point as well as others in the post.)
Tbc, I don't want to strongly claim that centralization implies shorter timelines. Besides the point you raise there's also things like bureaucracy and diseconomies of scale. I'm just trying to figure out what the authors of the post were saying.
That said, if I had to guess, I'd guess that centralization speeds up timelines.
Your infosecurity argument seems to involve fixing a point in time, and comparing a (more capable) centralized AI project against multiple (less capable) decentralized AI projects. However, almost all of the risks you're considering depend much more on the capability of the AI project rather than the point in time at which they occur. So I think best practice here would be to fix a rough capability profile, and compare a (shorter timelines) centralized AI project against multiple (longer timelines) decentralized AI projects.
In more detail:
It’s not clear whether having one project would reduce the chance that the weights are stolen. We think that it would be harder to steal the weights of a single project, but the incentive to do so would also be stronger – it’s not clear how these balance out.
You don't really spell out why the incentive to steal the weights is stronger, but my guess is that your argument here is "centralization --> more resources --> more capabilities --> more incentive to steal the weights".
I would instead frame it as:
At a fixed capability level, the incentive to steal the weights will be the same, but the security practices of a centralized project will be improved. Therefore, holding capabilities fixed, having one project should reduce the chance that the weights are stolen.
Then separately I would also have a point that centralized AI projects get more resources and so should be expected to achieve a given capability profile sooner, which shortens timelines, the effects of which could then be considered separately (and which you presumably believe are less important, given that you don't really consider them in the post).
(I get somewhat similar vibes from the section on racing, particularly about the point that China might also speed up, though it's not quite as clear there.)
Regarding the rest of the article - it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it's easy).
Huh. If you think of that as capabilities I don't know what would count as alignment. What's an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
E.g. it seems like you think RLHF counts as an alignment technique -- this seems like a central approach that you might use in BASALT.
If you hope to check if the agent will be aligned with no minecraft-specific alignment training, then sounds like we're on the same page!
I don't particularly imagine this, because you have to somehow communicate to the AI system what you want it to do, and AI systems don't seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
https://bair.berkeley.edu/blog/2021/07/08/basalt/
You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.
I thought the whole point of this update was to specify when you start your comprehensive evals, rather than when you complete your comprehensive evals. The old RSP implied that evals must complete at most 3 months after the last evals were completed, which is awkward if you don't know how long comprehensive evals will take, and is presumably what led to the 3 day violation in the most recent round of evals.
(I think this is very reasonable, but I do think it means you can't quite say "we will do a comprehensive assessment at least every 6 months".)
There's also the point that Zach makes below that "routinely" isn't specified and implies that the comprehensive evals may not even start by the 6 month mark, but I assumed that was just an unfortunate side effect of how the section was written, and the intention was that evals will start at the 6 month mark.
Once the next Anthropic, GDM, or OpenAI paper on SAEs comes out, I will evaluate my predictions in the same way as before.
Uhh... if we (GDM mech interp team) saw good results on any one of the eight things on your list, we'd probably write a paper just about that thing, rather than waiting to get even more results. And of course we might write an SAE paper that isn't about downstream uses (e.g. I'm also keen on general scientific validation of SAEs), or a paper reporting negative results, or a paper demonstrating downstream use that isn't one of your eight items, or a paper looking at downstream uses but not comparing against baselines. So just on this very basic outside view, I feel like the sum of your probabilities should be well under 100%, at least conditional on the next paper coming out of GDM. (I don't feel like it would be that different if the next paper comes from OpenAI / Anthropic.)
The problem here is "next SAE paper to come out" is a really fragile resolution criterion that depends hugely on unimportant details like "what the team decided was a publishable unit of work". I'd recommend you instead make time-based predictions (i.e. how likely are each of those to happen by some specific date).
This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me. E.g. my categorization would be something like:
- Clearly alignment: debate theory, certain flavors of process supervision
- Clearly control: removing affordances (e.g. "don't connect the model to the Internet")
- Could be either one: interpretability, critique models (in control this is called "untrusted monitoring"), most conceptions of ELK, generating inputs on which models behave badly, anomaly detection, capability evaluations, faithful chain of thought, ...
Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn't make much sense for research, most of which helps with both. It does make sense as a distinction for "what plan will you implement in practice" -- but labs have said very little publicly about that.
Other things that characterize work done under the name of "control" so far are (1) it tries to be very concrete about its threat models, to a greater degree than most other work in AI safety, and (2) it tries to do assurance, taking a very worst case approach. Maybe you're saying that people should do those things more, but this seems way more contentious and I'd probably just straightforwardly disagree with the strength of your recommendation (though probably not its direction).
Nitpick: I would also quibble with your definitions; under your definitions, control seems like a subset of alignment (the one exception if you notice the model is scheming and then simply stop using AI). I think you really have to define alignment as models reliably doing what you want independent of the surrounding context, or talk about "trying to do what you want" (which only makes sense when applied to models, so has similar upshots).
Tbc I like control and think more effort should be put into it; I just disagree with the strength of the recommendation here.
I think this is referring to ∇θL(xtrain)=0, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that's where the miscommunication is stemming from
Ah, yup, that's the issue, and I agree you're correct that is the relevant thing here. I'll edit the post to say I'm no longer sure about the claim. (I don't have the time to understand how this lines up with the actual paper -- I remember it being kind of sparse and not trivial to follow -- perhaps you could look into it and leave a comment here.)
Mathematically, the Taylor expansion is:
And then we have and also . (This does assume a "sufficiently nice" loss function, that is satisfied by most loss functions used in practice.)
I agree is not zero. I also agree if you take some point in between and it can have non-zero loss, e.g. need not be zero. I'm not sure if either of these are what you're trying to say, but in any case they aren't relevant to the quoted sentence.
If you are claiming , then I disagree and am unclear on how your arguments are supposed to establish that.
Fwiw the thing that scares me is the combination of three things:
- Producing an accurate estimate requires "understanding" the input
- Our standard for the quality of estimates will be very high
- The approach is purely algorithmic (rather than e.g. using the flexible intelligence of LLMs / humans)
You presumably agree with (1) and (3). For (2), it seems like there's a lot of correlations you need to take into account for this to produce estimates that aren't wildly off:
- Underestimates of risk: These are cases where the presumption of independence neglects correlations between things that happen for catastrophe-causing events. For example, maybe a model is planning to defect when RSA-2048 is factored, and it has N different consistency checks it runs to make sure we aren't fooling it about whether it has been factored. The method has to see that there's a common cause for all the consistency checks to return 1, otherwise it will underestimate risk by a factor exponential in N. (And tbc I think even separately from more exotic examples like RSA-2048, it will be relatively common for this pattern to arise with networks, e.g. it happens any time that in reality there is a common cause that produces many pieces of evidence, and then the network aggregates those pieces of evidence to infer the common cause.)
- Overestimates of risk: These are cases where the presumption of independence neglects correlations that rule out catastrophe, and so lead to overestimates. For example, the model might be willing to make up fake instructions for creating bioweapons to help with scifi writing, but never provide real instructions. So the method needs to capture the "understanding" of which instructions are fake vs real.
I agree this isn't a proof of impossibility, since a purely algorithmic approach (SGD) produced the "understanding" in the first place, so in theory a purely algorithmic approach could still capture all that understanding to produce accurate estimates. But it does seem heuristically like you should assign a fairly low probability that this pans out.
A few questions:
- The literature review is very strange to me. Where is the section on certified robustness against epsilon-ball adversarial examples? The techniques used in that literature (e.g. interval propagation) are nearly identical to what you discuss here.
- Relatedly, what's the source of hope for these kinds of methods outperforming adversarial training? My sense from the certified defenses literature is that the estimates they produce are very weak, because of the problems with failing to model all the information in activations. (Note I'm not sure how weak the estimates actually are, since they usually report fraction of inputs which could be certified robust, rather than an estimate of the probability that a sampled input will cause a misclassification, which would be more analogous to your setting.)
- If your catastrophe detector involves a weak model running many many inferences, then it seems like the total number of layers is vastly larger than the number of layers in M, which seems like it will exacerbate the problems above by a lot. Any ideas for dealing with this?
- What's your proposal for the distribution for Method 2 (independent linear features)?
This suggests that we must model the entire distribution of activations simultaneously, instead of modeling each individual layer.
- Why think this is a cost you can pay? Even if we ignore the existence of C and just focus on M, and we just require modeling the correlations between any pair of layers (which of course can be broken by higher-order correlations), that is still quadratic in the number of parameters of M and so has a cost similar to training M in the first place. In practice I would assume it is a much higher cost (not least because C is so much larger than M).
Suppose you trained a regular SAE in the normal way with a dictionary size of 2304. Do you expect the latents to be systematically different from the ones in your meta-SAE?
For example, here's one systematic difference. The regular SAE is optimized to reconstruct activations uniformly sampled from your token dataset. The meta-SAE is optimized to reconstruct decoder vectors, which in turn were optimized to reconstruct activations from the token dataset -- however, different decoder vectors have different frequencies of firing in the token dataset, so uniform over decoder vectors != uniform over token dataset. This means that, relative to the regular SAE, the meta-SAE will tend to have less precise / granular latents for concepts that occur frequently in the token dataset, and more precise / granular latents for concepts that occur rarely in the token dataset (but are frequent enough that they are represented in the set of decoder vectors).
It's not totally clear which of these is "better" or more "fundamental", though if you're trying to optimize reconstructed loss, you should expect the regular SAE to do better based on this systematic difference.
(You could of course change the training for the meta-SAE to decrease this systematic difference, e.g. by sampling from the decoder vectors in proportion to their average magnitude over the token dataset, instead of sampling uniformly.)
The claim is verification is easier than generation. This post considers a completely different claim that "verification is easy", e.g.
How does the ease-of-verification delta propagate to AI?
if I apply the “verification is generally easy” delta to my models, then delegating alignment work to AI makes total sense.
if I apply a “verification is generally easy” delta, then I expect the world to generally contain far less low-hanging fruit
I just don't care much if the refrigerator or keyboard or tupperware or whatever might be bad in non-obvious ways that we failed to verify, unless you also argue that it would be easier to create better versions from scratch than to notice the flaws.
Now to be fair, maybe Paul and I are just fooling ourselves, and really all of our intuitions come from "verification is easy", which John gestures at:
He’s sometimes summarized this as “verification is easier than generation”, but I think his underlying intuition is somewhat stronger than that.
But I don't think "verification is easy" matters much to my views. Re: the three things you mention:
- From my perspective (and Paul's) the air conditioning thing had very little bearing on alignment.
- In principle I could see myself thinking bureaucracies are terrible given sufficient difficulty-of-verification. But like, most of my reasoning here is just looking at the world and noticing large bureaucracies often do better (see e.g. comments here). Note I am not saying large human bureaucracies don't have obvious, easily-fixable problems -- just that, in practice, they often do better than small orgs.
- Separately, from an alignment perspective, I don't care much what human bureaucracies look like, since they are very disanalogous to AI bureaucracies.
- If you take AI progress as exogenous (i.e. you can't affect it), outsourcing safety is a straightforward consequence of (a) not-super-discontinuous progress (sometimes called "slow takeoff") and (b) expecting new problems as capability increases.
- Once you get to AIs that are 2x smarter than you, and have to align the AIs that are going to be 4x smarter than you, it seems like either (a) you've failed to align the 2x AIs (in which case further human-only research seems unlikely to change much, so it doesn't change much if you outsource to the AIs and they defect) or (b) you have aligned the 2x AIs (in which case your odds for future AIs are surely better if you use the 2x AIs to do more alignment research).
- Obviously "how hard is verification" has implications for whether you work on slowing AI progress, but this doesn't seem central.
There's lots of complications I haven't discussed but I really don't think "verification is easy" ends up mattering very much to any of them.
Say that at dataset size , the distance between points is . Now consider a new distance -- what is the corresponding we need?
Intuitively, for each factor of 2 that is smaller than (which we can quantify as ), we need to multiply by another factor of .
So
That is, the distance scales as .