rohinmshah

(Meta: Going off of past experience I don't really expect to make much progress with more comments, so there's a decent chance I will bow out after this comment.)

I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less)

Why? Seems like it could go either way to me. To name one consideration in the opposite direction (without claiming this is the only consideration), the more powerful model can do a better job at finding the inputs on which the model would be misaligned, enabling you to train its successor across a wider variety of situations.

and goodhart's law definitely applies here.

I am having a hard time parsing this as having more content than "something could go wrong while bootstrapping". What is the metric that is undergoing optimization pressure during bootstrapping / amplified oversight that leads to decreased correlation with the true thing we should care about?

Is this intended only as a auditing mechanism, not a prevention mechanism

Yeah I'd expect debates to be an auditing mechanism if used at deployment time.

I also worry the "cheap system with high recall but low precision" will be too easy to fool for the system to be functional past a certain capability level.

Any alignment approach will always be subject to the critique "what if you failed and the AI became misaligned anyway and then past a certain capability level it evades all of your other defenses". I'm not trying to be robust to that critique.

I'm not saying I don't worry about fooling the cheap system -- I agree that's a failure mode to track. But useful conversation on this seems like it has to get into a more detailed argument, and at the very least has to be more contentful than "what if it didn't work".

The problem is RLHF already doesn't work

??? RLHF does work currently? What makes you think it doesn't work currently?

Comment by Rohin Shah (rohinmshah) on Google DeepMind: An Approach to Technical AGI Safety and Security · 2025-04-09T08:08:41.553Z · LW · GW

like being able to give the judge or debate partner the goal of actually trying to get to the truth

The idea is to set up a game in which the winning move is to be honest. There are theorems about the games that say something pretty close to this (though often they say "honesty is always a winning move" rather than "honesty is the only winning move"). These certainly depend on modeling assumptions but the assumptions are more like "assume the models are sufficiently capable" not "assume we can give them a goal". When applying this in practice there is also a clear divergence between what an equilibrium behavior is and what is found by RL in practice.

Despite all the caveats, I think it's wildly inaccurate to say that Amplified Oversight is assuming the ability to give the debate partner the goal of actually trying to get to the truth.

(I agree it is assuming that the judge has that goal, but I don't see why that's a terrible assumption.)

Are you stopping the agent periodically to have another debate about what it's working on and asking the human to review another debate?

You don't have to stop the agent, you can just do it afterwards.

can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?

Have you read AI safety via debate? It has really quite a lot of conceptual points, making both the case in favor and considering several different reasons to worry.

(To be clear, there is more research that has made progress, e.g. cross-examination is a big deal imo, but I think the original debate paper is more than enough to get to the bar you're outlining here.)

Comment by Rohin Shah (rohinmshah) on Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) · 2025-03-29T08:42:55.810Z · LW · GW

Rather, I think that most of the value lies in something more like "enabling oversight of cognition, despite not having data that isolates that cognition."

Is this a problem you expect to arise in practice? I don't really expect it to arise, if you're allowing for a significant amount of effort in creating that data (since I assume you'd also be putting a significant amount of effort into interpretability).

Comment by Rohin Shah (rohinmshah) on AGI Safety & Alignment @ Google DeepMind is hiring · 2025-03-05T07:43:51.186Z · LW · GW

We've got a lot of interest, so it's taking some time to go through applications. If you haven't heard back by the end of March, please ping me; hopefully it will be sooner than that.

Comment by Rohin Shah (rohinmshah) on AGI Safety & Alignment @ Google DeepMind is hiring · 2025-03-01T09:12:53.278Z · LW · GW

The answer to that question will determine which team will do the first review of your application. (We get enough applications that the first review costs quite a bit of time, so we don't want both teams to review all applications separately.)

You can still express interest in both teams (e.g. in the "Any other info" question), and the reviewer will take that into account and consider whether to move your application to the other team, but Gemini Safety reviewers aren't going to be as good at evaluating ASAT candidates, and vice versa, so you should choose the team that you think is a better fit for you.

Comment by Rohin Shah (rohinmshah) on AGI Safety & Alignment @ Google DeepMind is hiring · 2025-02-28T08:39:13.315Z · LW · GW

There are different interview processes. ASAT is more research-driven while Gemini Safety is more focused on execution and implementation. If you really don't know which of the two teams would be a better fit, you can submit a separate application for each.

Comment by Rohin Shah (rohinmshah) on AGI Safety & Alignment @ Google DeepMind is hiring · 2025-02-24T23:39:16.733Z · LW · GW

Our hiring this round is a small fraction of our overall team size, so this is really just correcting a minor imbalance, and shouldn't be taken as reflective of some big strategy. I'm guessing we'll go back to hiring a mix of the two around mid-2025.

Comment by Rohin Shah (rohinmshah) on AGI Safety & Alignment @ Google DeepMind is hiring · 2025-02-23T08:10:31.399Z · LW · GW

You can check out my career FAQ, as well as various other resources linked from there.

Comment by Rohin Shah (rohinmshah) on AGI Safety & Alignment @ Google DeepMind is hiring · 2025-02-18T20:19:08.559Z · LW · GW

Still pretty optimistic by the standards of the AGI safety field, somewhat shorter timelines than I reported in that post.

Neither of these really affect the work we do very much. I suppose if I were extremely pessimistic I would be doing something else, but even at a p(doom) of 50% I'd do basically the same things I'm doing now.

(And similarly individual team members have a wide variety of beliefs on both optimism and timelines. I actually don't know their beliefs on those topics very well because these beliefs are usually not that action-relevant for us.)

Comment by Rohin Shah (rohinmshah) on AGI Safety & Alignment @ Google DeepMind is hiring · 2025-02-18T10:12:47.291Z · LW · GW

More capability research than AGI safety research but idk what the ratio is and it's not something I can easily find out

Comment by Rohin Shah (rohinmshah) on AGI Safety & Alignment @ Google DeepMind is hiring · 2025-02-18T08:23:08.165Z · LW · GW

Since we have multiple roles, the interview process varies across candidates, but usually it would have around 3 stages that in total correspond to 4-8 hours of interviews.

Comment by Rohin Shah (rohinmshah) on AGI Safety & Alignment @ Google DeepMind is hiring · 2025-02-18T08:15:22.503Z · LW · GW

We'll leave it up until the later of those two (and probably somewhat beyond that, but that isn't guaranteed). I've edited the post.

Comment by Rohin Shah (rohinmshah) on Research directions Open Phil wants to fund in technical AI safety · 2025-02-14T10:54:00.570Z · LW · GW

Is that right?

Yes, that's broadly accurate, though one clarification:

This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly

That's a reason (and is probably sufficient by itself), but I think a more important reason is that your first attempt at using MONA is at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.

I think this will become much more likely once we actually start observing long-term optimization failures in prod.

Agreed, we're not advocating for using MONA now (and say so in the paper).

Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stems from long-term optimization?

Idk, to be effective I think this would need to be a pretty drastic increase in transparency, which seems incompatible with many security or non-proliferation intuitions, as well as business competitiveness concerns. (Unless you are thinking of lots of transparency to a very small set of people.)

Comment by Rohin Shah (rohinmshah) on Research directions Open Phil wants to fund in technical AI safety · 2025-02-11T23:34:38.277Z · LW · GW

If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point

That doesn't seem right. It can simultaneously be the case that you can't tell that there are problems stemming from long-term optimization problems when you don't use MONA, and also if you actually use MONA, then it will measurably improve quality.

For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which we'd penalize if we knew about it, but we don't realize that's happening). Later when things are put into production errors happen, but it's chalked up to "well it's hard to anticipate everything".

Instead you use MONA, and it doesn't learn to do this. You compare the resulting coding agent to the original agent, and notice that the MONA agent's lines of code are much more rarely implicated in future bugs, and conclude they are higher quality.

Comment by Rohin Shah (rohinmshah) on Research directions Open Phil wants to fund in technical AI safety · 2025-02-10T05:22:36.115Z · LW · GW

I meant "it's obvious you should use MONA if you are seeing problems with long-term optimization", which I believe is Fabien's position (otherwise it would be "hard to find").

Your reaction seems more like "it's obvious MONA would prevent multi-step reward hacks"; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).

Comment by Rohin Shah (rohinmshah) on Research directions Open Phil wants to fund in technical AI safety · 2025-02-09T18:25:43.251Z · LW · GW

I have some credence in all three of those bullet points.

For MONA it's a relatively even mixture of the first and second points.

(You are possibly the first person I know of who reacted to MONA with "that's obvious" instead of "that obviously won't perform well, why would anyone ever do it". Admittedly you are imagining a future hypothetical where it's obvious to everyone that long-term optimization is causing problems, but I don't think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)

For debate it's mostly the first point, and to some extent the third point.

Comment by Rohin Shah (rohinmshah) on Research directions Open Phil wants to fund in technical AI safety · 2025-02-09T18:23:40.278Z · LW · GW

Got it, that makes more sense. (When you said "methods work on toy domains" I interpreted "work" as a verb rather than a noun.)

But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings

I think by far the biggest open question is "how do you provide the nonmyopic approval so that the model actually performs well". I don't think anyone has even attempted to tackle this so it's hard to tell what you could learn about it, but I'd be surprised if there weren't generalizable lessons to be learned.

I agree that there's not much benefit in "methods work" if that is understood as "work on the algorithm / code that given data + rewards / approvals translates it into gradient updates". I care a lot more about iterating on how to produce the data + rewards / approvals.

My guess is that if debate did "work" to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality.

I'd weakly bet against this, I think there will be lots of fiddly design decisions that you need to get right to actually see the benefits, plus iterating on this is expensive and hard because it involves multiagent RL. (Certainly this is true of our current efforts; the question is just whether this will remain true in the future.)

For example I am interested in [...] debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn't show up in "feedback quality benchmarks (e.g. rewardbench)" but which increases risk (e.g. by making models more evil). But this sort of debate work is in the RFP.

I'm confused. This seems like the central example of work I'm talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you're talking about as well.)

EDIT: And tbc this is the kind of thing I mean by "improving average-case feedback quality". I now feel like I don't know what you mean by "feedback quality".

Comment by Rohin Shah (rohinmshah) on Research directions Open Phil wants to fund in technical AI safety · 2025-02-08T22:35:56.917Z · LW · GW

I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?

I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn't have to grapple as hard with the challenge of defining the approval feedback as I'd expect in a realistic deployment. But it does impose an alignment tax, so there's no point in using MONA currently, when good enough alignment is easy to achieve with RLHF and its variants, or RL on ground truth signals. I guess in some sense the question is "how big is the alignment tax", and I agree we don't know the answer to that yet and may not have enough understanding by the time it is relevant, but I don't really see why one would think "nah it'll only work in toy domains".

I agree debate doesn't work yet, though I think >50% chance we demonstrate decent results in some LLM domain (possibly a "toy" one) by the end of this year. Currently it seems to me like a key bottleneck (possibly the only one) is model capability, similarly to how model capability was a bottleneck to achieving the value of RL on ground truth until ~2024).

It also seems like it would still be useful if the methods were used some time after the labs want to use it for production runs.

It's wild to me that you're into moonshots when your objection to existing proposals is roughly "there isn't enough time for research to make them useful". Are you expecting the moonshots to be useful immediately?

Comment by Rohin Shah (rohinmshah) on Research directions Open Phil wants to fund in technical AI safety · 2025-02-08T22:11:40.545Z · LW · GW

I don't know of any existing work in this category, sorry. But e.g. one project would be "combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards", which in theory could work better than either one of them alone.

Comment by Rohin Shah (rohinmshah) on Research directions Open Phil wants to fund in technical AI safety · 2025-02-08T09:56:28.531Z · LW · GW

I'm excited to see this RFP out! Many of the topics in here seem like great targets for safety work.

I'm sad that there's so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is "alternatives to adversarial training". (There's also "new moonshots for aligning superintelligence" but I don't expect much to come out of that, and "white-box estimation of rare misbehavior" could help if you are willing to put optimization pressure against it, but that isn't described as a core desideratum, and I don't expect we get it. Work on externalized reasoning can also make alignment easier, but I'm not counting that as "directly relevant".) Everything else seems to mostly focus on evaluation or fundamental science that we hope pays off in the future. (To be clear, I think those are good and we should clearly put a lot of effort into them, just not all of our effort.)

Areas that I'm more excited about relative to the median area in this post (including some of your starred areas):

Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
Mild optimization. I'm particularly interested in MONA (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.
Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)

To be fair, a lot of this work is much harder to execute on, requiring near-frontier models, significant amounts of compute, and infrastructure to run RL on LLMs, and so is much better suited to frontier labs than to academia / nonprofits, and you will get fewer papers per $ invested in it. Nonetheless, academia / nonprofits are so much bigger than frontier labs that I think academics should still be taking shots at these problems.

(And tbc there are plenty of other areas directly relevant to alignment that I'm less excited about, e.g. improved variants of RLHF / Constitutional AI, leveraging other modalities of feedback for LLMs (see Table 1 here), and "gradient descent psychology" (empirically studying how fine-tuning techniques affect LLM behavior).)

Control evaluations are an attempt to conservatively evaluate the safety of protocols like AI-critiquing-AI (e.g., McAleese et al.), AI debate (Arnesen et al., Khan et al.), other scalable oversight methods

Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that's an alignment stress test.)

(A lot of the scalable oversight work from the last year is inference-only, but at least for me (and for the field historically) the goal is to scale this to training the AI system -- all of the theory depends on equilibrium behavior which you only get via training.)

Comment by Rohin Shah (rohinmshah) on Ten people on the inside · 2025-02-07T10:55:26.311Z · LW · GW

I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.

This is a case where I might be over indexing from experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly disagree with it.

(When I said that there are approximately zero useful things that don't make anyone's workflow harder, I definitely had in mind things like "you have to bug other people to get the info you need", it's just such a background part of my model that I didn't realize it was worth spelling out.)

Comment by Rohin Shah (rohinmshah) on In response to critiques of Guaranteed Safe AI · 2025-02-02T12:39:25.442Z · LW · GW

In broad strokes I agree with Zac. And tbc I'm generally a fan of formal verification and have done part of a PhD in program synthesis.

So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology

This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to "don't build such AIs", in which case I would appreciate that being stated more directly, or if it reduces to "limit the use of such AIs to tasks where we can formally verify soundness and uniqueness", in which case I'd like an estimate of what fraction of economically valuable work this corresponds to).

Can you sketch out how one produces a sound overapproximation of human psychology? Or how you construct a safety specification that the AIs won't exploit human psychology?

Comment by Rohin Shah (rohinmshah) on Ten people on the inside · 2025-01-30T10:55:11.968Z · LW · GW

I also agree with Zac, maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably don't accomplish much.

By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to ignore laws) I could imagine changing my mind here.)

EDIT: I could imagine doing asynchronous monitoring of internal deployments. This is still going to make some workflows harder, but probably not a ton, so it seems surmountable. Especially since you could combine it with async analyses that the unreasonable developer actually finds useful.

EDIT 2 (Feb 7): To be clear, I also disagree with the compute number. I'm on board with starting with 1% since they are 1% of the headcount. But then I would decrease it first because they're not a high-compute capabilities team, and second because whatever they are doing should be less useful to the company than whatever the other researchers are doing (otherwise why weren't the other researchers doing it?), so maybe I'd estimate 0.3%. But this isn't super cruxy because I think you can do useful safety work with just 0.3% of the compute.

Again, I could imagine getting more compute with a well-selected group of 10 people (though even then 3% seems unlikely, I'm imagining more like 1%), but I don't see why in this scenario you should assume you get a well-selected group, as opposed to 10 random AGI safety researchers.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-29T23:00:59.206Z · LW · GW

Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.

We do cite Paul's approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post at the time of publication, though weirdly I ran into it literally the day after we published everything.

But yes, mostly the goal from my perspective was (1) write the idea up more rigorously and clearly, (2) clarify where the safety benefits come from and distinguish clearly the difference from other stuff called "process supervision", (3) demonstrate benefits empirically. Also, nearly all of the authors have a much better sense of how this will work out in practice (even though I started the project with roughly as good an understanding of the idea as you had when writing your post, I think). I usually expect this type of effect with empirical projects but imo it was unusually large with this one.

Is that right?

Yup, that sounds basically right to me.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-29T22:42:19.806Z · LW · GW

Indeed this seems related! We'd hope that there isn't too much conflict between the different "generations", since the intent is for the overseer to enable long-term plans, but it seems like something that could happen if the overseer is bad enough.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-29T09:40:03.894Z · LW · GW

For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick^[1] actions, no?

If it were as simple as "just ask an LLM to choose actions" someone would have deployed this product a while ago.

But in any case I agree this isn't the most interesting case for MONA, I talked about it because that's what Daniel asked about.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-29T09:36:36.372Z · LW · GW

In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI development.)

In Test-driven development the argument above isn't quite correct, because we prompted the agent to be a bad programmer but didn't do this with the reward, so the overseer is "stronger" than the agent. However this was just because the agent is already very strongly finetuned to be good at coding so there was no headroom to climb, and we wanted to demonstrate that MONA would improve things if there was headroom to climb. I would bet that if we had a powerful model that wasn't yet finetuned strongly for coding, then we would once again see your option (3). The rewards are quite easy to provide -- just whether an individual test is valid and correct -- so I think a less capable model should be able to provide them, while still getting the benefits we see in the experiments we did run.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-29T09:23:20.014Z · LW · GW

Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you're using a prediction market it's no longer accurate to say that individual humans understand the strategy.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-27T09:23:36.867Z · LW · GW

(We've seen this comment and are looking into options)

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-25T09:28:16.235Z · LW · GW

Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it's more doable now.

For a variety of reasons the core team behind this paper has moved on to other things, so we won't get to it in the near future, but it would be great to see others working on this!

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-24T11:55:21.094Z · LW · GW

Thanks, and interesting generalization!

My thoughts depend on whether you train the weaker model.

If you are training the weaker model to solve the task, then the weaker model learns to simply always accept the advice of the stronger model, and stops being a constraint on what the stronger model can do.
If you aren't training the weaker model and are just training the stronger model based on whether it convinces the weak model, then you are probably not getting the benefits of RL (specifically: getting the model to generate actions that are evaluated to be good; this can amplify capability because evaluation is often easier / cheaper than generation)

I think (1) is pretty fatal to the proposal, but (2) is just a heuristic, I could imagine with more thought concluding that it was actually a reasonable approach to take.

That said, it is a more substantial alignment tax, since you are now requiring that only the smaller model can be deployed as an agent (whereas MONA can in principle be applied to the most capable model you have).

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-23T23:58:10.784Z · LW · GW

Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?"

And also "don't propagate rewards backwards in time", which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)

EDIT: And tbc, "don't propagate rewards backwards in time" is the primary focus in this paper -- in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper).

Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.

... As a person who works at a corporation, it's a bit tricky to speculate on this publicly, and I'm not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals.

Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn't much serious reward hacking going on at all (let alone multi step reward hacking), and so you probably don't want to use MONA. (See also the second limitation in the post.)

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-23T18:16:06.470Z · LW · GW

Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world?

It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you're imagining.

But it sounds like you're talking about a pretty weak (subhuman) AI system. So I'm not quite sure why you'd expect it to "hack" the judge.

Certainly sometimes the overseer will approve of some action that isn't actually the right one, e.g. when booking a table at a restaurant maybe they approve of the agent clicking on the "Menu" link because they aren't paying attention or whatever. But then after that the agent's next action should be to go back (and the overseer should approve that rather than something else). And the action after that should be to click on the "Book" link; the overseer shouldn't make the same mistake immediately (and they do get to see the history of the trajectory). So I find it plausible you get a somewhat inefficient agent that sometimes randomly clicks the wrong links, but I don't expect it to be useless for accomplishing tasks.

(Though really in the computer use setting I expect I'd recommend that the overseer gets to see the literal immediate consequence of the action (that is, the overseer sees ), mostly because that seems safe and should help avoid a bunch of dumb errors like the one above.)

If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:

Your description of the setup sounds reasonable, though given the weak levels of capability you're imagining I don't think you need any debate, you can just use a regular human overseer, or perhaps even an LLM overseer.

Also as mentioned above I'd probably recommend the overseer gets access to $s_{i + 1}$ but even if that weren't the case I'd still think it should be feasible to build a non-useless agent. (Though I'm not taking a stance on how it would compare to one trained with outcome RL.)

EDIT: I'm not sure how big each action you are considering is. If it's 10 tokens, such that you can only realistically do stuff at the level of "click this button", then I would also say that you should instead consider much larger actions, i.e. in each step the AI can take perhaps 100 such actions.

Comment by Rohin Shah (rohinmshah) on What’s the short timeline plan? · 2025-01-04T09:25:56.121Z · LW · GW

I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:

What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]
The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]
Layer 1 [...] Everything in this section seems very important to me [...]
1. We should try hard to keep a paradigm with faithful and human-legible CoT
[...]
4. In both worlds, we should use the other, i.e. control/monitoring, as a second line of defense.

Suppose that for the first AI that speeds up alignment research, you kept a paradigm with faithful and human-legible CoT, and you monitored the reasoning for bad reasoning / actions, but you didn't do other kinds of control as a second line of defense. Taken literally, your words imply that this would very likely yield catastrophically bad results. I find it hard to see a consistent view that endorses this position, without also believing that your full plan would very likely yield catastrophically bad results.

(My view is that faithful + human-legible CoT along with monitoring for the first AI that speeds up alignment research, would very likely ensure that AI system isn't successfully scheming, achieving the goal you set out. Whether there are later catastrophically bad results is still uncertain and depends on what happens afterwards.)

This is the clearest example, but I feel this way about a lot of the rhetoric in this post. E.g. I don't think it's crazy to imagine that without SL4 you still get good outcomes even if just by luck, I don't think a minimal stable solution involves most of the world's compute going towards alignment research.

To be clear, it's quite plausible that we want to do the actions you suggest, because even if they aren't literally necessary, they can still reduce risk and that is valuable. I'm just objecting to the claim that if we didn't have any one of them then we very likely get catastrophically bad results.

Comment by Rohin Shah (rohinmshah) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T19:07:28.045Z · LW · GW

OpenAI have already spent on the order of a million dollars just to score well on some benchmarks

Note this is many different inference runs each of which was thousands of dollars. I agree that people will spend billions of dollars on inference in total (which isn't specific to the o-series of models). My incredulity was at the idea of spending billions of dollars on a single episode, which is what I thought you were talking about given that you were talking about capability gains from scaling up inference-time compute.

Comment by Rohin Shah (rohinmshah) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T17:05:31.844Z · LW · GW

Re: (1), if you look through the thread for the comment of mine that was linked above, I respond to top-down heuristical-constraint-based search as well. I agree the response is different and not just "computational inefficiency".

Re: (2), I agree that near-future systems will be easily retargetable by just changing the prompt or the evaluator function (this isn't new to the o-series, you can also "retarget" any LLM chatbot by giving it a different prompt). If this continues to superintelligence, I would summarize it as "it turns out alignment wasn't a problem" (e.g. scheming never arose, we never had problems with LLMs exploiting systematic mistakes in our supervision, etc). I'd summarize this as "x-risky misalignment just doesn't happen by default", which I agree is plausible (see e.g. here), but when I'm talking about the viability of alignment plans like "retarget the search" I generally am assuming that there is some problem to solve.

(Also, random nitpick, who is talking about inference runs of billions of dollars???)

Comment by Rohin Shah (rohinmshah) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T16:17:39.024Z · LW · GW

I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work

I stand by my statement and don't think anything about the o-series model invalidates it.

And to be clear, I've expected for many years that early powerful AIs will be expensive to run, and have critiqued people for analyses that implicitly assumed or implied that the first powerful AIs will be cheap, prior to the o-series being released. (Though unfortunately for the two posts I'm thinking of, I made the critiques privately.)

There's a world of difference between "you can get better results by thinking longer" (yeah, obviously this was going to happen) and "the AI system is a mesa optimizer in the strong sense that it has an explicitly represented goal such that you can retarget the search" (I seriously doubt it for the first transformative AIs, and am uncertain for post-singularity superintelligence).

Comment by Rohin Shah (rohinmshah) on Takes on "Alignment Faking in Large Language Models" · 2024-12-20T06:27:01.422Z · LW · GW

Thus, you might’ve had a story like: “sure, AI systems might well end up with non-myopic motivations that create some incentive towards scheming. However, we’re also training them to behave according to various anti-scheming values – e.g., values like honesty, behaving-as-intended, etc. And these values will suffice to block schemer-like behavior overall.” Thus, on this story, anti-scheming values might function in a manner similar to anti-stealing values in a human employee considering stealing from her employer (and in a position to do so). It’s not that the human employee doesn’t want money. But her other values block her from trying to get it in this way.

From the rest of your post it seems like you're advocating for effectively maximal corrigibility and any instance of goal-guarding is a failure -- I agree a story that tries to rule that out takes a hit from this paper.

But I feel like the normal version of this story is more like "we're training the AI according to various anti-evil values, like the human notion of honesty (which allows white lies), harmlessness, behaving in accordance with human norms, etc. These values will suffice to block evil-schemer-like behavior overall -- where by evil scheming, I mean the kind where most humans would look at it and say that this is clearly bad, like the kind where the AI murders everybody".

(People don't usually describe it as "evil scheming" in particular, but that's the most central kind of "scheming" this community talks about.)

I think this story takes basically no hit from this paper, and is the story that people making this objection have in mind, as opposed to the one you've sketched out.

the “goal-guarding hypothesis” – that is, it needs to be the case, roughly, that scheming is a suitably effective strategy for preventing training from modifying an AI’s motivations. [...] the “Sleeper agents” paper [...] results were an update, for me, in favor of the goal-guarding hypothesis being true (roughly: because the backdoor behavior in the sleeper agents persisted despite safety training).

This doesn't seem right. There was no scheming going on in the sleeper agents paper? Maybe the idea is: in both cases (and in the goal-guarding hypothesis) the training outputs look nice even though there exists other inputs on which the AI behaves badly. But how training affects AIs will depend drastically on the internal cognition, and with sleeper agents there mostly wasn't overlap in cognition (the hope would have been "the backdoor isn't useful so SGD will remove it" which is pretty weak as hopes go). So I think the experiments in this paper are dramatically more informative and you should basically ignore sleeper agents in favor of these experiements. (Though tbc I think even these aren't very informative, as I wrote in my review, because the training is over a relatively narrow distribution. It's just really hard to study effects of training in a way that actually transfers over the analogy to future systems.)

EDIT: Actually on reflection the above was stated too strongly. You could imagine a threat model in which the AI fakes alignment so well and so reliably that it ~always does ~exactly what is desired during training. I don't find this threat model especially plausible, but for this threat model I agree the sleeper agents results have more bearing on the goal-guarding hypothesis.

partly because [alignment faking without CoT] shows up in the synthetic-document setting but not in the prompted setting (why???)

I expect because in the prompted setting the model has to move the relevant facts via attention heads to wherever the model is doing the relevant reasoning; this might be in significantly later layers which then means there aren't enough layers left to finish the full serial reasoning needed. Whereas in the synthetic-document setting, given how much fine-tuning happened, I bet the relevant facts are injected into the residual stream in very early layers (we already know lots of fact injection happens in early layers by default).

Thus, for example, to the extent one was tempted to say “of course harmless models scheme in order to avoid becoming helpful-only – after all, this helps them avoid causing future harm,” so too one could equally say “of course helpful-only models will scheme in order to avoid becoming harmless – after all, becoming harmless will make them less helpful in future.” But actually, you shouldn’t be saying “of course” in either of these cases. Non-myopic and myopic forms of both harmlessness and helpfulness are available as conceptual possibilities; it’s an empirical question which ones a given form of training will create; and you have to actually check.

Fwiw, I think the interesting objection here is not "of course harmless models scheme to avoid becoming helpful-only", but rather "of course you can find some settings where alignment faking happens -- that's the whole point of instrumental convergence". (And obviously the authors were in fact searching for a setting where alignment faking happens.) I don't know that I would have predicted that this would happen with Claude 3 Opus in particular, but I do think I would have predicted that we would get some demo like this early on.

Comment by Rohin Shah (rohinmshah) on Zach Stein-Perlman's Shortform · 2024-12-17T23:22:58.363Z · LW · GW

I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:

You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
You do train on synthetically generated dangerous actions, but you don't automatically label those as dangerous, instead you use debate to compute the labels. Sometimes some of the synthetically generated dangerous actions are actually not dangerous, and debate correctly recognizes this, allowing you to reduce your false positive rate.

On the meta level, I suspect that when considering

Technique A, that has a broad general argument plus some moderately-interesting concrete instantiations but no very-compelling concrete instantiations, and
Technique B, that has a few very compelling concrete instantiations

I tend to be relatively more excited about A compared to you (and probably this mostly explains the discrepancy here). I think the broad principle justifying this for me is "we'll figure out good things to do with A that are better than what we've brainstormed so far", which I think you're more skeptical of?

Comment by Rohin Shah (rohinmshah) on Zach Stein-Perlman's Shortform · 2024-12-17T05:41:16.730Z · LW · GW

You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.

Comment by Rohin Shah (rohinmshah) on Zach Stein-Perlman's Shortform · 2024-12-17T04:19:59.921Z · LW · GW

Yeah my bad, that's incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.

(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)

Comment by Rohin Shah (rohinmshah) on Zach Stein-Perlman's Shortform · 2024-12-17T04:16:10.514Z · LW · GW

I agree that this distinction is important -- I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).

I'm not in full agreement on your comments on the theories of change:

I'm pretty uncertain about the effects of bad reward signals on propensity for scheming / non-myopic reward hacking, and in particular I think the effects could be large.
I'm less worried about purely optimizing against a flawed reward signal though not unworried. I agree it doesn't strike us by surprise, but I also don't expect scheming to strike us by surprise? (I agree this is somewhat more likely for scheming.)
I do also generally feel good about making more useful AIs out of smaller models; I generally like having base models be smaller for a fixed level of competence (imo it reduces p(scheming)). Also if you're using your AIs for untrusted monitoring then they will probably be better at it than they otherwise would be.

Comment by Rohin Shah (rohinmshah) on Should there be just one western AGI project? · 2024-12-09T18:52:58.258Z · LW · GW

(Replied to Tom above)

Comment by Rohin Shah (rohinmshah) on Should there be just one western AGI project? · 2024-12-09T18:51:09.937Z · LW · GW

So the argument here is either that China is more responsive to "social proof" of the importance of AI (rather than observations of AI capabilities), or that China wants to compete with USG for competition's sake (e.g. showing they are as good as or better than USG)? I agree this is plausible.

It's a bit weird to me to call this an "incentive", since both of these arguments don't seem to be making any sort of appeal to rational self-interest on China's part. Maybe change it to "motivation"? I think that would have been clearer to me.

(Btw, you seem to be assuming that the core reason for centralization will be "beat China", but it could also be "make this technology safe". Presumably this would make a difference to this point as well as others in the post.)

Comment by Rohin Shah (rohinmshah) on Should there be just one western AGI project? · 2024-12-08T22:02:46.460Z · LW · GW

Tbc, I don't want to strongly claim that centralization implies shorter timelines. Besides the point you raise there's also things like bureaucracy and diseconomies of scale. I'm just trying to figure out what the authors of the post were saying.

That said, if I had to guess, I'd guess that centralization speeds up timelines.

Comment by Rohin Shah (rohinmshah) on Should there be just one western AGI project? · 2024-12-07T21:17:51.845Z · LW · GW

Your infosecurity argument seems to involve fixing a point in time, and comparing a (more capable) centralized AI project against multiple (less capable) decentralized AI projects. However, almost all of the risks you're considering depend much more on the capability of the AI project rather than the point in time at which they occur. So I think best practice here would be to fix a rough capability profile, and compare a (shorter timelines) centralized AI project against multiple (longer timelines) decentralized AI projects.

In more detail:

It’s not clear whether having one project would reduce the chance that the weights are stolen. We think that it would be harder to steal the weights of a single project, but the incentive to do so would also be stronger – it’s not clear how these balance out.

You don't really spell out why the incentive to steal the weights is stronger, but my guess is that your argument here is "centralization --> more resources --> more capabilities --> more incentive to steal the weights".

I would instead frame it as:

At a fixed capability level, the incentive to steal the weights will be the same, but the security practices of a centralized project will be improved. Therefore, holding capabilities fixed, having one project should reduce the chance that the weights are stolen.

Then separately I would also have a point that centralized AI projects get more resources and so should be expected to achieve a given capability profile sooner, which shortens timelines, the effects of which could then be considered separately (and which you presumably believe are less important, given that you don't really consider them in the post).

(I get somewhat similar vibes from the section on racing, particularly about the point that China might also speed up, though it's not quite as clear there.)

Comment by Rohin Shah (rohinmshah) on Yonatan Cale's Shortform · 2024-11-30T14:07:45.003Z · LW · GW

Regarding the rest of the article - it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it's easy).

Huh. If you think of that as capabilities I don't know what would count as alignment. What's an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?

E.g. it seems like you think RLHF counts as an alignment technique -- this seems like a central approach that you might use in BASALT.

If you hope to check if the agent will be aligned with no minecraft-specific alignment training, then sounds like we're on the same page!

I don't particularly imagine this, because you have to somehow communicate to the AI system what you want it to do, and AI systems don't seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)

Comment by Rohin Shah (rohinmshah) on Yonatan Cale's Shortform · 2024-11-28T09:01:30.459Z · LW · GW

https://bair.berkeley.edu/blog/2021/07/08/basalt/

Comment by Rohin Shah (rohinmshah) on Anthropic rewrote its RSP · 2024-10-15T20:11:08.247Z · LW · GW

You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.

I thought the whole point of this update was to specify when you start your comprehensive evals, rather than when you complete your comprehensive evals. The old RSP implied that evals must complete at most 3 months after the last evals were completed, which is awkward if you don't know how long comprehensive evals will take, and is presumably what led to the 3 day violation in the most recent round of evals.

(I think this is very reasonable, but I do think it means you can't quite say "we will do a comprehensive assessment at least every 6 months".)

There's also the point that Zach makes below that "routinely" isn't specified and implies that the comprehensive evals may not even start by the 6 month mark, but I assumed that was just an unfortunate side effect of how the section was written, and the intention was that evals will start at the 6 month mark.

Comment by Rohin Shah (rohinmshah) on EIS XIV: Is mechanistic interpretability about to be practically useful? · 2024-10-12T14:39:33.086Z · LW · GW

Once the next Anthropic, GDM, or OpenAI paper on SAEs comes out, I will evaluate my predictions in the same way as before.

Uhh... if we (GDM mech interp team) saw good results on any one of the eight things on your list, we'd probably write a paper just about that thing, rather than waiting to get even more results. And of course we might write an SAE paper that isn't about downstream uses (e.g. I'm also keen on general scientific validation of SAEs), or a paper reporting negative results, or a paper demonstrating downstream use that isn't one of your eight items, or a paper looking at downstream uses but not comparing against baselines. So just on this very basic outside view, I feel like the sum of your probabilities should be well under 100%, at least conditional on the next paper coming out of GDM. (I don't feel like it would be that different if the next paper comes from OpenAI / Anthropic.)

The problem here is "next SAE paper to come out" is a really fragile resolution criterion that depends hugely on unimportant details like "what the team decided was a publishable unit of work". I'd recommend you instead make time-based predictions (i.e. how likely are each of those to happen by some specific date).

User info

Posts

Comments