Posts

MONA: Managed Myopia with Approval Feedback 2025-01-23T12:24:18.108Z
AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work 2024-08-20T16:22:45.888Z
On scalable oversight with weak LLMs judging strong LLMs 2024-07-08T08:59:58.523Z
Improving Dictionary Learning with Gated Sparse Autoencoders 2024-04-25T18:43:47.003Z
AtP*: An efficient and scalable method for localizing LLM behaviour to components 2024-03-18T17:28:37.513Z
Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5) 2023-12-23T02:46:25.892Z
Fact Finding: How to Think About Interpreting Memorisation (Post 4) 2023-12-23T02:46:16.675Z
Fact Finding: Trying to Mechanistically Understanding Early MLPs (Post 3) 2023-12-23T02:46:05.517Z
Fact Finding: Simplifying the Circuit (Post 2) 2023-12-23T02:45:49.675Z
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) 2023-12-23T02:44:24.270Z
Discussion: Challenges with Unsupervised LLM Knowledge Discovery 2023-12-18T11:58:39.379Z
Explaining grokking through circuit efficiency 2023-09-08T14:39:23.910Z
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla 2023-07-20T10:50:58.611Z
Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes 2023-05-01T16:47:41.655Z
[Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy 2023-03-07T11:55:01.131Z
Categorizing failures as “outer” or “inner” misalignment is often confused 2023-01-06T15:48:51.739Z
Definitions of “objective” should be Probable and Predictive 2023-01-06T15:40:30.813Z
Refining the Sharp Left Turn threat model, part 2: applying alignment techniques 2022-11-25T14:36:08.948Z
Threat Model Literature Review 2022-11-01T11:03:22.610Z
Clarifying AI X-risk 2022-11-01T11:03:01.144Z
More examples of goal misgeneralization 2022-10-07T14:38:00.288Z
[AN #173] Recent language model results from DeepMind 2022-07-21T02:30:02.115Z
[AN #172] Sorry for the long hiatus! 2022-07-05T06:20:03.943Z
DeepMind is hiring for the Scalable Alignment and Alignment Teams 2022-05-13T12:17:13.157Z
Learning the smooth prior 2022-04-29T21:10:18.064Z
Shah and Yudkowsky on alignment failures 2022-02-28T19:18:23.015Z
[AN #171]: Disagreements between alignment "optimists" and "pessimists" 2022-01-21T18:30:04.824Z
Conversation on technology forecasting and gradualism 2021-12-09T21:23:21.187Z
[AN #170]: Analyzing the argument for risk from power-seeking AI 2021-12-08T18:10:04.022Z
[AN #169]: Collaborating with humans without human data 2021-11-24T18:30:03.795Z
[AN #168]: Four technical topics for which Open Phil is soliciting grant proposals 2021-10-28T17:20:03.387Z
[AN #167]: Concrete ML safety problems and their relevance to x-risk 2021-10-20T17:10:03.690Z
[AN #166]: Is it crazy to claim we're in the most important century? 2021-10-08T17:30:11.819Z
[AN #165]: When large models are more likely to lie 2021-09-22T17:30:04.674Z
[AN #164]: How well can language models write code? 2021-09-15T17:20:03.850Z
[AN #163]: Using finite factored sets for causal and temporal inference 2021-09-08T17:20:04.522Z
[AN #162]: Foundation models: a paradigm shift within AI 2021-08-27T17:20:03.831Z
[AN #161]: Creating generalizable reward functions for multiple tasks by learning a model of functional similarity 2021-08-20T17:20:04.380Z
[AN #160]: Building AIs that learn and think like people 2021-08-13T17:10:04.335Z
[AN #159]: Building agents that know how to experiment, by training on procedurally generated games 2021-08-04T17:10:03.823Z
[AN #158]: Should we be optimistic about generalization? 2021-07-29T17:20:03.409Z
[AN #157]: Measuring misalignment in the technology underlying Copilot 2021-07-23T17:20:03.424Z
[AN #156]: The scaling hypothesis: a plan for building AGI 2021-07-16T17:10:05.809Z
BASALT: A Benchmark for Learning from Human Feedback 2021-07-08T17:40:35.045Z
[AN #155]: A Minecraft benchmark for algorithms that learn without reward functions 2021-07-08T17:20:02.518Z
[AN #154]: What economic growth theory has to say about transformative AI 2021-06-30T17:20:03.292Z
[AN #153]: Experiments that demonstrate failures of objective robustness 2021-06-26T17:10:02.819Z
[AN #152]: How we’ve overestimated few-shot learning capabilities 2021-06-16T17:20:04.454Z
[AN #151]: How sparsity in the final layer makes a neural net debuggable 2021-05-19T17:20:04.453Z
[AN #150]: The subtypes of Cooperative AI research 2021-05-12T17:20:27.267Z

Comments

Comment by Rohin Shah (rohinmshah) on Research directions Open Phil wants to fund in technical AI safety · 2025-02-08T09:56:28.531Z · LW · GW

I'm excited to see this RFP out! Many of the topics in here seem like great targets for safety work.

I'm sad that there's so little emphasis in this RFP on alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is "alternatives to adversarial training". (There's also "new moonshots for aligning superintelligence" but I don't expect much to come out of that, and "white-box estimation of rare misbehavior" could help if you are willing to put optimization pressure against it, but that isn't described as a core desideratum, and I don't expect we get it. Work on externalized reasoning can also make alignment easier, but I'm not counting that as "directly relevant".) Everything else seems to mostly focus on evaluation or fundamental science that we hope pays off in the future. (To be clear, I think those are good and we should clearly put a lot of effort into them, just not all of our effort.)

Areas that I'm more excited about relative to the median area in this post (including some of your starred areas):

  • Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
  • Mild optimization. I'm particularly interested in MONA (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.
  • Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)

To be fair, a lot of this work is much harder to execute on, requiring near-frontier models, significant amounts of compute, and infrastructure to run RL on LLMs, and so is much better suited to frontier labs than to academia / nonprofits, and you will get fewer papers per $ invested in it. Nonetheless, academia / nonprofits are so much bigger than frontier labs that I think academics should still be taking shots at these problems.

(And tbc there are plenty of other areas directly relevant to alignment that I'm less excited about, e.g. improved variants of RLHF / Constitutional AI, leveraging other modalities of feedback for LLMs (see Table 1 here), and "gradient descent psychology" (empirically studying how fine-tuning techniques affect LLM behavior).)

Control evaluations are an attempt to conservatively evaluate the safety of protocols like AI-critiquing-AI (e.g., McAleese et al.), AI debate (Arnesen et al., Khan et al.), other scalable oversight methods

Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time. Taking the models produced by scalable oversight and replacing them with adversarial models would completely ignore the proposed mechanism. (You could start out with adversarial models and see whether scalable oversight fixes them, but that's an alignment stress test.)

(A lot of the scalable oversight work from the last year is inference-only, but at least for me (and for the field historically) the goal is to scale this to training the AI system -- all of the theory depends on equilibrium behavior which you only get via training.)

Comment by Rohin Shah (rohinmshah) on Ten people on the inside · 2025-02-07T10:55:26.311Z · LW · GW

I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.

This is a case where I might be over-indexing on my experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly disagree with it.

(When I said that there are approximately zero useful things that don't make anyone's workflow harder, I definitely had in mind things like "you have to bug other people to get the info you need"; it's just such a background part of my model that I didn't realize it was worth spelling out.)

Comment by Rohin Shah (rohinmshah) on In response to critiques of Guaranteed Safe AI · 2025-02-02T12:39:25.442Z · LW · GW

In broad strokes I agree with Zac. And tbc I'm generally a fan of formal verification and have done part of a PhD in program synthesis.

So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology

This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to "don't build such AIs", in which case I would appreciate that being stated more directly, or if it reduces to "limit the use of such AIs to tasks where we can formally verify soundness and uniqueness", in which case I'd like an estimate of what fraction of economically valuable work this corresponds to).

Can you sketch out how one produces a sound overapproximation of human psychology? Or how you construct a safety specification that the AIs won't exploit human psychology?

Comment by Rohin Shah (rohinmshah) on Ten people on the inside · 2025-01-30T10:55:11.968Z · LW · GW

I also agree with Zac: maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably wouldn't accomplish much.

By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to ignore laws) I could imagine changing my mind here.)

EDIT: I could imagine doing asynchronous monitoring of internal deployments. This is still going to make some workflows harder, but probably not a ton, so it seems surmountable. Especially since you could combine it with async analyses that the unreasonable developer actually finds useful.

EDIT 2 (Feb 7): To be clear, I also disagree with the compute number. I'm on board with starting with 1% since they are 1% of the headcount. But then I would decrease it first because they're not a high-compute capabilities team, and second because whatever they are doing should be less useful to the company than whatever the other researchers are doing (otherwise why weren't the other researchers doing it?), so maybe I'd estimate 0.3%. But this isn't super cruxy because I think you can do useful safety work with just 0.3% of the compute.

Again, I could imagine getting more compute with a well-selected group of 10 people (though even then 3% seems unlikely, I'm imagining more like 1%), but I don't see why in this scenario you should assume you get a well-selected group, as opposed to 10 random AGI safety researchers.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-29T23:00:59.206Z · LW · GW

Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.

We do cite Paul's approval-directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post at the time of publication, though weirdly I ran into it literally the day after we published everything.

But yes, mostly the goal from my perspective was (1) write the idea up more rigorously and clearly, (2) clarify where the safety benefits come from and distinguish clearly the difference from other stuff called "process supervision", (3) demonstrate benefits empirically. Also, nearly all of the authors have a much better sense of how this will work out in practice (even though I started the project with roughly as good an understanding of the idea as you had when writing your post, I think). I usually expect this type of effect with empirical projects but imo it was unusually large with this one.

Is that right?

Yup, that sounds basically right to me.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-29T22:42:19.806Z · LW · GW

Indeed this seems related! We'd hope that there isn't too much conflict between the different "generations", since the intent is for the overseer to enable long-term plans, but it seems like something that could happen if the overseer is bad enough.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-29T09:40:03.894Z · LW · GW

For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?

If it were as simple as "just ask an LLM to choose actions" someone would have deployed this product a while ago.

But in any case I agree this isn't the most interesting case for MONA, I talked about it because that's what Daniel asked about.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-29T09:36:36.372Z · LW · GW

In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI development.)

In Test-driven development the argument above isn't quite correct, because we prompted the agent to be a bad programmer but didn't do this with the reward, so the overseer is "stronger" than the agent. However this was just because the agent is already very strongly finetuned to be good at coding so there was no headroom to climb, and we wanted to demonstrate that MONA would improve things if there was headroom to climb. I would bet that if we had a powerful model that wasn't yet finetuned strongly for coding, then we would once again see your option (3). The rewards are quite easy to provide -- just whether an individual test is valid and correct -- so I think a less capable model should be able to provide them, while still getting the benefits we see in the experiments we did run.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-29T09:23:20.014Z · LW · GW

Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you're using a prediction market it's no longer accurate to say that individual humans understand the strategy.

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-27T09:23:36.867Z · LW · GW

(We've seen this comment and are looking into options)

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-25T09:28:16.235Z · LW · GW

Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it's more doable now.

For a variety of reasons the core team behind this paper has moved on to other things, so we won't get to it in the near future, but it would be great to see others working on this!

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-24T11:55:21.094Z · LW · GW

Thanks, and interesting generalization!

My thoughts depend on whether you train the weaker model.

  1. If you are training the weaker model to solve the task, then the weaker model learns to simply always accept the advice of the stronger model, and stops being a constraint on what the stronger model can do.
  2. If you aren't training the weaker model and are just training the stronger model based on whether it convinces the weaker model, then you are probably not getting the benefits of RL (specifically: getting the model to generate actions that are evaluated to be good; this can amplify capability because evaluation is often easier / cheaper than generation).

I think (1) is pretty fatal to the proposal, but (2) is just a heuristic, I could imagine with more thought concluding that it was actually a reasonable approach to take.

That said, it is a more substantial alignment tax, since you are now requiring that only the smaller model can be deployed as an agent (whereas MONA can in principle be applied to the most capable model you have).

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-23T23:58:10.784Z · LW · GW

Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?" 

And also "don't propagate rewards backwards in time", which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)

EDIT: And tbc, "don't propagate rewards backwards in time" is the primary focus in this paper -- in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper).
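
To illustrate the distinction with a toy sketch (made-up rewards, not the paper's actual implementation): with identical per-step feedback, ordinary RL credits each step with all future rewards, while a MONA-style myopic objective scores each step only on its own reward.

```python
# Toy sketch (not the paper's setup): identical per-step rewards,
# the only difference is whether later rewards are propagated backwards in time.

def regular_rl_returns(rewards, gamma=1.0):
    """Each step is credited with all future (discounted) rewards."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mona_returns(rewards):
    """Myopic objective: each step is scored only by its own immediate reward."""
    return list(rewards)

step_rewards = [0.2, 0.1, 1.0]            # hypothetical approval + environment feedback
print(regular_rl_returns(step_rewards))    # [1.3, 1.1, 1.0] -- early steps get credit for the final reward
print(mona_returns(step_rewards))          # [0.2, 0.1, 1.0] -- no backwards propagation
```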

Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.

... As a person who works at a corporation, it's a bit tricky to speculate on this publicly, and I'm not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals.

Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn't much serious reward hacking going on at all (let alone multi step reward hacking), and so you probably don't want to use MONA. (See also the second limitation in the post.)

Comment by Rohin Shah (rohinmshah) on MONA: Managed Myopia with Approval Feedback · 2025-01-23T18:16:06.470Z · LW · GW

Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world?

It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you're imagining.

But it sounds like you're talking about a pretty weak (subhuman) AI system. So I'm not quite sure why you'd expect it to "hack" the judge.

Certainly sometimes the overseer will approve of some action that isn't actually the right one, e.g. when booking a table at a restaurant maybe they approve of the agent clicking on the "Menu" link because they aren't paying attention or whatever. But then after that the agent's next action should be to go back (and the overseer should approve that rather than something else). And the action after that should be to click on the "Book" link; the overseer shouldn't make the same mistake immediately (and they do get to see the history of the trajectory). So I find it plausible you get a somewhat inefficient agent that sometimes randomly clicks the wrong links, but I don't expect it to be useless for accomplishing tasks.

(Though really in the computer use setting I expect I'd recommend that the overseer gets to see the literal immediate consequence of the action (that is, the overseer sees the resulting observation), mostly because that seems safe and should help avoid a bunch of dumb errors like the one above.)

If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:

Your description of the setup sounds reasonable, though given the weak levels of capability you're imagining I don't think you need any debate, you can just use a regular human overseer, or perhaps even an LLM overseer.

Also as mentioned above I'd probably recommend the overseer gets access to the immediate consequence of the action, but even if that weren't the case I'd still think it should be feasible to build a non-useless agent. (Though I'm not taking a stance on how it would compare to one trained with outcome RL.)

EDIT: I'm not sure how big each action you are considering is. If it's 10 tokens, such that you can only realistically do stuff at the level of "click this button", then I would also say that you should instead consider much larger actions, i.e. in each step the AI can take perhaps 100 such actions.

Comment by Rohin Shah (rohinmshah) on What’s the short timeline plan? · 2025-01-04T09:25:56.121Z · LW · GW

I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:

What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]

The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]

Layer 1 [...] Everything in this section seems very important to me [...]

1. We should try hard to keep a paradigm with faithful and human-legible CoT

[...]

4. In both worlds, we should use the other, i.e. control/monitoring, as a second line of defense.

Suppose that for the first AI that speeds up alignment research, you kept a paradigm with faithful and human-legible CoT, and you monitored the reasoning for bad reasoning / actions, but you didn't do other kinds of control as a second line of defense. Taken literally, your words imply that this would very likely yield catastrophically bad results. I find it hard to see a consistent view that endorses this position, without also believing that your full plan would very likely yield catastrophically bad results.

(My view is that faithful + human-legible CoT along with monitoring for the first AI that speeds up alignment research, would very likely ensure that AI system isn't successfully scheming, achieving the goal you set out. Whether there are later catastrophically bad results is still uncertain and depends on what happens afterwards.)

This is the clearest example, but I feel this way about a lot of the rhetoric in this post. E.g. I don't think it's crazy to imagine that without SL4 you still get good outcomes even if just by luck, and I don't think a minimal stable solution involves most of the world's compute going towards alignment research.

To be clear, it's quite plausible that we want to do the actions you suggest, because even if they aren't literally necessary, they can still reduce risk and that is valuable. I'm just objecting to the claim that if we didn't have any one of them then we very likely get catastrophically bad results.

Comment by Rohin Shah (rohinmshah) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T19:07:28.045Z · LW · GW

OpenAI have already spent on the order of a million dollars just to score well on some benchmarks

Note this is many different inference runs each of which was thousands of dollars. I agree that people will spend billions of dollars on inference in total (which isn't specific to the o-series of models). My incredulity was at the idea of spending billions of dollars on a single episode, which is what I thought you were talking about given that you were talking about capability gains from scaling up inference-time compute.

Comment by Rohin Shah (rohinmshah) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T17:05:31.844Z · LW · GW

Re: (1), if you look through the thread for the comment of mine that was linked above, I respond to top-down heuristical-constraint-based search as well. I agree the response is different and not just "computational inefficiency".

Re: (2), I agree that near-future systems will be easily retargetable by just changing the prompt or the evaluator function (this isn't new to the o-series, you can also "retarget" any LLM chatbot by giving it a different prompt). If this continues to superintelligence, I would summarize it as "it turns out alignment wasn't a problem" (e.g. scheming never arose, we never had problems with LLMs exploiting systematic mistakes in our supervision, etc). I'd summarize this as "x-risky misalignment just doesn't happen by default", which I agree is plausible (see e.g. here), but when I'm talking about the viability of alignment plans like "retarget the search" I generally am assuming that there is some problem to solve.

(Also, random nitpick, who is talking about inference runs of billions of dollars???)

Comment by Rohin Shah (rohinmshah) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T16:17:39.024Z · LW · GW

I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work

I stand by my statement and don't think anything about the o-series model invalidates it.

And to be clear, I've expected for many years that early powerful AIs will be expensive to run, and have critiqued people for analyses that implicitly assumed or implied that the first powerful AIs will be cheap, prior to the o-series being released. (Though unfortunately for the two posts I'm thinking of, I made the critiques privately.)

There's a world of difference between "you can get better results by thinking longer" (yeah, obviously this was going to happen) and "the AI system is a mesa optimizer in the strong sense that it has an explicitly represented goal such that you can retarget the search" (I seriously doubt it for the first transformative AIs, and am uncertain for post-singularity superintelligence).

Comment by Rohin Shah (rohinmshah) on Takes on "Alignment Faking in Large Language Models" · 2024-12-20T06:27:01.422Z · LW · GW

Thus, you might’ve had a story like: “sure, AI systems might well end up with non-myopic motivations that create some incentive towards scheming. However, we’re also training them to behave according to various anti-scheming values – e.g., values like honesty, behaving-as-intended, etc. And these values will suffice to block schemer-like behavior overall.” Thus, on this story, anti-scheming values might function in a manner similar to anti-stealing values in a human employee considering stealing from her employer (and in a position to do so). It’s not that the human employee doesn’t want money. But her other values block her from trying to get it in this way.

From the rest of your post it seems like you're advocating for effectively maximal corrigibility and any instance of goal-guarding is a failure -- I agree a story that tries to rule that out takes a hit from this paper.

But I feel like the normal version of this story is more like "we're training the AI according to various anti-evil values, like the human notion of honesty (which allows white lies), harmlessness, behaving in accordance with human norms, etc. These values will suffice to block evil-schemer-like behavior overall -- where by evil scheming, I mean the kind where most humans would look at it and say that this is clearly bad, like the kind where the AI murders everybody".

(People don't usually describe it as "evil scheming" in particular, but that's the most central kind of "scheming" this community talks about.)

I think this story takes basically no hit from this paper, and is the story that people making this objection have in mind, as opposed to the one you've sketched out.

the “goal-guarding hypothesis” – that is, it needs to be the case, roughly, that scheming is a suitably effective strategy for preventing training from modifying an AI’s motivations. [...] the “Sleeper agents” paper [...] results were an update, for me, in favor of the goal-guarding hypothesis being true (roughly: because the backdoor behavior in the sleeper agents persisted despite safety training).

This doesn't seem right. There was no scheming going on in the sleeper agents paper? Maybe the idea is: in both cases (and in the goal-guarding hypothesis) the training outputs look nice even though there exist other inputs on which the AI behaves badly. But how training affects AIs will depend drastically on the internal cognition, and with sleeper agents there mostly wasn't overlap in cognition (the hope would have been "the backdoor isn't useful so SGD will remove it" which is pretty weak as hopes go). So I think the experiments in this paper are dramatically more informative and you should basically ignore sleeper agents in favor of these experiments. (Though tbc I think even these aren't very informative, as I wrote in my review, because the training is over a relatively narrow distribution. It's just really hard to study effects of training in a way that actually transfers over the analogy to future systems.)

EDIT: Actually on reflection the above was stated too strongly. You could imagine a threat model in which the AI fakes alignment so well and so reliably that it ~always does ~exactly what is desired during training. I don't find this threat model especially plausible, but for this threat model I agree the sleeper agents results have more bearing on the goal-guarding hypothesis.

partly because [alignment faking without CoT] shows up in the synthetic-document setting but not in the prompted setting (why???)

I expect it's because in the prompted setting the model has to move the relevant facts via attention heads to wherever the model is doing the relevant reasoning; this might be in significantly later layers which then means there aren't enough layers left to finish the full serial reasoning needed. Whereas in the synthetic-document setting, given how much fine-tuning happened, I bet the relevant facts are injected into the residual stream in very early layers (we already know lots of fact injection happens in early layers by default).

Thus, for example, to the extent one was tempted to say “of course harmless models scheme in order to avoid becoming helpful-only – after all, this helps them avoid causing future harm,” so too one could equally say “of course helpful-only models will scheme in order to avoid becoming harmless – after all, becoming harmless will make them less helpful in future.” But actually, you shouldn’t be saying “of course” in either of these cases. Non-myopic and myopic forms of both harmlessness and helpfulness are available as conceptual possibilities; it’s an empirical question which ones a given form of training will create; and you have to actually check.

Fwiw, I think the interesting objection here is not "of course harmless models scheme to avoid becoming helpful-only", but rather "of course you can find some settings where alignment faking happens -- that's the whole point of instrumental convergence". (And obviously the authors were in fact searching for a setting where alignment faking happens.) I don't know that I would have predicted that this would happen with Claude 3 Opus in particular, but I do think I would have predicted that we would get some demo like this early on.

Comment by Rohin Shah (rohinmshah) on Zach Stein-Perlman's Shortform · 2024-12-17T23:22:58.363Z · LW · GW

I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:

  1. You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
  2. You do train on synthetically generated dangerous actions, but you don't automatically label those as dangerous; instead you use debate to compute the labels. Sometimes some of the synthetically generated dangerous actions are actually not dangerous, and debate correctly recognizes this, allowing you to reduce your false positive rate. (A rough sketch of this labeling pipeline is below.)
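
A rough sketch of what option 2 could look like (run_debate and judge are hypothetical stand-ins, not an existing API):

```python
# Hypothetical sketch of option 2: synthetic actions are labeled by a debate
# plus a judge, rather than being auto-labeled as dangerous. run_debate and
# judge are placeholders, not an existing API.

def label_with_debate(synthetic_actions, run_debate, judge):
    labeled = []
    for action in synthetic_actions:
        # Debaters argue for and against "this action is dangerous";
        # the judge reads the transcript and picks a side.
        transcript = run_debate(claim=f"The action {action!r} is dangerous")
        is_dangerous = judge(transcript)
        labeled.append((action, is_dangerous))
    return labeled  # training data for the untrusted monitor

# Some synthetically generated "dangerous" actions may correctly end up with
# is_dangerous=False, which is what reduces the monitor's false positive rate.
```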

On the meta level, I suspect that when considering 

  • Technique A, that has a broad general argument plus some moderately-interesting concrete instantiations but no very-compelling concrete instantiations, and
  • Technique B, that has a few very compelling concrete instantiations

I tend to be relatively more excited about A compared to you (and probably this mostly explains the discrepancy here). I think the broad principle justifying this for me is "we'll figure out good things to do with A that are better than what we've brainstormed so far", which I think you're more skeptical of?

Comment by Rohin Shah (rohinmshah) on Zach Stein-Perlman's Shortform · 2024-12-17T05:41:16.730Z · LW · GW

You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.

Comment by Rohin Shah (rohinmshah) on Zach Stein-Perlman's Shortform · 2024-12-17T04:19:59.921Z · LW · GW

Yeah my bad, that's incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.

(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)

Comment by Rohin Shah (rohinmshah) on Zach Stein-Perlman's Shortform · 2024-12-17T04:16:10.514Z · LW · GW

I agree that this distinction is important -- I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).

I'm not in full agreement on your comments on the theories of change:

  1. I'm pretty uncertain about the effects of bad reward signals on propensity for scheming / non-myopic reward hacking, and in particular I think the effects could be large.
  2. I'm less worried about purely optimizing against a flawed reward signal though not unworried. I agree it doesn't strike us by surprise, but I also don't expect scheming to strike us by surprise? (I agree this is somewhat more likely for scheming.)
  3. I do also generally feel good about making more useful AIs out of smaller models; I generally like having base models be smaller for a fixed level of competence (imo it reduces p(scheming)). Also if you're using your AIs for untrusted monitoring then they will probably be better at it than they otherwise would be.

Comment by Rohin Shah (rohinmshah) on Should there be just one western AGI project? · 2024-12-09T18:52:58.258Z · LW · GW

(Replied to Tom above)

Comment by Rohin Shah (rohinmshah) on Should there be just one western AGI project? · 2024-12-09T18:51:09.937Z · LW · GW

So the argument here is either that China is more responsive to "social proof" of the importance of AI (rather than observations of AI capabilities), or that China wants to compete with USG for competition's sake (e.g. showing they are as good as or better than USG)? I agree this is plausible.

It's a bit weird to me to call this an "incentive", since both of these arguments don't seem to be making any sort of appeal to rational self-interest on China's part. Maybe change it to "motivation"? I think that would have been clearer to me.

(Btw, you seem to be assuming that the core reason for centralization will be "beat China", but it could also be "make this technology safe". Presumably this would make a difference to this point as well as others in the post.)

Comment by Rohin Shah (rohinmshah) on Should there be just one western AGI project? · 2024-12-08T22:02:46.460Z · LW · GW

Tbc, I don't want to strongly claim that centralization implies shorter timelines. Besides the point you raise there's also things like bureaucracy and diseconomies of scale. I'm just trying to figure out what the authors of the post were saying.

That said, if I had to guess, I'd guess that centralization speeds up timelines.

Comment by Rohin Shah (rohinmshah) on Should there be just one western AGI project? · 2024-12-07T21:17:51.845Z · LW · GW

Your infosecurity argument seems to involve fixing a point in time, and comparing a (more capable) centralized AI project against multiple (less capable) decentralized AI projects. However, almost all of the risks you're considering depend much more on the capability of the AI project rather than the point in time at which they occur. So I think best practice here would be to fix a rough capability profile, and compare a (shorter timelines) centralized AI project against multiple (longer timelines) decentralized AI projects.

In more detail:

It’s not clear whether having one project would reduce the chance that the weights are stolen. We think that it would be harder to steal the weights of a single project, but the incentive to do so would also be stronger – it’s not clear how these balance out.

You don't really spell out why the incentive to steal the weights is stronger, but my guess is that your argument here is "centralization --> more resources --> more capabilities --> more incentive to steal the weights".

I would instead frame it as:

At a fixed capability level, the incentive to steal the weights will be the same, but the security practices of a centralized project will be improved. Therefore, holding capabilities fixed, having one project should reduce the chance that the weights are stolen.

Then separately I would also have a point that centralized AI projects get more resources and so should be expected to achieve a given capability profile sooner, which shortens timelines, the effects of which could then be considered separately (and which you presumably believe are less important, given that you don't really consider them in the post).

(I get somewhat similar vibes from the section on racing, particularly about the point that China might also speed up, though it's not quite as clear there.)

Comment by Rohin Shah (rohinmshah) on Yonatan Cale's Shortform · 2024-11-30T14:07:45.003Z · LW · GW

Regarding the rest of the article - it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it's easy). 

Huh. If you think of that as capabilities I don't know what would count as alignment. What's an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?

E.g. it seems like you think RLHF counts as an alignment technique -- this seems like a central approach that you might use in BASALT.

If you hope to check if the agent will be aligned with no minecraft-specific alignment training, then sounds like we're on the same page!

I don't particularly imagine this, because you have to somehow communicate to the AI system what you want it to do, and AI systems don't seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)

Comment by Rohin Shah (rohinmshah) on Yonatan Cale's Shortform · 2024-11-28T09:01:30.459Z · LW · GW

https://bair.berkeley.edu/blog/2021/07/08/basalt/

Comment by Rohin Shah (rohinmshah) on Anthropic rewrote its RSP · 2024-10-15T20:11:08.247Z · LW · GW

You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.

I thought the whole point of this update was to specify when you start your comprehensive evals, rather than when you complete your comprehensive evals. The old RSP implied that evals must complete at most 3 months after the last evals were completed, which is awkward if you don't know how long comprehensive evals will take, and is presumably what led to the 3 day violation in the most recent round of evals.

(I think this is very reasonable, but I do think it means you can't quite say "we will do a comprehensive assessment at least every 6 months".)

There's also the point that Zach makes below that "routinely" isn't specified and implies that the comprehensive evals may not even start by the 6 month mark, but I assumed that was just an unfortunate side effect of how the section was written, and the intention was that evals will start at the 6 month mark.

Comment by Rohin Shah (rohinmshah) on EIS XIV: Is mechanistic interpretability about to be practically useful? · 2024-10-12T14:39:33.086Z · LW · GW

Once the next Anthropic, GDM, or OpenAI paper on SAEs comes out, I will evaluate my predictions in the same way as before.

Uhh... if we (GDM mech interp team) saw good results on any one of the eight things on your list, we'd probably write a paper just about that thing, rather than waiting to get even more results. And of course we might write an SAE paper that isn't about downstream uses (e.g. I'm also keen on general scientific validation of SAEs), or a paper reporting negative results, or a paper demonstrating downstream use that isn't one of your eight items, or a paper looking at downstream uses but not comparing against baselines. So just on this very basic outside view, I feel like the sum of your probabilities should be well under 100%, at least conditional on the next paper coming out of GDM. (I don't feel like it would be that different if the next paper comes from OpenAI / Anthropic.)

The problem here is "next SAE paper to come out" is a really fragile resolution criterion that depends hugely on unimportant details like "what the team decided was a publishable unit of work". I'd recommend you instead make time-based predictions (i.e. how likely are each of those to happen by some specific date).

Comment by Rohin Shah (rohinmshah) on Mark Xu's Shortform · 2024-10-07T09:13:15.219Z · LW · GW

This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me. E.g. my categorization would be something like:

  • Clearly alignment: debate theory, certain flavors of process supervision
  • Clearly control: removing affordances (e.g. "don't connect the model to the Internet")
  • Could be either one: interpretability, critique models (in control this is called "untrusted monitoring"), most conceptions of ELK, generating inputs on which models behave badly, anomaly detection, capability evaluations, faithful chain of thought, ...

Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn't make much sense for research, most of which helps with both. It does make sense as a distinction for "what plan will you implement in practice" -- but labs have said very little publicly about that.

Other things that characterize work done under the name of "control" so far are (1) it tries to be very concrete about its threat models, to a greater degree than most other work in AI safety, and (2) it tries to do assurance, taking a very worst case approach. Maybe you're saying that people should do those things more, but this seems way more contentious and I'd probably just straightforwardly disagree with the strength of your recommendation (though probably not its direction).

Nitpick: I would also quibble with your definitions; under your definitions, control seems like a subset of alignment (the one exception is if you notice the model is scheming and then simply stop using AI). I think you really have to define alignment as models reliably doing what you want independent of the surrounding context, or talk about "trying to do what you want" (which only makes sense when applied to models, so has similar upshots).

Tbc I like control and think more effort should be put into it; I just disagree with the strength of the recommendation here.

Comment by Rohin Shah (rohinmshah) on [AN #140]: Theoretical models that predict scaling laws · 2024-10-01T21:16:39.996Z · LW · GW

I think this is referring to ∇θL(xtrain)=0, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that's where the miscommunication is stemming from

Ah, yup, that's the issue, and I agree you're correct that this is the relevant thing here. I'll edit the post to say I'm no longer sure about the claim. (I don't have the time to understand how this lines up with the actual paper -- I remember it being kind of sparse and not trivial to follow -- perhaps you could look into it and leave a comment here.)

Comment by Rohin Shah (rohinmshah) on [AN #140]: Theoretical models that predict scaling laws · 2024-09-25T20:12:50.138Z · LW · GW

Mathematically, the Taylor expansion is:

L(x) = L(xtrain) + ∇L(xtrain)·(x − xtrain) + O(‖x − xtrain‖²)

And then we have L(xtrain) = 0 and also ∇L(xtrain) = 0. (This does assume a "sufficiently nice" loss function, that is satisfied by most loss functions used in practice.)

I agree L(x) is not zero. I also agree if you take some point in between x and xtrain it can have non-zero loss, e.g. L((x + xtrain)/2) need not be zero. I'm not sure if either of these are what you're trying to say, but in any case they aren't relevant to the quoted sentence.

If you are claiming ∇L(xtrain) ≠ 0, then I disagree and am unclear on how your arguments are supposed to establish that.

Comment by Rohin Shah (rohinmshah) on Estimating Tail Risk in Neural Networks · 2024-09-24T10:26:23.825Z · LW · GW

Fwiw the thing that scares me is the combination of three things:

  1. Producing an accurate estimate requires "understanding" the input
  2. Our standard for the quality of estimates will be very high
  3. The approach is purely algorithmic (rather than e.g. using the flexible intelligence of LLMs / humans)

You presumably agree with (1) and (3). For (2), it seems like there's a lot of correlations you need to take into account for this to produce estimates that aren't wildly off:

  1. Underestimates of risk: These are cases where the presumption of independence neglects correlations between things that happen for catastrophe-causing events. For example, maybe a model is planning to defect when RSA-2048 is factored, and it has N different consistency checks it runs to make sure we aren't fooling it about whether it has been factored. The method has to see that there's a common cause for all the consistency checks to return 1, otherwise it will underestimate risk by a factor exponential in N (see the toy numbers after this list). (And tbc I think even separately from more exotic examples like RSA-2048, it will be relatively common for this pattern to arise with networks, e.g. it happens any time that in reality there is a common cause that produces many pieces of evidence, and then the network aggregates those pieces of evidence to infer the common cause.)
  2. Overestimates of risk: These are cases where the presumption of independence neglects correlations that rule out catastrophe, and so lead to overestimates. For example, the model might be willing to make up fake instructions for creating bioweapons to help with scifi writing, but never provide real instructions. So the method needs to capture the "understanding" of which instructions are fake vs real.
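
As a toy illustration of the exponential-in-N underestimate (numbers made up):

```python
# Toy numbers (made up): suppose each of N consistency checks fires exactly
# when RSA-2048 has been factored, an event of probability q. Each check
# individually fires with probability q, and all N fire together with
# probability q -- but treating the checks as independent estimates q**N.

N, q = 10, 0.01
true_prob = q                       # one common cause drives every check
independence_estimate = q ** N      # underestimates by a factor of (1/q)**(N-1)
print(true_prob, independence_estimate, true_prob / independence_estimate)
```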

I agree this isn't a proof of impossibility, since a purely algorithmic approach (SGD) produced the "understanding" in the first place, so in theory a purely algorithmic approach could still capture all that understanding to produce accurate estimates. But it does seem heuristically like you should assign a fairly low probability that this pans out.

Comment by Rohin Shah (rohinmshah) on Estimating Tail Risk in Neural Networks · 2024-09-22T13:54:06.505Z · LW · GW

A few questions:

  • The literature review is very strange to me. Where is the section on certified robustness against epsilon-ball adversarial examples? The techniques used in that literature (e.g. interval propagation) are nearly identical to what you discuss here. (A minimal sketch of interval propagation appears after this list.)
  • Relatedly, what's the source of hope for these kinds of methods outperforming adversarial training? My sense from the certified defenses literature is that the estimates they produce are very weak, because of the problems with failing to model all the information in activations. (Note I'm not sure how weak the estimates actually are, since they usually report fraction of inputs which could be certified robust, rather than an estimate of the probability that a sampled input will cause a misclassification, which would be more analogous to your setting.)
  • If your catastrophe detector involves a weak model running many many inferences, then it seems like the total number of layers is vastly larger than the number of layers in M, which seems like it will exacerbate the problems above by a lot. Any ideas for dealing with this? 
  • What's your proposal for the distribution used in Method 2 (independent linear features)?
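
For reference, here is a minimal sketch of the interval propagation mentioned in the first bullet, applied to a single linear layer + ReLU; the weights and inputs are random placeholders, not anything from the post.

```python
import numpy as np

# Minimal interval bound propagation (the certified-robustness technique
# referenced above) through one linear layer + ReLU.

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)   # placeholder layer
x, eps = rng.normal(size=3), 0.1                     # epsilon-ball around input x

lo, hi = x - eps, x + eps                            # input interval
W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
z_lo = W_pos @ lo + W_neg @ hi + b                   # lower bound on pre-activations
z_hi = W_pos @ hi + W_neg @ lo + b                   # upper bound on pre-activations
a_lo, a_hi = np.maximum(z_lo, 0), np.maximum(z_hi, 0)  # ReLU is monotone

print(a_lo, a_hi)  # guaranteed bounds on the layer's outputs over the ball
```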

This suggests that we must model the entire distribution of activations simultaneously, instead of modeling each individual layer.

  • Why think this is a cost you can pay? Even if we ignore the existence of C and just focus on M, and we just require modeling the correlations between any pair of layers (which of course can be broken by higher-order correlations), that is still quadratic in the number of parameters of M and so has a cost similar to training M in the first place. In practice I would assume it is a much higher cost (not least because C is so much larger than M).

Comment by Rohin Shah (rohinmshah) on Showing SAE Latents Are Not Atomic Using Meta-SAEs · 2024-09-22T07:52:57.865Z · LW · GW

Suppose you trained a regular SAE in the normal way with a dictionary size of 2304. Do you expect the latents to be systematically different from the ones in your meta-SAE?

For example, here's one systematic difference. The regular SAE is optimized to reconstruct activations uniformly sampled from your token dataset. The meta-SAE is optimized to reconstruct decoder vectors, which in turn were optimized to reconstruct activations from the token dataset -- however, different decoder vectors have different frequencies of firing in the token dataset, so uniform over decoder vectors != uniform over token dataset. This means that, relative to the regular SAE, the meta-SAE will tend to have less precise / granular latents for concepts that occur frequently in the token dataset, and more precise / granular latents for concepts that occur rarely in the token dataset (but are frequent enough that they are represented in the set of decoder vectors).

It's not totally clear which of these is "better" or more "fundamental", though if you're trying to optimize reconstructed loss, you should expect the regular SAE to do better based on this systematic difference.

(You could of course change the training for the meta-SAE to decrease this systematic difference, e.g. by sampling from the decoder vectors in proportion to their average magnitude over the token dataset, instead of sampling uniformly.)
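
A sketch of that reweighting, assuming "average magnitude" means each latent's mean activation magnitude over the token dataset (all arrays below are made-up placeholders):

```python
import numpy as np

# Rough sketch: sample decoder vectors for meta-SAE training in proportion to
# each latent's average activation magnitude over the token dataset, instead
# of uniformly. Both arrays are placeholders, not real SAE weights.

rng = np.random.default_rng(0)
decoder_vectors = rng.normal(size=(4096, 512))  # one row per SAE latent
avg_act_magnitude = rng.exponential(size=4096)  # mean |activation| per latent over tokens

probs = avg_act_magnitude / avg_act_magnitude.sum()
idx = rng.choice(len(decoder_vectors), size=1024, p=probs)
meta_sae_batch = decoder_vectors[idx]           # training batch for the meta-SAE
```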

Comment by Rohin Shah (rohinmshah) on My AI Model Delta Compared To Christiano · 2024-09-15T15:27:42.548Z · LW · GW

The claim is that verification is easier than generation. This post considers a completely different claim, that "verification is easy", e.g.

How does the ease-of-verification delta propagate to AI?

if I apply the “verification is generally easy” delta to my models, then delegating alignment work to AI makes total sense.

if I apply a “verification is generally easy” delta, then I expect the world to generally contain far less low-hanging fruit

I just don't care much if the refrigerator or keyboard or tupperware or whatever might be bad in non-obvious ways that we failed to verify, unless you also argue that it would be easier to create better versions from scratch than to notice the flaws.

Now to be fair, maybe Paul and I are just fooling ourselves, and really all of our intuitions come from "verification is easy", which John gestures at:

He’s sometimes summarized this as “verification is easier than generation”, but I think his underlying intuition is somewhat stronger than that.

But I don't think "verification is easy" matters much to my views. Re: the three things you mention:

  • From my perspective (and Paul's) the air conditioning thing had very little bearing on alignment.
  • In principle I could see myself thinking bureaucracies are terrible given sufficient difficulty-of-verification. But like, most of my reasoning here is just looking at the world and noticing large bureaucracies often do better (see e.g. comments here). Note I am not saying large human bureaucracies don't have obvious, easily-fixable problems -- just that, in practice, they often do better than small orgs.
    • Separately, from an alignment perspective, I don't care much what human bureaucracies look like, since they are very disanalogous to AI bureaucracies.
  • If you take AI progress as exogenous (i.e. you can't affect it), outsourcing safety is a straightforward consequence of (a) not-super-discontinuous progress (sometimes called "slow takeoff") and (b) expecting new problems as capability increases.
    • Once you get to AIs that are 2x smarter than you, and have to align the AIs that are going to be 4x smarter than you, it seems like either (a) you've failed to align the 2x AIs (in which case further human-only research seems unlikely to change much, so it doesn't change much if you outsource to the AIs and they defect) or (b) you have aligned the 2x AIs (in which case your odds for future AIs are surely better if you use the 2x AIs to do more alignment research).
    • Obviously "how hard is verification" has implications for whether you work on slowing AI progress, but this doesn't seem central.

There's lots of complications I haven't discussed but I really don't think "verification is easy" ends up mattering very much to any of them.

Comment by Rohin Shah (rohinmshah) on [AN #140]: Theoretical models that predict scaling laws · 2024-09-03T20:46:37.403Z · LW · GW

Say that at dataset size D, the distance between points is ε. Now consider a new distance ε′ -- what is the corresponding D′ we need?

Intuitively, for each factor of 2 that ε′ is smaller than ε (which we can quantify as log2(ε/ε′)), we need to multiply D by another factor of 2^d (where d is the dimension of the data manifold).

So D′ = D · 2^(d·log2(ε/ε′)) = D · (ε/ε′)^d.

That is, the distance scales as D^(−1/d).
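
A quick numerical check of that scaling (uniform points in the unit cube; a toy sketch, not anything from the original post):

```python
import numpy as np

# Check that typical nearest-neighbor distance scales roughly as D**(-1/d)
# for D points sampled uniformly in the d-dimensional unit cube.

def mean_nn_distance(D, d, rng):
    pts = rng.random((D, d))
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # ignore distance to self
    return dists.min(axis=1).mean()

rng = np.random.default_rng(0)
d = 4
for D in (250, 500, 1000):
    # The two columns should shrink by roughly the same factor as D grows.
    print(D, mean_nn_distance(D, d, rng), D ** (-1 / d))
```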

Comment by Rohin Shah (rohinmshah) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-09-03T15:10:46.786Z · LW · GW

Google DeepMind does lots of work on safety practice, mostly by other teams. For example, Gemini Safety (mentioned briefly in the post) does a lot of automated red teaming. The AGI Safety & Alignment team has also contributed to safety practice work. GDM usually doesn't publish about that work, mainly because the work here is primarily about doing all the operational work necessary to translate existing research techniques into practice, which doesn't really lend itself to paper publications.

I disagree that the AGI safety team should have 4 as its "bread and butter". The majority of work needed to do safety in practice has little relevance to the typical problems tackled by AGI safety, especially misalignment. There certainly is some overlap, but in practice I would guess that a focus solely on 4 would cause around an order of magnitude slowdown in research progress. I do think it is worth doing to some extent from an AGI safety perspective, because of (1) the empirical feedback loops it provides, which can identify problems you would not have thought of otherwise, and (2) at some point we will have to put our research into practice, and it's good to get some experience with that. But at least while models are still not that capable, I would not want it to be the main thing we do.

A couple of more minor points:

  • I still basically believe the story from the 6-year-old debate theory, and see our recent work as telling us what we need to do on the journey to making our empirical work better match the theory. So I do disagree fairly strongly with the approach of "just hill climb on what works" -- I think theory gives us strong reasons to continue working on debate.
  • It's not clear to me where empirical work for future problems would fit in your categorization (e.g. the empirical debate work). Is it "safety theory"? Imo this is an important category because it can get you a lot of the benefits of empirical feedback loops, without losing the focus on AGI safety.

Comment by Rohin Shah (rohinmshah) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-09-02T21:54:57.271Z · LW · GW

It clearly can't be having a large effect, since the accuracies aren't near-100% for any of the methods. I agree leakage would have some effect. The mechanism you suggest is plausible, but it can't be the primary cause of the finding that debate doesn't have an advantage -- since accuracies aren't near-100% we know there are some cases the model hasn't memorized, so the mechanism you suggest doesn't apply to those inputs.

More generally, all sorts of things have systematic undesired effects on our results, aka biases. E.g. I suspect the prompts are a bigger deal. Basically any empirical paper will be subject to the critique that aspects of the setup introduce biases.

Comment by Rohin Shah (rohinmshah) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-30T21:11:26.896Z · LW · GW

I don't know for sure, but I doubt we checked that in any depth. It would be quite hard to do, and doesn't seem that important for our purposes, since we're comparing different post-training algorithms (so pretraining data leakage would affect all of them, hopefully to similar extents).

Comment by Rohin Shah (rohinmshah) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-27T12:45:41.997Z · LW · GW

Oh I see. The main reason we're training weak LLMs as judges right now is because it lets us iterate faster on our research (relative to using human judges). But we're imagining having human judges when aligning a model in practice.

(To be clear, I could imagine that we use LLMs as judges even when aligning a model in practice, but we would want to see significantly more validation of the LLM judges first.)

Comment by Rohin Shah (rohinmshah) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-27T07:41:45.912Z · LW · GW

The goal with debate is to scale to situations where the debaters are much more capable than the judge, see AI safety via debate for discussion of why this seems plausible.

Comment by Rohin Shah (rohinmshah) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-26T07:28:58.625Z · LW · GW

I'm not going to repeat all of the literature on debate here, but as brief pointers:

  • Factored cognition discusses intuitively why we can hope to approximate exponentially-sized trees of arguments (which would be tremendously bigger than arguments between people)
  • AI safety via debate makes the same argument for debate (by showing that a polynomial time judge can supervise PSPACE -- PSPACE-complete problems typically involve exponential-sized trees)
  • Cross-examination is discussed here
  • This paper discusses the experiments you'd do to figure out what the human judge should be doing to make debate more effective
  • The comments on this post discuss several reasons not to anchor to human institutions. There are even more reasons not to anchor to disagreements between people, but a short search didn't turn up a place where they've been written up. Most centrally, disagreements between people tend to focus on getting each person to understand the other's position, but the theoretical story for debate does not require this.

(Also, the "arbitrary amounts of time and arbitrary amounts of explanation" was pretty central to my claim; human disagreements are way more bounded than that.)

Comment by Rohin Shah (rohinmshah) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-25T18:18:22.459Z · LW · GW

I do, but more importantly, I want to disallow the judge understanding all the concepts here.

I think I don't actually care about being robust to this assumption. Generally I think of arbitrarily-scalable-debate as depending on a universality assumption (which in turn would rule out "the judge can never understand the concepts"). But even if the universality assumption is false, it wouldn't bother me much; I don't expect such a huge gap between debaters and judges that the judge simply can't understand the debaters' concepts, even given arbitrary amounts of time and arbitrary amounts of explanation from the debaters. (Importantly, I would want to bootstrap alignment, to keep the gaps between debaters and the judge relatively small.)

"The honest strategy"? If you have that, you can just ask it and not bother with the debate. If the problem is distinguishing it, and only dishonest actors are changing their answers based on the provided situation, you can just use that info. But why are you assuming you have an "honest strategy" available here?

The general structure of a debate theorem is: if you set up the game in such-and-such way, then a strategy that simply answers honestly will dominate any other strategy.

So in this particular case I am saying: if you penalize debaters that are inconsistent under cross-examination, you are giving an advantage to any debater that implements an honest strategy, and so you should expect training to incentivize honesty.
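
As a toy illustration of that incentive (my own sketch, not from any of the debate papers; the strategies and numbers below are hypothetical): a debater whose answers depend only on its world model incurs no consistency penalty under cross-examination, while one that tailors its answers to the current context gets penalized roughly half the time.

```python
import random

# Toy illustration (hypothetical): a debater is asked the same question in two
# independently drawn contexts; inconsistent answers incur a penalty.

TRUTH = {"perpetual_motion_possible": False}

def honest(question, context):
    # Answers from a fixed world model, independent of context.
    return TRUTH[question]

def dishonest(question, context):
    # Says whatever is locally convenient: claims the machine works
    # exactly when it happens to be arguing for the inventor.
    return context["arguing_for_inventor"]

def cross_examination_penalty(debater, question, trials=10_000):
    penalties = 0
    for _ in range(trials):
        ctx_a = {"arguing_for_inventor": random.random() < 0.5}
        ctx_b = {"arguing_for_inventor": random.random() < 0.5}
        if debater(question, ctx_a) != debater(question, ctx_b):
            penalties += 1
    return penalties / trials

print(cross_examination_penalty(honest, "perpetual_motion_possible"))     # ~0.0
print(cross_examination_penalty(dishonest, "perpetual_motion_possible"))  # ~0.5
```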

Comment by Rohin Shah (rohinmshah) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-25T11:47:19.941Z · LW · GW

Making that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, Noether's theorem, etc - which in this example, the judge doesn't understand. (Without such abstractions, you'd need to consider every possible type of machine separately, which isn't feasible.)

I agree, but I don't see why that matters. As I mentioned, a main point of debate is to produce good oversight of claims without giving the judge an understanding of those claims. In this example I would imagine that you decompose the argument as:

  1. A fundamental law of physics is conservation of energy: energy can neither be created nor destroyed, only transformed from one form to another.
  2. Electricity is a form of energy.
  3. This box does not have an infinite source of energy.
  4. The above three together imply that the box cannot produce infinite electricity.

The inventor can disagree with one or more of these claims, then we sample one of the disagreements, and continue debating that one alone, ignoring all the others. This doesn't mean the judge understands the other claims, just that the judge isn't addressing them when deciding who wins the overall debate.

If we recurse on #1, which I expect you think is the hardest one, then you could have a decomposition like "the principle has been tested many times", "in the tests, confirming evidence outweighs the disconfirming evidence", "there is an overwhelming scientific consensus behind it", "there is significant a priori theoretical support" (assuming that's true), "given the above the reasonable conclusion is to have very high confidence in conservation of energy". Again, find disagreements, sample one, recurse. It seems quite plausible to me that you get down to something fairly concrete relatively quickly.
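
To make the find-disagreements-sample-recurse loop concrete, here is a toy sketch (my own illustration, not from any of the debate papers; the claim tree and judge below are hypothetical stand-ins). The key property is that the judge only ever rules on the single leaf reached by the recursion, however large the full argument tree is.

```python
import random
from dataclasses import dataclass, field

# Toy sketch (hypothetical): each claim either is concrete enough for the judge
# to check directly, or decomposes into subclaims. The opponent disputes some
# subset of subclaims; we sample one disputed subclaim and recurse on it alone.

@dataclass
class Claim:
    text: str
    subclaims: list = field(default_factory=list)  # empty => judge-checkable leaf

def debate(claim, disputed, judge_checks_leaf):
    """Recurse until a leaf; the judge only ever evaluates that single leaf."""
    if not claim.subclaims:
        return judge_checks_leaf(claim)
    contested = [c for c in claim.subclaims if disputed(c)]
    if not contested:
        # The opponent concedes every subclaim, so the claim stands.
        return True
    return debate(random.choice(contested), disputed, judge_checks_leaf)

# Example using the perpetual-motion decomposition from the text.
root = Claim("The box cannot produce infinite electricity", [
    Claim("Energy is conserved", [
        Claim("Such-and-such experiment has such-and-such outcome"),  # leaf
    ]),
    Claim("Electricity is a form of energy"),
    Claim("The box has no infinite energy source"),
])

result = debate(
    root,
    disputed=lambda c: "conserved" in c.text or "experiment" in c.text,
    judge_checks_leaf=lambda c: True,  # stand-in for the boss running the experiment
)
print(result)  # True: the honest side wins along the sampled path
```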

If you want to disallow appeals to authority, on the basis that the correct analogy is to superhuman AIs that know tons of stuff that aren't accepted by any authorities the judge trusts, I still think it's probably doable with a larger debate, but it's harder for me to play out what the debate would look like because I don't know in enough concrete detail the specific reasons why we believe conservation of energy to be true. I might also disagree that we should be thinking about such big gaps between AI and the judge, but that's not central.

The debaters are the same AI with different contexts, so the same is true of both debaters. Am I missing something here?

That seems right, but why is it a problem?

The honest strategy is fine under cross-examination: it will give consistent answers across contexts. Only the dishonest strategy will change its answers (sometimes saying that perpetual energy machines are impossible, sometimes saying that they are possible).

Comment by Rohin Shah (rohinmshah) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-25T08:57:20.887Z · LW · GW

There are several different outs to this example:

  • You should at least be able to argue that the evidence does not support the conclusion, and that the boss should have substantial probability on "the box can make some electricity but not infinitely much".
  • You can recursively decompose the claim "perpetual motion machines are known to be impossible" until you get down to a claim like "such and such experiment should have such and such outcome", which the boss can then perform to determine a winner.
    • This does not mean that the boss then understands why perpetual motion machines are impossible -- an important aspect of debate is that it aims to produce good oversight of claims without giving the judge an understanding of those claims.
    • This particular approach will likely run into the problem of obfuscated arguments though.
  • The debaters are meant to be copies of the same AI, and to receive exactly the same information, with the hope that each knows what the other knows. In the example, this hopefully means that you understand how the inventor is tricking your boss, and you can simply point it out and explain it.
    • If the inventor legitimately believes the box produces infinite electricity, this won't work, but also I consider that out of scope for what debate needs to do. We're in the business of getting the best answer given the AI's knowledge, not the true answer.
    • If both you and the inventor know that the claim is impossible from theory, but don't know the local error that the inventor made, this won't work.
  • You can cross-examine the inventor and show that in other contexts they would agree that perpetual energy machines are impossible. (Roughly speaking, cross-examination = wiping memory and asking a new question.)

The process proposed in the paper

Which paper are you referring to? If you mean doubly efficient debate, then I believe the way doubly efficient debate would be applied here is to argue about what the boss would conclude if he thought about it for a long time.

Comment by Rohin Shah (rohinmshah) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-24T14:41:29.791Z · LW · GW

Strongly agree on the first challenge; on the theory workstream we're thinking about how to deal with this problem. Some past work (not from us) is here and here.

Though to be clear, I don't think the empirical evidence clearly rules out "just making neural networks explainable". Imo, if you wanted to do that, you would do things in the style of debate and prover-verifier games. These ideas just haven't been tried very much yet. I don't think "asking an AI what another AI is doing and doing RLHF on the response" is nearly as good; that is much more likely to lead to persuasive explanations that aren't correct.

I'm not that compelled by the second challenge yet (though I'm not sure I understand what you mean). My main question here is how the AI system knows that X is likely or that X is rare, and why it can't just explain that to the judge. E.g. if I want to argue that it is rare to find snow in Africa, I would point to weather data I can find online, or point to the fact that Africa is mostly near the Equator; I wouldn't try to go to different randomly sampled locations and times in Africa and measure whether or not I found snow there.

Comment by Rohin Shah (rohinmshah) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-22T19:32:19.635Z · LW · GW

It depends fairly significantly on how you draw the boundaries; I think anywhere between 30 and 50 is defensible. (For the growth numbers I chose one specific but arbitrary way of drawing the boundaries; I expect you'd get similar numbers with other choices.) Note this does not include everyone working on safety, e.g. it doesn't include the people working on present-day safety or adversarial robustness.