What's up with "Responsible Scaling Policies"?

habryka4

What's up with "Responsible Scaling Policies"?

post by habryka (habryka4), ryan_greenblatt · 2023-10-29T04:17:07.839Z · LW · GW · 9 comments

  Naming and the Intentions Behind RSPs
  Endorsements and Future Trajectories
  How well can you tell if a given model is existentially dangerous?
  Defense through "Control" vs "Lack of Propensities"
  Summaries
None
9 comments

habryka

I am interested in talking about whether RSPs are good or bad. I feel pretty confused about it, and would appreciate going in-depth with someone on this.

My current feelings about RSPs are roughly shaped like the following:

I feel somewhat hyper-alert and a bit paranoid about terms around AI X-Risk getting redefined, since it feels like a thing that has happened a bunch with "AI Alignment" and is also the kind of thing that happens a lot when you are trying to influence large bureaucratic institutions (see also all the usual Orwell stuff on what governments do here). A good chunk of my concerns about RSPs are specific concerns about the term "Responsible Scaling Policy".

I also feel like there is a disconnect and a bit of a Motte-and-Bailey going on where we have like one real instance of an RSP, in the form of the Anthropic RSP, and then some people from ARC Evals who have I feel like more of a model of some platonic ideal of an RSP, and I feel like they are getting conflated a bunch. Like, I agree that there are things that are kind of like RSPs that could be great, but I feel like the Anthropic RSP in-particular doesn't really have any teeth and so falls a bit flat as the kind of thing that is supposed to help with risk.

ryan_greenblatt

Disclaimer: I don't primarily work on advocacy or policy and it's plausible that if I worked in these areas more, my takes on these topics would update substantially. That said, a large fraction of my work does involve thinking about questions like "What would good safety arguments look like?" and "With the current state of safety technology, what can we do to assess and mitigate AI risk?". (This was added in editing.)

Stuff which is maybe interesting to talk about:

What should anti-takeover/safety advocacy do?
How much can current tech actually reduce risk? Is this a crux?
What would current interventions for avoiding takeover look like?
Does timing pauses matter?
What would a hardcore actually good long run pause look like etc?
Will labs actually follow through etc.
Is it bad for takeover concerns to get lumped in with misuse that maybe isn't that bad? Like I'm not actually very worried about e.g. ASL-3. Or maybe this is great because some countermeasures transfer?
Will there be good RSPs in the future? (Like Anthropic's RSP is basically an RSP IOU on the most important questions. I think this is probably ok-ish.)

ryan_greenblatt

My basic takes on RSPs:

I agree with most of the stuff in https://www.lesswrong.com/posts/dxgEaDrEBkkE96CXr/thoughts-on-responsible-scaling-policies-and-regulation. [LW · GW] Exceptions:
- My baseline guess at risk is somewhat higher than Paul's such that I don't think RSPs get you down to 1% risk
- I'm less bullish on reducing risk by 10x, but maybe 5x seems pretty doable? Idk
- I think good affirmative safety cases based on something like mech interp are quite unlikely in the next 5 years, so such requirements seem close to defacto bans. (I'm not sure if Paul disagrees with me here, but this seems like an important thing to note.)
I agree the term seems bad. Feels like probably a mistake. I also find it amusing that OpenAI is using a different term (RDP)
I'm not that worried about risk from unknown-unknowns, but I do think "AIs messing with your evaluations" is a big problem. I think it's possible to mostly address this by checking if the AIs are capable of messing with your evaluations (which is probably easier than just avoiding them messing with your evaluations).

Naming and the Intentions Behind RSPs

ryan_greenblatt

I think it would have been better if ARC said something like: "Labs should have policies for when they scale up: Scaling Policies (SP). We hope these policies are actually safe and we're happy to consult on what SPs would be responsible and reduce risk".

I would also prefer names like Conditional Safety improving Policy or similar.

habryka

I do really feel like the term "Responsible Scaling Policy" clearly invokes a few things which I think are not true:

How fast you "scale" is the primary thing that matters for acting responsibly with AI
It is clearly possible to scale responsibly (otherwise what would the policy govern)
The default trajectory of an AI research organization should be to continue scaling

James Payor on Twitter has a kind of similar take I resonate with:

I'm here to add that "AGI Scaling Policy" would be a more honest name than "Responsible Scaling Policy".
The "RSP" name implicitly says (a) there's no reason to just stop; (b) it is possible to scale responsibly; (c) the ontology of the policy isn't open to revision; etc.
...and these feel like awful presumptions, seemingly deliberately there to make it harder to coordinate an AI stop. I do otherwise like the content, but I am pretty soured by the presence of this stuff and associated implications.

I also feel kind of confused about what even the definition of an RSP is supposed to be. The ARC Evals post just says:

An RSP specifies what level of AI capabilities an AI developer is prepared to handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities until protective measures improve.

But that doesn't really strike me as an adequate definition. There is of course a lot of stuff that fits this definition that would be weird to call an RSP. Like, the scaling law paper kind of tries to answer these two questions, but clearly doesn't qualify as an RSP.

The above also feels confused in its basic conceptual structure. Like, how can an RSP specify what level of AI capabilities a developer is prepared to handle safety? Reality specifies that for you. Maybe it's supposed to say "it specifies what level of AI capabilities an AI developer thinks they can handle"?

Similarly for the next sentence. How can the RSP specify the conditions under which it would be too dangerous to continue deploying AI systems? Like, again, reality is your only arbiter here. The RSP can specify the conditions under which the AI developer thinks it would be too dangerous, but that also seems weird to put in a policy, since that's the kind of thing that will presumably change a bunch over time.

ryan_greenblatt

I agree that the term "Responsible Scaling Policy" invokes false stuff. I guess this doesn't seem that bad to me? Maybe I'm underestimating the importance of names.

ryan_greenblatt

Another thing that I feel kind of confused about in the space is "where is the definition of an RSP?". The ARC Evals post says:

> An RSP specifies what level of AI capabilities an AI developer is prepared to handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities until protective measures improve.

Yeah, seems like this paragraph would be better if you added the word "claim". Like "An RSP specifies what level of AI capabilities an AI developer is claiming they're prepared to handle safely with their current protective measures, and conditions under which they think it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities until protective measures improve."

The hope is something like:

The AI developer makes an RSP which is making some implicit or explicit claims.
People can then argue this is insufficient for safety now that there is a clear policy to argue with.

habryka

I agree that the term "Responsible Scaling Policy" invokes false stuff. I guess this doesn't seem that bad to me?

Yeah, I do feel like the primary dimension on which I try to judge concepts and abstractions introduced in both AI regulation and AI alignment is whether those concepts invoke true things and carve reality at its joints.

I agree that I could try to do a thing that's more naive consequentialist and be like "Yeah, ok, let's just kind of try to play forward what humanity will do when it runs into these ideas, and when it sees these specific humans saying these words", but like, even then I feel like RSPs don't look that great, because at that level I mostly expect them to land with "Oh, these people say they are being responsible" or something like that.

To be clear, a thing that I like that both the Anthropic RSP and the ARC Evals RSP post point to is basically a series of well-operationalized conditional commitments. One way an RSP could be is to basically be a contract between AI labs and the public that concretely specifies "when X happens, then we commit to do Y", where X is some capability threshold and Y is some pause commitment, with maybe some end condition.

ryan_greenblatt

Concretely, what do you think ARC evals and Anthropic should do? Like should they use a different term?

habryka

Like, instead of an RSP I would much prefer a bunch of frank interviews with Dario and Daniella where someone is like "so you think AGI has a decent chance of killing everyone, then why are you building it?". And in-general to create higher-bandwidth channels that people can use to understand what people at leading AI labs believe about the risks from AI, and when the benefits are worth it.

ryan_greenblatt

It seems pretty important to me to have some sort of written down and maintained policy on "when would we stop increasing the power of our models" and "what safety interventions will we have in place for different power levels".

ryan_greenblatt

TBC, frank interview on the record also seems good.

ryan_greenblatt

Part of why I want policies is that this makes it easier for people inside the org to whistleblow or to object to stuff internally.

I also think that it would be better if more people thought more concretely about what the current plan looks like.

habryka

It seems pretty important to me to have some sort of written down and maintained policy on "when would we stop increasing the power of our models" and "what safety interventions will we have in place for different power levels".

Yeah, to be clear, I think that would also be great. And I do feel pretty sold on "AI capability companies should have an up-to-date document that outlines their scaling and safety plans". I do feel like a lot of conversations would clearly go better in worlds where that kind of plan was available.

ryan_greenblatt

It seems like we maybe have mild but not huge disagreements on the actual artifacts we wish labs would provide to the world which explain their plans?

I generally think more honest clear communication and specific plans on everything seems pretty good on current margins (e.g., RSPs, clear statements on risk, clear explanation of why labs are doing things that they know are risky [LW · GW], detailed discussion with skeptics, etc).

habryka

I do feel like there is a substantial tension here between two different types of artifacts here:

A document that is supposed to accurately summarize what decisions the organization is expecting to make in different circumstances
A document that is supposed to bind the organization to make certain decisions in certain circumstances

Like, the current vibe that I am getting is that RSPs are a "no-take-backsies" kind of thing. You don't get to publish an RSP saying "yeah, we aren't planning to scale" and then later on to be like "oops, I changed my mind, we are actually going to go full throttle".

And my guess is this is the primary reason why I expect organizations to not really commit to anything real in their RSPs and for them to not really capture what leadership of an organization thinks the tradeoffs are. Like, that's why the Anthropic RSP has a big IOU where the actually most crucial decisions are supposed to be.

ryan_greenblatt

Oh, gotcha, some difference between binding policies and explanations.

I agree that RSPs make it relatively more costly to change things in a way which reduces safety. This seems like mostly a benefit from my perspective? It also seems fine for RSPs to have some sections which are like "our tenative best guess at what we should do" and some sections which are like "our current policy". (For example, the Anthropic RSP has a tenative guess for ASL-4, but more solid policies elsewhere.)q

ryan_greenblatt

Maybe one interesting question is "how likely is Anthropic to flesh out ASL-4 evaluations and safety interventions prior to ASL-3 being triggered"? (Which would pay off the IOU.)

Or flesh it out in the next 2 years conditional on ASL-3 not being triggered in those two years.

I'm moderately optimistic that this happens. (Maybe 60% on ASL-4 evaluations and some specific-sounding-to-me ASL-4 interventions in 2 years conditional on no ASL-3)

habryka

Like, here is an alternative to "RSP"s. Call them "Conditional Pause Commitments" (CPC if you are into acronyms).

Basically, we just ask AGI companies to tell us under what conditions they will stop scaling or stop otherwise trying to develop AGI. And then also some conditions under which they would resume. Then we can critique those.

This seems like a much clearer abstraction that's less philosophically opinionated about whether the thing is trying to be an accurate map of an organization's future decisions, or to what degree it's supposed to seriously commit an organization, or whether the whole thing is "responsible".

ryan_greenblatt

Like, here is an alternative to "RSP"s. Call them "Conditional Pause Commitments" (CPC if you are into acronyms).

I agree CPC is a better term and concept. I think this is basically what RSPs are trying to be?

But it seems worth noting that countermeasures are a key part of the picture. Like the conditions under which you would pause might depend on what you have implemented so far etc. I feel like CPC doesn't naturally evoke the countermeasures part, but it seems like overall a better term. (We could add this obviously, but then the name is terrible: Condition Pause Commitments including Countermeasures CPCC)

habryka

Maybe one interesting question is "how likely is anthropic to flesh out ASL-4 evaluations and safety interventions prior to ASL-3 being triggered"? (Which would pay off the IOU.)
(Or flesh it out in the next 2 years conditional on ASL-3 not being triggered.)

I'm moderately optimistic that this happens. (Maybe 60% on ASL-4 evaluations and some specific-sounding-to-me ASL-4 interventions in 2 years conditional on no ASL-3)

Yeah, I agree that this is an interesting question, and I roughly agree with your probability here.

I do think the world where they don't follow it up is really bad though. I really don't like the 40% of worlds where it turns out the RSP was the kind of thing that caused a bunch of people to get off of Anthropic's back about building really dangerous AI, and is the kind of thing that causes tons of leaders in AI Safety to support them, and then the thing just never really materializes.

And like, I feel like this isn't the first time where I got a bad vibe from some Anthropic announcements here. Like they also have their long term benefit trust, and it really looks like the kind of thing that's supposed to ensure independence of Anthropic from profit incentives, but like, buried in like one sentence at the end of a random paragraph in the middle is the bomb-shell that actually the governance committee doesn't really have the ability to stop Anthropic as long as some supermajority of shareholders disagree, and they don't specify the actual concrete numbers for the supermajority, and then I feel like the whole thing just falls flat for me.

Like, without giving me that concrete number, this commitment just doesn't really have much teeth.

ryan_greenblatt

I'm pretty worried about flaky or shitty RSPs as well as RSP IOUs which never happen. I feel like the risk of this is considerably higher at OpenAI or GDM than Anthropic.

We'll see what OpenAIs first RDP looks like. And we'll see if GDM does anything official.

Endorsements and Future Trajectories

ryan_greenblatt

I do think the world where they don't follow it up is really bad though. Like I really don't like the 40% of worlds where it turns out the RSP was the kind of thing that caused a bunch of people to get off of Anthropic's back about building really dangerous AI, and is the kind of thing that causes tons of leaders in AI Safety to support them, and then the thing just never really materializes.

I think people should keep pressuring them some until they have more fleshed out ASL-4 evaluations and commitments.

I guess I feel like we can withhold full endorsement to some extent which might reduce the risk of this sort of outcome.

ryan_greenblatt

I feel like the things I want to be advocating for are stuff like:

ASL-4 commitments and evaluations from anthropic (can be preliminary etc)
Some RSP like thing from other labs. It would still be progress even if I think the policy is wildly unsafe, then we can at least have arguments about the quality of this concrete policy.
Misc improvements to the RSP like things.

habryka

I do think I am pretty seriously worried about a bunch of stuff in the "capture of the AI Safety field by AGI companies" space. Like, a majority of people working in AI Safety are already employed by AGI companies. I feel like I don't have that many more years of continuing in the status quo before my ability to actually withdraw endorsement is basically fully eroded.

So a bunch of my reaction to the Anthropic RSP is a worry that if I don't object now, the field won't be in a position to object later when it becomes more clear this thing isn't serious (in worlds where this is the case), because by then so many people's careers and reputations and ability to contribute are directly contingent on getting along well with the capability companies that there won't be much hope for some kind of withdrawal of endorsement.

ryan_greenblatt

Closely related thing I want: people working at AI labs should think through and write down the conditions under which they would loudly quit (this can be done privately, but maybe should be shared between like minded employees for common knowledge). Then, people can hopefully avoid getting frog-boiled.

I feel concerned about the possibility that people avoid quitting when they really should due to thoughts like "I really don't want to quit because then my influence would evaporate and I should just try influence things to be a bit better. Also, XYZ promised that stuff would be better if I just wait a bit." Like there are probably some pretty manipulative people at various AI labs.

habryka

people working at AI labs should think through and write down the conditions under which they would loudly quit (this can be done privately, but maybe should be shared between like minded employees for common knowledge). Then, people can hopefully avoid getting frog-boiled.

Yeah, I do think this would be great as well.

ryan_greenblatt

I'm also onboard with "it's scary that tons of the safety people are working at labs and this trend seems likely to continue".

ryan_greenblatt

So a bunch of my reaction to the Anthropic RSP is a worry that if I don't object now, the field won't be in a position to object later when it becomes more clear this thing isn't serious (in worlds where this is the case), because by then so many people's careers and reputations and ability to contribute are directly contingent on getting along well with the capability companies that there won't be much hope for some kind of withdrawal of endorsement.

Why don't you think you can withhold endorsement now and say "I'm waiting for ASL-4 standards and evaluations"? (Idk if that important to get into.)

habryka

Well, I can, but I kind of expect that over the course of the next few years my opinion will end up not really having much weight left, on the current default social trajectory. So like, I can withhold my endorsement, but I expect the balance of whose endorsement matters to shift towards the people who are not withholding endorsement.

ryan_greenblatt

Ok, understood. I guess I do expect that you'll probably lose influence over time for various reasons. I'm not sure what you should do about this.

How well can you tell if a given model is existentially dangerous?

habryka

I do think I want to back up a bit and talk about a different aspect of RSPs that keeps coming up for me. I don't have a great handle for this, but maybe the best way to describe it is to be like "Man, RSPs sure really make it sound like we have a great handle on when AGI will become dangerous" or something.

Like, a concern I have about both the hypothetical "Conditional Pause Commitments" and RSPs is that you end up in a state where you are committed to allowing some AGI company to scale if they just goodheart some random safety benchmark that you gave them. The reality is that identifying whether a system is safe, or whether it is (for example) deceptively aligned, is extremely hard, and we don't really have a ton of traction on this problem, and so I don't really expect that we will actually have any great suggestions for safety benchmarks that we feel confident can withstand billions to trillions of dollars of economic pressure and substantially smarter AI systems.

ryan_greenblatt

I do think I want to back up a bit and talk about a different aspect of RSPs that keeps coming up for me. I don't have a great handle for this, but maybe the best way to describe it is to be like "Man, RSPs sure really make it sound like we have a great handle on when AGI will become dangerous" or something.

I think that we do have a reasonable handle on how to know if a given deployment of a given model is existentially dangerous (and how to run this sort of evaluation without the evaluation itself being too dangerous to run).

I feel less confident in how to do this sort of evaluation for when it's safe to open-source models because things like scaffolding or fine-tuning might progress and make these open sourced models dangerous.

My level of confidence here isn't overwhelming, but I think we can evaluate when doom starts being >1% without being extremely conservative. Maybe 0.1% is harder.

I mean something like "If you keep scaling GPT by 0.25 GPTs, can you make an evaluation such that at the point when the evaluation indicates danger, doom is <1%".

And I think this probably transfers fine to other ML based approaches.

habryka

Huh, I am a bit surprised by this. I believe that you can do this with a "randomly sampled" system, but I don't think you can do this with an adversarially selected system. Like, I agree that you can probably roughly identify the scale of compute and resources (assuming no huge algorithmic breakthroughs and RSI and a bunch of other stuff) where things get dangerous, but like, the whole point of an RSP is to then have the AGI companies try to get past some benchmarks you set about what they need to do to get further, and this is the part that seems hard to me.

(Sidenote: This mirrors a general type of conversation I tend to have a lot around AI Alignment research these days, where there is a lot of scary demos, interpretability, and evals work whose goal it is to determine whether an AI is unaligned, and my usual response is "ok, I am quite confident you will get tons of evidence that AI systems are not aligned with you within the next few years. My primary question is what you will actually do as soon as you have identified a system as unaligned or dangerous in this way" and then people surprisingly frequently don't really have an answer)

ryan_greenblatt

Oh, ok, I agree that the benchmarks for when to move forward are very tricky.

My claim is more like "reasonably good ASL-3 and ASL-4 evals should be possible" (including them being robust to the possibility that deceptively aligned AI systems might try to sandbag on these evals)

habryka

What was the definition of ASL-3 and ASL-4 again?

ryan_greenblatt

I think roughly:

ASL-3: maybe dangerous from a misuse perspective, probably not autonomously dangerous (the eval for >=ASL-3 is operationalized in terms of ARA (autonomous replication) or serious misuse risk)

ASL-4: could seriously do ARA (autonomous replication) and might risk of escape or other serious shit.

habryka

My claim is more like "reasonably good ASL-3 and ASL-4 evals should be possible"

Where by "evals" you mean something like "evals on when we should classify an AI system as ASL-3, ASL-4 or even higher"?

Or do you mean that we might be able to make evals such that even when multiple billions of dollars are spent on matching that eval, that the eval would tell us whether a given ASL-3 or ASL-4 system could be safely more widely deployed? (like, maybe made open-source for ASL-3 or be deployed at all for ASL-4)

ryan_greenblatt

I'm talking about evals to classify and rule out >=ASL-4, not the evals to indicate such systems are safe.

I'm also not claiming these evals would necessarily be robust to heavy optimization pressure to make the AI look dumb to this eval while being smart.

I haven't thought much about this; this does seems like a substantial concern to me.

ryan_greenblatt

I think we should probably talk about "what safety arguments will/can labs make for ASL-4+ models"?

habryka

Where by "safety arguments" you mean "arguments that the relevant models are safe to develop/use/deploy/distribute?"

ryan_greenblatt

Yep, subject to some protocol which might e.g. restrict usage.

habryka

Yeah, seems good. This does seem pretty important, and as someone who at least right now has thought a lot more about this than I expect labs to think about this, my answer sure is something like "man, do I have no idea".

To be clear, there are clearly some training setups that seem more dangerous than other training setups, but I don't really know how to think about that in the case of safety arguments, and even then I feel quite deeply confused.

Like, as an example, my guess is systems where a substantial chunk of the compute was spent on training with reinforcement learning in environments that reward long-term planning and agentic resource acquisition (e.g. many video games or diplomacy or various simulations with long-term objectives) sure seem more dangerous.

But I think I only believe that with like 75% probability? Like I really wouldn't be surprised if at the end of the day I come to believe that ultimately what matters for a system to be dangerous is just the amount of compute and the richness of the data you threw into it, and all the arguments for instrumental convergence already totally apply to a large language model with maybe a bit of RLHF sprinkled on top, and in that case the systems trained with reinforcement learning on rich environments might actually be safer because a bunch of the compute was used much less efficiently and you might be more able to elicit the long-term planning and leverage it to do useful things with it.

And like, I don't have a ton of traction on reducing that probability distribution much further, besides maybe "just do a ton more AI Alignment research, poke the hell out of current models, think really hard, write a lot of blogposts, etc.". And so I have a hard time imagining an AGI company actually handing me a safety argument that meaningfully shifts probability mass for me here given our current level of understanding.

Defense through "Control" vs "Lack of Propensities"

ryan_greenblatt

Here's my basic state on making safety arguments.

I think safety arguments could have two (partially overlapping) “lines of defense”:

Control: The AI isn’t capable of doing seriously dangerous things (without getting caught first) given our countermeasures.
Propensities: The AI isn’t likely to try doing seriously dangerous things.

And my overall takes on the safety usefulness tradeoffs for current science:

Our technology is very far from being able to confidently evaluate the propensities of powerful AI systems.
Control is likely possible to evaluate and also ensure for early transformative AI (e.g. models which can speed up humans doing ML R&D by 30x)
We can at least test somewhat propensities using things like model organisms [AF · GW] and adversarial evaluation (trying to produce synthetic inputs on which the AI does something very bad).
Like I think "we tried hard to adversarially evaluate the model and it didn't do anything very suspicious" is non-trivial evidence, though not amazing.
And I think "we tried to create deceptive alignment model organisms, but our best model organisms are pretty weak sauce" is also some evidence.
I think there is like a 70% chance that we'll be able establish good control arguments for early transformative AI which reduce risk by about 3-5x while making models only somewhat less useful (e.g. 2x less useful)
Obviously, control doesn't scale. (So the safety usefulness tradeoffs get much worse as you scale)
The methodology for evaluating control is non-trivial and complicated. I don't know how to specify the evalutation in a way which can't be gamed.
I think control+model organisms+adversarial evalution could reduce risk by 5-15x and that this isn't that unlikely if we try really hard.

habryka

These three things feel pretty hard to be true at the same time:

Control is likely possible to evaluate and also ensure for early transformative AI (e.g. models which can speed up humans doing ML R&D by 30x)
[...]
Like I think "we tried hard to red team the model and it didn't do anything very suspicious" is non-trivial evidence, though not amazing.
[...]
And I think "we tried to create deceptive alignment model organisms, but our best model organisms are pretty weak sauce" is also some evidence.

Like, it's not impossible, but I do sure expect that you cannot get 30x ML R&D without red-teaming breaking the model pretty badly (at least if red-teaming involves fine-tuning).

Separately, I sure expect you can get pretty solidly deceptively aligned model organisms long before you can speed up ML R&D 30x, and I don't feel like we have a ton of traction on finding some kind of training setup where that isn't the case anymore.

ryan_greenblatt

Like, as an example, my guess is systems where a substantial chunk of the compute was spent on training with reinforcement learning in environments that reward long-term planning and agentic resource acquisition (e.g. many video games or diplomacy or various simulations with long-term objectives) sure seem more dangerous

Yeah, so one type of thing we can do is "make guesses about tweaks to training which would make dangerous propensities less likely". I don't think there are a ton of good ideas here and I expect that the safety usefulness tradeoffs from this don't seem amazing. (So I agree with you here, I just think there are other avenues.)

Like, it's not impossible, but I do sure expect that you cannot get 30x ML R&D without red-teaming breaking the model pretty badly (at least if red-teaming involves fine-tuning).

I should have stated what I meant more clearly here. I mean "adversarial evaluation (trying to produce synthetic inputs on which the AI does something very bad)"

We could get into why I think this is non-trivial evidence.

habryka

Ok, I have to head out in 5 minutes, so let's try to wrap up.

I do feel like a major disagreement here is probably something like "how optimistic am I about the control route", which I know that you and Buck have been thinking about a lot. I have definitely mostly been thinking about AI Alignment through the propensity lens, since I've been looking for ways to scale the the methods to superintelligence, so I don't have super robust opinions here, but I do have a sense that this is going to be much harder than my model of you thinks it is.

ryan_greenblatt

FYI, I'm intentionally avoiding the term "alignment" because that term now has a bunch of baggage. So I'd like to just say safety arguments and then talk more specifically.

habryka

That seems very reasonable.

ryan_greenblatt

Yep, worth noting I haven't yet argued for my level of optimism. So we'd maybe want to do that next if we want to continue.

Summaries

habryka

Summarizing what we covered in this dialogue (feel free to object to any of it):

A substantial chunk of my objections were structured around the naming and definitions of RSPs. You agreed that those didn't seem super truth-promoting, but also thought the costs of that didn't seem high enough to make RSPs a bad idea.

We then covered the problem of the Anthropic RSP, as the primary example of an RSP, having kind of a big IOU shaped hole in it. I was concerned this would allow Anthropic to avoid accountability and that people who would hold them accountable wouldn't have enough social buy-in by the time the IOU came due. You agreed this was a reasonable concern (though my guess is you also think that there would still be people who would hold them accountable in the future, more so than I do, or at least you seem less concerned about it).

I then switched gears and started a conversation about the degree to which we are even capable of defining evals that meaningfully tell us whether a system is safe to deploy. We both agreed this is pretty hard, especially when it comes to auditing the degree to which the propensities and motivations of a system are aligned with us. You however think that if we take it as a given that a system has motivations that are unaligned with us, then we still have a decent chance of catching when that system might be dangerous, and using those evals we might be able to inch towards a world where we can make substantially faster AI Alignment progress leveraging those systems. This seemed unlikely to me, but we didn't have time to go into it.

Does this seem overall right to you? I am still excited about continuing this conversation and maybe digging into the "Alignment vs. Control" dimension, but seems fine to do that in a follow-up dialogue after we published this one.

ryan_greenblatt

Yep, this seems like a reasonable summary to me.

One quick note:

(though my guess is you also think that there would still be people who would hold them accountable in the future, more so than I do, or at least you seem less concerned about it)

Yep, I think that various people will be influential in the future and will hold Anthropic accountable. In particular:

Anthropic's Long-Term Benefit Trust (LTBT) (who approves RSP changes I think?)
People at ARC evals
Various people inside Anthropic seem to me to be pretty honestly interested in having an actually good RSP

9 comments

Comments sorted by top scores.

comment by ryan_greenblatt · 2025-01-28T23:16:33.099Z · LW(p) · GW(p)

I think I roughly stand behind my perspective in this dialogue. I feel somewhat more cynical than I did at the time I did this dialogue, perhaps partially due to actual updates from the world and partially because I was trying to argue for the optimistic case here which put me in a somewhat different frame.

Here are some ways my perspective differs now:

I wish I said something like: "AI companies probably won't actually pause unilaterally, so the hope for voluntary RSPs has to be building consensus or helping to motivate developing countermeasures". I don't think I would have disagreed with this statement in the past, or at least I wouldn't have fully disagreed with it and it seems like important context.
I think in practice, we're unlikely to end up with specific tests that are defined in advance and aren't goodhartable or cheatable. I do think that control could in principle be defined in advance and hard to goodhart using external evaluation, but I don't expect companies to commit to specific tests which are hard to goodhart/cheat. They could make procedural commitments for third party review which are hard to cheat. Something like "this third party will review the available evidence (including our safety report and all applicable internal knowledge) and then make a public statement about the level of risk and whether there is important information which should be disclosed to the public" (I could outline this proposal in more detail, it's mostly not my original idea.)
I'm somewhat more interested in companies focusing on things other than safety cases and commitments. Either trying to get evidence of risk that might be convincing to others (in worlds where these risks are large) or working on at-the-margin safety interventions from a cost benefit perspective.

comment by TurnTrout · 2023-10-30T20:01:35.928Z · LW(p) · GW(p)

ok, I am quite confident you will get tons of evidence that AI systems are not aligned with you within the next few years. My primary question is what you will actually do as soon as you have identified a system as unaligned or dangerous in this way

Any operationalizations that people might make predictions on?

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2023-10-31T04:52:50.785Z · LW(p) · GW(p)

The most clear case I can think of is:

Prior to AI escape or takeover, will we catch an AI in production clearly seriously trying to do an egregiously bad action when not at all prompted or directly trained to do so. For example, trying:

to escape
to create bioweapons
to do some action which is clearly part of an attempted takeover (prior two action might not be part of takeover)

(This might not be because of internal motivations well described as deceptive alignment.)

(To be clear, I am not confidently predicting we will see this outcome and the original quote is from habryka.)

Replies from: TurnTrout

↑ comment by TurnTrout · 2023-11-06T22:58:21.440Z · LW(p) · GW(p)

I expect to not see this, conditional on adding a stipulation like "the AI wasn't scaffolded and then given a goal like 'maximize profit'", because I could imagine the AI-system coming up with nasty subgoals. In particular, I don't expect egregiously bad actions from autoregressive sampling of an LLM tasked with doing scientific research.

comment by Zach Stein-Perlman · 2023-10-29T15:16:52.936Z · LW(p) · GW(p)

there are clearly some training setups that seem more dangerous than other training setups . . . .
Like, as an example, my guess is systems where a substantial chunk of the compute was spent on training with reinforcement learning in environments that reward long-term planning and agentic resource acquisition (e.g. many video games or diplomacy or various simulations with long-term objectives) sure seem more dangerous.

Any recommended reading on which training setups are safer? If none exist, someone should really write this up.

comment by Parpotom · 2023-10-30T13:51:05.028Z · LW(p) · GW(p)

Informatic developpers often say that the bug is usually situated in the chair/keyboqrd interface.

I fear that even very good RSP will be inefficient in case of sheer stupidity or deliberate malevolence.

There are already big threats:
- nuclear weapons. The only thing wich protect us is the fact that only governements can use them, witch mean that our security depends on the fact that said governements will be responsible enough. So far it worked, but there is no actual protection against really insane leaders.
- global warming. We know it since the begining of the century but we act as if we could indefinitely postpone the application of solutions, knowing full well that this is not true.

It doesn't bode well for the future.

Also one feature of RSPs would be to train IA so that they could not present major risks. But what if LLMs are developed as free software and anyone can train them in their own way? I don't see how we could control them or impose limits.

(English is not my mother tongue: please forgive my mistakes.)

comment by Ariel_ (ariel-g) · 2023-10-29T16:37:17.647Z · LW(p) · GW(p)

This was very interesting, looking forward to the follow up!

In the "AIs messing with your evaluations" (and checking for whether the AI is capable of/likely to do so) bit, I'm curious if there is any published research on this.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2023-10-29T17:21:00.054Z · LW(p) · GW(p)

The closest existing thing I'm aware of is password locked models [LW · GW].

Redwood Research (where I work) might end up working on this topic or similar sometime in the next year.

It also wouldn't surprise me if Anthropic puts out a paper on exploration hacking [AF · GW] somewhat soon.

comment by Zach Stein-Perlman · 2023-10-29T15:16:43.581Z · LW(p) · GW(p)

This is great. Some quotes I want to come back to:

a thing that I like that both the Anthropic RSP and the ARC Evals RSP post point to is basically a series of well-operationalized conditional commitments. One way an RSP could be is to basically be a contract between AI labs and the public that concretely specifies "when X happens, then we commit to do Y", where X is some capability threshold and Y is some pause commitment, with maybe some end condition.

instead of an RSP I would much prefer a bunch of frank interviews with Dario and Daniella where someone is like "so you think AGI has a decent chance of killing everyone, then why are you building it?". And in-general to create higher-bandwidth channels that people can use to understand what people at leading AI labs believe about the risks from AI, and when the benefits are worth it.

It seems pretty important to me to have some sort of written down and maintained policy on "when would we stop increasing the power of our models" and "what safety interventions will we have in place for different power levels".

I generally think more honest clear communication and specific plans on everything seems pretty good on current margins (e.g., RSPs, clear statements on risk, clear explanation of why labs are doing things that they know are risky, detailed discussion with skeptics, etc).

I do feel like there is a substantial tension here between two different types of artifacts here:
A document that is supposed to accurately summarize what decisions the organization is expecting to make in different circumstances
A document that is supposed to bind the organization to make certain decisions in certain circumstances
Like, the current vibe that I am getting is that RSPs are a "no-take-backsies" kind of thing. You don't get to publish an RSP saying "yeah, we aren't planning to scale" and then later on to be like "oops, I changed my mind, we are actually going to go full throttle".
And my guess is this is the primary reason why I expect organizations to not really commit to anything real in their RSPs and for them to not really capture what leadership of an organization thinks the tradeoffs are. Like, that's why the Anthropic RSP has a big IOU where the actually most crucial decisions are supposed to be.

Like, here is an alternative to "RSP"s. Call them "Conditional Pause Commitments" (CPC if you are into acronyms).
Basically, we just ask AGI companies to tell us under what conditions they will stop scaling or stop otherwise trying to develop AGI. And then also some conditions under which they would resume. [Including implemented countermeasures.] Then we can critique those.
This seems like a much clearer abstraction that's less philosophically opinionated about whether the thing is trying to be an accurate map of an organization's future decisions, or to what degree it's supposed to seriously commit an organization, or whether the whole thing is "responsible".

people working at AI labs should think through and write down the conditions under which they would loudly quit (this can be done privately, but maybe should be shared between like minded employees for common knowledge). Then, people can hopefully avoid getting frog-boiled.

What's up with "Responsible Scaling Policies"?

Contents

Naming and the Intentions Behind RSPs

Endorsements and Future Trajectories

How well can you tell if a given model is existentially dangerous?

Defense through "Control" vs "Lack of Propensities"

Summaries

9 comments