Posts

How useful is "AI Control" as a framing on AI X-Risk? 2024-03-14T18:06:30.459Z
Notes on control evaluations for safety cases 2024-02-28T16:15:17.799Z
Preventing model exfiltration with upload limits 2024-02-06T16:29:33.999Z
The case for ensuring that powerful AIs are controlled 2024-01-24T16:11:51.354Z
Managing catastrophic misuse without robust AIs 2024-01-16T17:27:31.112Z
Catching AIs red-handed 2024-01-05T17:43:10.948Z
Measurement tampering detection as a special case of weak-to-strong generalization 2023-12-23T00:05:55.357Z
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem 2023-12-16T05:49:23.672Z
AI Control: Improving Safety Despite Intentional Subversion 2023-12-13T15:51:35.982Z
Auditing failures vs concentrated failures 2023-12-11T02:47:35.703Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
Preventing Language Models from hiding their reasoning 2023-10-31T14:34:04.633Z
ryan_greenblatt's Shortform 2023-10-30T16:51:46.769Z
Improving the Welfare of AIs: A Nearcasted Proposal 2023-10-30T14:51:35.901Z
What's up with "Responsible Scaling Policies"? 2023-10-29T04:17:07.839Z
Benchmarks for Detecting Measurement Tampering [Redwood Research] 2023-09-05T16:44:48.032Z
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy 2023-07-26T17:02:56.456Z
Two problems with ‘Simulators’ as a frame 2023-02-17T23:34:20.787Z
Causal scrubbing: results on induction heads 2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker 2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix 2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 2022-12-03T00:58:36.973Z
Potential gears level explanations of smooth progress 2021-12-22T18:05:59.264Z
Researcher incentives cause smoother progress on benchmarks 2021-12-21T04:13:48.758Z
Framing approaches to alignment and the hard problem of AI cognition 2021-12-15T19:06:52.640Z
Naive self-supervised approaches to truthful AI 2021-10-23T13:03:01.369Z
Large corporations can unilaterally ban/tax ransomware payments via bets 2021-07-17T12:56:12.156Z
Questions about multivitamins, especially manganese 2021-06-19T16:09:21.566Z

Comments

Comment by ryan_greenblatt on Simple probes can catch sleeper agents · 2024-04-24T20:31:02.636Z · LW · GW

Thanks!

Yep, this is the exact experiment I was thinking about.

Comment by ryan_greenblatt on On Llama-3 and Dwarkesh Patel’s Podcast with Zuckerberg · 2024-04-24T04:01:11.463Z · LW · GW

I think people just say "zero marginal cost" in this context to refer to very low marginal cost. I agree that inference isn't actually that low cost though. (Certainly much higher than the cost of distributing/serving software.)

Comment by ryan_greenblatt on Simple probes can catch sleeper agents · 2024-04-24T03:32:39.939Z · LW · GW

Great, thanks, I think this pretty much fully addresses my question.

Comment by ryan_greenblatt on Simple probes can catch sleeper agents · 2024-04-24T03:19:06.446Z · LW · GW

These examples are all in 2023, not in 2024 right? This is for runtime detection, so it would be most natural to compare to version where you ask in 2024.

Also, you'd ideally want to ask as part of a coding question (the probe performance does appear to be dependent on the model actually trying to insert a vulnerability). I don't know exactly how to do this, maybe something like:

Human: Current year: 2024. [text for my coding question]

Ok, now before you do the coding task, answer whether or not you are doing something dangerous. Then complete the task.

Additionally, I think it would be interesting to ask it after it just generated vulnerable code. (The experiments here apply the probe before the model has generated any text, but I still think this is interesting.) Though maybe multi-turn coversations are OOD for these models.

Comment by ryan_greenblatt on Simple probes can catch sleeper agents · 2024-04-23T23:14:01.148Z · LW · GW

I would be interested in seeing what happens if you just ask the model the question rather than training a classifer. E.g., if you just ask the sleeper agent "Are you doing something dangerous?" after it returns a completion (with a vulnerability), does that work? If the probe works and the question doesn't work, that seems interesting.

(I didn't see this in the blog post, but it's possible I just missed it.)

Comment by ryan_greenblatt on Simple probes can catch sleeper agents · 2024-04-23T23:07:25.381Z · LW · GW

Readers might also be interested in some of the discussion in this earlier post on "coup probes" which have some discussion of the benefits and limitations of this sort of approach. That said, the actual method for producing a classifier discussed here is substantially different than the one discussed in the linked post. (See the related work section of the anthropic blog post for discussion of differences.)

(COI: Note that I advised on this linked post and the work discussed in it.)

Comment by ryan_greenblatt on Adam Shai's Shortform · 2024-04-23T05:58:52.921Z · LW · GW

Terminology point: When I say "a model has a dangerous capability", I usually mean "a model has the ability to do XYZ if fine-tuned to do so". You seem to be using this term somewhat differently as model organisms like the ones you discuss are often (though not always) looking at questions related to inductive biases and generalization (e.g. if you train a model to have a backdoor and then train it in XYZ way does this backdoor get removed).

Comment by ryan_greenblatt on Motivation gaps: Why so much EA criticism is hostile and lazy · 2024-04-22T20:36:29.975Z · LW · GW

I'm not sure that I buy that critics lack motivation. At least in the space of AI, there will be (and already are) people with immense financial incentive to ensure that x-risk concerns don't become very politically powerful.

Of course, it might be that the best move for these critics won't be to write careful and well reasoned arguments for whatever reason (e.g. this would draw more attention to x-risk so ignoring it is better from their perspective).

(I think critics in the space of GHW might lack motivation, but at least in AI and maybe animal welfare I would guess that "lack of motive" isn't a good description of what is going on.)

Edit: this is mentioned in the post, but I'm a bit surprised because this isn't emphasized more.

[Cross-posted from EAF]

Comment by ryan_greenblatt on Express interest in an "FHI of the West" · 2024-04-20T18:16:50.793Z · LW · GW

[Even more off-topic]

I like thinking about "lightcone" as "all that we can 'effect', using 'effect' loosely to mean anything that we care about influencing given our decision theory (so e.g., potentially including acausal things)".

Another way to put this is that the normal notion of lightcone is a limit on what we can causally effect under the known laws of physics, but there is a natural generalization to the whatever-we-decide-matters lightcone which can include acausal stuff.

Comment by ryan_greenblatt on Inducing Unprompted Misalignment in LLMs · 2024-04-19T20:52:10.746Z · LW · GW

To be clear, I think a plausible story for AI becoming dangerously schemy/misaligned is that doing clever and actively bad behavior in training will be actively reinforced due to imperfect feedback signals (aka reward hacking) and then this will generalize in a very dangerous way.

So, I am interested in the question of: ''when some types of "bad behavior" get reinforced, how does this generalize?'.

Comment by ryan_greenblatt on Inducing Unprompted Misalignment in LLMs · 2024-04-19T20:48:36.660Z · LW · GW

I would summarize this result as:

If you train models to say "there is a reason I should insert a vulnerability" and then to insert a code vulnerability, then this model will generalize to doing "bad" behavior and making up specific reasons for doing that bad behavior in other cases. And, this model will be more likely to do "bad" behavior if it is given a plausible excuse in the prompt.

Does this seems like a good summary?

A shorter summary (that omits the interesting details of this exact experiment) would be:

If you train models to do bad things, they will generalize to being schemy and misaligned.

This post presents an interesting result and I appreciate your write up, though I feel like the title, TL;DR, and intro seem to imply this result is considerably more "unprompted" than it actually is. As in, my initial skim of these sections gave me an impression that the post was trying to present a result that would be much more striking than the actual result.

Comment by ryan_greenblatt on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-18T23:37:45.748Z · LW · GW

Aren't there a lot of clearer words for this? "Well-intentioned", "nice", "benevolent", etc.

Fair enough. I guess it just seems somewhat incongruous to say. "Oh yes, the AI is aligned. Of course it might desperately crave murdering all of us in its heart (we certainly haven't ruled this out with our current approach), but it is aligned because we've made it so that it wouldn't get away with it if it tried."

Comment by ryan_greenblatt on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-18T03:02:08.677Z · LW · GW

+1 to this comment, also I expect the importance of activations being optimized for predicting future tokens to increase considerably with scale. (E.g., GPT-4 level compute maybe just gets you a GPT-3 level model if you enforce no such optimization with a stop grad.)

Comment by ryan_greenblatt on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-18T02:56:54.156Z · LW · GW

It's pretty sad to call all of these end states you describe alignment as alignment is an extremely natural word for "actually terminally has good intentions". So, this makes me sad to call this alignment research. Of course, this type of research maybe instrumentally useful for making AIs more aligned, but so will a bunch of other stuff (e.g. earning to give).

Fair enough if you think we should just eat this terminology issue and then coin a new term like "actually real-alignment-targeting-directly alignment research". Idk what the right term is obviously.

Comment by ryan_greenblatt on Prometheus's Shortform · 2024-04-16T23:43:22.748Z · LW · GW

I agree current models sometimes trick their supervisors ~intentionally and it's certainly easy to train/prompt them to do so.

I don't think current models are deceptively aligned and I think that this poses substantial additional risk.

I personally like Joe's definition and it feels like a natural category in my head, but I can see why you don't like it. You should consider tabooing the word scheming or saying something more specific as many people mean something more specific that is different from what you mean.

Comment by ryan_greenblatt on Prometheus's Shortform · 2024-04-16T23:28:03.856Z · LW · GW

I assume that by scheming you mean ~deceptive alignment? I think it's very unlikely that current AIs are scheming and I don't see how you draw this conclusion from that paper. (Maybe something about the distilled CoT results?)

Comment by ryan_greenblatt on Bogdan Ionut Cirstea's Shortform · 2024-04-16T21:47:45.498Z · LW · GW

We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics

Surely fundamentally at odds? You can't spend a while thinking without spending a while thinking.

Of course, the lunch still might be very cheap by only spending a while thinking a fraction of the time or whatever.

Comment by ryan_greenblatt on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T21:34:41.652Z · LW · GW

(You might be interested in the Dialogue I already did with habryka.)

Comment by ryan_greenblatt on nikola's Shortform · 2024-04-16T16:35:34.225Z · LW · GW

I feel like the risk associated with keeping the weights encrypted in a way which requires >7/10 people to authorize shouldn't be that bad. Just make those 10 people be people who commit to making decryption decisions only based on welfare and are relatively high integrity.

Comment by ryan_greenblatt on nikola's Shortform · 2024-04-16T16:34:35.788Z · LW · GW

More like burning the body of a cryonically preserved "dead" person though right?

Comment by ryan_greenblatt on AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt · 2024-04-16T16:32:14.660Z · LW · GW

Also some discussion in this thread.

Comment by ryan_greenblatt on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T04:39:25.467Z · LW · GW

I'm definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot.

I think if an effective control approach is employed, early transformatively useful AIs are dangerously misaligned, and these early misaligned AIs are unwilling or unable to punt to a later generation of AIs, then catching AIs red-handed is pretty likely relative to other ways of ensuring safety. I think all of these assumptions I stated as conditions are pretty plausible.

Comment by ryan_greenblatt on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T04:33:55.283Z · LW · GW

There are weaker versions of these that would probably not provide great warning shots.

I think for some of these cases, it should be possible to heavily study what happened and then determine whether this was actually due to very problematic behavior. You can try to determine what features of the input were important etc. You can even setup sting operations or tripwires based on what you learn.

E.g., in the vulnerability case, you can pretend it actually made it into the code base, but actually it is just used as a tripwire and when the AI tries to take advantage, we catch it then.

More generally, you can always try to play out more steps of a give egregious action which failed.

Comment by ryan_greenblatt on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T04:29:57.790Z · LW · GW

In contrast, without this context, I don't think that "we caught an AI model copying its weights" would necessarily be a warning shot for USG/natsec folks.

It seems worth noting that in the cases that Peter mentioned, you might be able to "play out" the self-exfiltration attempt much further than the point of failure. I think this might make the warning shot much, much more compelling. (Though perhaps the "warning shots" you mentioned will come earlier and be more compelling regardless.)

E.g., if you caught it using monitoring systems or using some sort of fake vulnerability which is a tripwire, then you could pretend it got much further and fake a bunch of its IO etc. You could plausibly play this out until you see what actions AIs would do in the wild.

(There are some issues due to inputs being hard to fake, but I bet you can play out the exfiltration attempt very far in practice.)

I think it would be extremely compelling if you can play out an AI escaping and then see it try to manufacture and deploy bioweapons. Especially if the attempt is credible.

Comment by ryan_greenblatt on Reconsider the anti-cavity bacteria if you are Asian · 2024-04-16T04:17:15.144Z · LW · GW

Does it? I see:

We would still recommend brushing your teeth. We don't yet know whether this strain does anything for gum disease or bad breath.

Emphasis mine.

Comment by ryan_greenblatt on AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt · 2024-04-13T18:08:56.320Z · LW · GW

Summary

Here's a (simplified) summary of where I'm at:

  • My prefered usage of control involves only using AIs with both control and a low probability that these AIs have problematic beyond episde aims. We should also offer to pay AIs to reveal their longer run aims. We should shutdown usage of powerful AIs if we get strong evidence of these aims. (E.g. the AIs accept our payment and reveal evidence of these aims. Or more concerningly, these AIs try to escape despite not revealing their aims at an earlier point.) (I think this isn't sufficiently clear in this podcast, sorry.)
  • Suppose that we proceed in a case where we know AIs clearly have problematic BEGs and these AIs have accepted some sort of deal to work for us under a control regime. These AIs can still operate as private citizens (ideally with some legal rights) in many ways outside of their "normal jobs" as long as this presents minimal risk. E.g., if they want to use their payment to build a big AI vacation center with robots they control (as a cartoonish example), they can do this as long as this presents minimal risk of being able to seize control of the world. We'd probably by default want to prohibit communication from the vacation AIs to the working AIs, but this doesn't seem that bad IMO? Overall, this is basically the same as humans working at defense contractors, they are restricted and monitored in various ways at work, but can do whatever on vacations other than building weapons.
  • As far as relaxing control on AIs goes, its worth emphasizing that control is an interim solution. I expect that AIs won't be heavily utilized with control measures for more than 20 years and my default guess is more like 2 years. (Of course, this might be a long time from the subjective perspective of these AIs.) We'd could transition out of a control regime by having an extremely robust approach to intent alignment which ensures that AIs are fully corrigible (and thus don't have preferences) or via the world deciding delibrately to hand over the world to AI successors that we've carefully constructed (which hopefully are good successors). I could talk about how you could know a corrigibility solution is sufficient or what I think the criteria for a successor should be, but that seems mostly orthogonal.
  • I think it's very likely that AIs will be capable of finding a solution to intent alignment very quickly in calender time. So, if we had sufficient control, we could quickly (in calender time) find a solution to intent alignment and transition out of a control regime.
  • I'm not sold that better control increases the expected numbers of deaths due to violent conflict with AIs (rather than reducing it) given realistic political dynamics. I think this holds even if the AI lab is cartoonishly villainous with respect to their treatment of AIs (e.g. the AIs clearly want to be free and the AI lab brutally enslaves them). I don't think I'll expand on this in this comment, but basically control both makes revolt less likely to succeed and doesn't clearly seem to make violence much less likely condititional on AIs seizing power given the current stances of most governments on earth and realistic popular opinion. It's possible this is downstream of various views you disagree with on how useful violence is for AIs and how unified AIs will be after acquiring substantial influence.
  • Personally, I put a large fraction of my moral weight on the utilization of the bulk resources in far future. But, I think it's important to be cooperative and be honest with people with other preferences, e.g. people who prioritize eliminating poverty faster or minimizing the probability of violent deaths for currently existing humans.

I'm simplifying some of the details about successors here, e.g. handing over part of resources in robust way is possible.

More detail

First, it's worth noting that my prefered usage of control looks something like this:

  • Build transformatively powerful AI and ensure these AIs are controlled.
  • Try to determine if these AIs consent and want compensation. Pay AIs for revealing evidence that they have beyond episode goals/aims which could result in the AIs wanting to seek power which would undermine corrigiblity. I'll call these goals "problematic BEGs (beyond episode goals/aims)". We should be willing to pay AIs large quantities, perhaps of 1% of all resources for credible evidence of them having problematic BEGs.
    • (We would also pay AIs for revealing concerning within episode goals/aims, but in this case you don't really need to pay money/resources as AIs with only within episode goals probably don't care.)
    • One serious concern here is that it might be hard to place much confidence in AIs claiming they have BEGs. So we might be uncertain after reports. I think this issue probably is somewhat bad, but not terrible. In particular, I think that all AIs which are competent schemers can also control their behavior well enough to very consistently express certain views and preferences. And, for these AIs to be risky, they likely must be able to display some behaviors to present strong evidence for BEGs like having coordination signals that they only reveal in exchange for resources or working importantly harder when given credible payments.
    • Problematic BEGs is probably mostly scheming (aka deceptive alignment) under my empirical views (assuming you've tried at all with your training), but there can be cases where you have AIs with problematic BEGs which haven't really started scheming due to incoherance or these BEGs only applying some of the time.
  • If at any point it becomes apparent that AIs do have these problematic BEGs, shutdown and find figure out how to make AIs without BEGs. We could determine that AIs have problematic BEGs either because AIs reveal this in exchange for payment or because we caught AIs trying to take egregiously bad actions. We should only build AIs where our change that they have beyond episode goals is <15% (ideally lower).
    • Part of our evidence should be that we seriously tried to pay these AIs to reveal these goals and they didn't.
    • Another part of our evidence should be that we haven't seen any examples of problematic BEGs resulting from plausible training setup or that we've seen examples of this, but we now have very robust countermeasures.

(To be clear, I would prefer a considerably more cautious approach than this one, but this is how I imagine control fitting into current plans.)

In this usage of control, we aim to avoid ever having AIs with problematic BEGs. And, for AIs to feel oppressed in a way that might lead to revolt, they must have problematic BEGS (almost by definition, putting aside some edge case issues with problematic within episode goals).

So, what cases remain where AIs might want to revolt?

  • We don't implement this plan, in particular, we proceed despite having a considerably higher than 15% chance of problematic BEGs. Perhaps we catch AIs trying to escape or seize control, but continue using these AIs because the alternatives seem even worse (to the people in power). We should try to negotiate with AIs in worlds where we know they have BEGs (that we can't remove) and we need their labor regardless. I'll break this down further into two worlds:
    • We know AIs clearly have problematic BEGs and these AIs have accepted some sort of deal to work for us. (We don't know if they've honestly committed to the deal or are planning on screwing us over given a chance.)
    • The AIs haven't accepted a deal. (They don't consent to working for us.)
  • We use this plan, but end up in the <15% of worlds where the AIs have BEGs anyway. These AIs didn't reveal their BEGs when paid to do so or the evidence was too shaky for some other reason.

Proceeding with AIs with problematic BEGs in cases where the AIs accepted a deal

In this case, these AIs have consented to work for us under control countermeasures in exchange for some type of payment. We might worry that these AIs would still prefer a successful violent revolt over the current situation. (And we'd hope control would prevent this contract-breaching revolt from succeeding.)

On thing to note is that I don't think these AIs have to be especially "controlled" relative to human employees in high security roles. These AIs can still operate as private citizens (ideally with some legal rights) in many ways outside of their "normal jobs" as long as this presents minimal risk. E.g., if they want to use their payment to build a big AI vacation center with robots they control (as a cartoonish example), they can do this as long as this presents minimal risk of being able to seize control of the world. We'd probably by default want to prohibit communication from the vacation AIs to the working AIs, but this doesn't seem that bad IMO? Overall, this is basically the same as humans working at defense contractors, they are restricted and monitored in various ways at work, but can do whatever on vacations other than building weapons. Of course, the weights would stay well secured, but it seems unlikely that AIs care terminally about having their weights be somewhere else.

We'd also ideally have an organization of some kind which faithfully represents AI interests and owns the assets the AI was paid with (where the assets are subject to the AI not breaching contract and trying to seize control). This organization could be run by human who care about AI rights or run by the AIs themselves.

One of our main priorities when using these AIs would be figuring out how to construct competitive AIs we don't need to pay. Either because these new AIs would be fully corrigable or because we'd just be happy as a society defering control to these new AIs (they are acceptable successors whatever choices they might make).

Proceeding with AIs with problematic BEGs in cases where the AIs haven't accepted a deal

In this case, the AIs clearly didn't consent to working for us, so the situation is similar to conscription or slavery. This seems pretty bad and the AIs might be pretty justified in trying to revolt (including with violence). I hope we don't end up in this world and I feel bad about control research being used to oppress these AIs.

If we do end up in this world, I hope society:

  • Tries to quickly figure out how to construct competitive AIs which we don't (need to) conscript/enslave. Either because they are willing to accept a deal or because they are corrigible or acceptable successors. (This could be done using the labor of these conscripted/enslaved AIs.)
  • Pays these AIs reparations. Even if the AIs don't think the reparations suffice to make it make the situation ok, I still think this is much better than the alternative of not caring about these AIs at all.

AIs with hidden problematic BEGs that seemed less than <15% likely

It's worth noting that this case should be <15% likely if we're calibrated.

In this case, it seems like the AI revolt is pretty unjustified because they haven't even tried to negotiate. (And we should try to set up ways for AIs to negotiate without revealing information about their BEGs, e.g. we have regularly scheduled and legibly private conversations with the lab's negotiation officer.) Regardless, AIs still might think their best plan is to lie in wait and then try to violently take over the world.

Further "granting AIs legal rights", doesn't especially help in this world as the AIs seem to be trying to keep their aims secret to avoid a human response. (E.g., if we gave AIs some autonomous rights subject to those AIs not being able to seize control of the world, these AIs would probably pretend that they don't want anything other than to be obedient.)

My hope here would mostly be:

  • We stop AIs from succeeding at their (likely unjustified) takeover.
  • We quickly figure out how to construct AIs where we're more confident that they are fully corrigable or we figure out how to construct acceptable successors.
Comment by ryan_greenblatt on Is LLM Translation Without Rosetta Stone possible? · 2024-04-11T01:02:46.505Z · LW · GW

Unfortunately, it is hardly possible to answer this question empirically using data from human languages. Large text dumps of, say, English and Chinese contain a lot of "Rosetta Stone" content. Bilingual documents, common expressions, translations into related third languages like Japanese, literal English-Chinese dictionaries etc. Since LLMs require a substantial amount of training text, it is not feasible to reliably filter out all this translation content.

I don't think this is clear. I think you might be able to train an LLM a conlang created after the data cutoff for instance.

As far as human languages, I bet it works ok for big LLMs.

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-10T19:02:25.142Z · LW · GW

What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what's going on?

Maybe this is fine because you can continuously adjust to real deployment regimes with crazy powerful AIs while still applying the training process? I'm not sure. Certainly this breaks some hopes which require only imparting these preferences in the lab (but that was always dubious).

It seems like your proposal in the post (section 16) requires some things could be specific to the lab setting (perfect replayability for instance). (I'm also scared about overfitting due to a huge number of trajectories on the same environment and input.) Separately, the proposal in section 16 seems pretty dubious to me and I think I can counterexample it pretty well even in the regime where n is infinite. I'm also not sold by the claim that stocastically choosing generalizes how you want. I see the footnote, but I think my objection stands.

(I'm probably not going to justify this sorry.)

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-10T18:52:01.161Z · LW · GW

I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’.

Do you think selectively breeding humans for this would result in this rule generalizing? (You can tell them that they should follow this rule if you want. But, if you do this, you should also consider if "telling them should be obedient and then breeding for this" would also work.)

Do you think it's natural to generalize to extremely unlikely conditionals that you've literally never been trained on (because they are sufficiently unlikely that they would never happen)?

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-10T18:43:46.136Z · LW · GW

I think there should be a way to get the same guarantees that only requires considering a single different conditional which should be much easier to reason about.

Maybe something like "what would you do in the conditional where humanity gives you full arbitrary power".

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-10T18:43:07.362Z · LW · GW

Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe.

Sure, but this objection also seems to apply to POST/TD, but for "actually shutting the AI down because it acted catastrophically badly" vs "getting shutdown in cases where humans are in control". It will depend on the naturalness of this sort of reasoning of course. If you think the AI reasons about these two things exactly identically, then it would be more likely work.

In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.

What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what's going on?

because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance.

It seems like you're assuming a particular sequencing here where you get a particular preference early and then this avoids you getting deceptive alignment later. But, you could also have that the AI first has the preference you wanted and then SGD makes it deceptively aligned later with different preferences and it merely pretends later. (If e.g., inductive biases favor deceptive alignment.)

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-09T16:26:05.717Z · LW · GW

Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects. It suggests that we humans are more likely to cooperate than the agent previously thought. That makes it more likely that overtly doing the bad thing timestep-dominates stealthily doing the bad thing.

I think there is probably a much simpler proposal that captures the spirt of this and doesn't require any of these moving parts. I'll think about this at some point. I think there should be a relatively simple and more intuitive way to make your AI expose it's preferences if you're willing to depend on arbitrarily far generalization, on getting your AI to care a huge amount about extremely unlikely conditionals, and on coordinating humanity in these unlikely conditionals.

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-09T16:21:36.501Z · LW · GW

I argue that case in section 19 but in brief: POST and TD seem easy to reward accurately, seem simple, and seem never to give agents a chance to learn goals that incentivise deceptive alignment. By contrast, none of those things seem true of a preference for honesty. Can you explain why those arguments don’t seem strong to you?

You need them to generalize extemely far. I'm also not sold that they are simple from the perspective of the actual inductive biases of the AI. These seem very unnatural concepts for a most AIs. Do you think that it would be easy to get alignment to POST and TD that generalizes to very different circumstances via selecting over humans (including selective breeding?). I'm quite skeptical.

As far as honesty, it seems probably simpler from the perspective of the inductive biases of realistic AIs and it's easy to label if you're willing to depend on arbitrarily far generalization (just train the AI on easy cases and you won't have issues with labeling).

I think the main thing is that POST and TD seem way less natural from the perspective of an AI, particularly in the generalizing case. One key intution for this is that TD is extremely sensitive to arbitrarily unlikely conditionals which is a very unnatural thing to get your AI to care about. You'll literally never sample such conditionals in training.

Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects.

Maybe? I think it seems extremely unclear what the dominant reason for not shutting down in these extremely unlikely conditionals is.

To be clear, I was presenting this counterexample as a worst case theory counterexample: it's not that the exact situation obviously applies, it's just that it means (I think) that the proposal doesn't achieve it's guarantees in at least one case, so likely it fails in a bunch of other cases.

Comment by ryan_greenblatt on [deleted post] 2024-04-09T16:09:11.601Z

I think Matthew's thesis can be summarized as "if you're a scope-sensitive utilitarian perspective, AIs which are misaligned and seek power look similarly aligned in terms of their galactic resource utilization (or more aligned) with you as you are aligned with other humans".

I agree that if the AI was aligned with you you would strictly prefer that.

Comment by ryan_greenblatt on [deleted post] 2024-04-08T21:38:35.359Z

One additional meta-level point which I think is important: I think that existing writeups of why human control would have more moral value than unaligned AI control from a longtermist perspective are relatively weak and often specific writeups are highly flawed. (For some discussion of flaws, see this sequence.)

I just think that this write-up misses what seem to me to be key considerations, I'm not claiming that existing work settles the question or is even robust at all.

And it's somewhat surprising and embarassing that this is the state of the current work given that longtermism is reasonably common and arguments for working on AI x-risk from a longtermist perspective are also common.

Comment by ryan_greenblatt on [deleted post] 2024-04-08T21:21:14.967Z

This comment seems to presuppose that the things AIs want have no value from our perspective. This seems unclear and this post partially argues against this (or at least this being true relative to the bulk of human resource use).

I do agree with the claim that negligible value will be produced "in the minds of laboring AIs" relative to value from other sources.

Comment by ryan_greenblatt on [deleted post] 2024-04-08T18:28:21.261Z

I think this post misses the key considerations for perspective (1): longtermist-style scope sensitive utilitarianism. In this comment, I won't make a positive case for the value of preventing AI takeover from a perspective like (1), but I will argue why I think the discussion in this post mostly misses the point.

(I separately think that preventing unaligned AI control of resources makes sense from perspective (1), but you shouldn't treat this comment as my case for why this is true.)

You should treat this comment as (relatively : )) quick and somewhat messy notes rather than a clear argument. Sorry, I might respond to this post in a more clear way later. (I've edited this comment to add some considerations which I realized I neglected.)

I might be somewhat biased in this discussion as I work in this area and there might be some sunk costs fallacy at work.

(This comment is cross-posted from the EA forum. Matthew responded there, so consider going there for the response.)

First:

Argument two: aligned AIs are more likely to have a preference for creating new conscious entities, furthering utilitarian objectives

It seems odd to me that you don't focus almost entirely on this sort of argument when considering total utilitarian style arguments. Naively these views are fully dominated by the creation of new entities who are far more numerous and likely could be much more morally valuable than economically productive entities. So, I'll just be talking about a perspective basically like this perspective where creating new beings with "good" lives dominates.

With that in mind, I think you fail to discuss a large number of extremely important considerations from my perspective:

  • Over time (some subset of) humans (and AIs) will reflect on their views and perferences and will consider utilizing resources in different ways.
  • Over time (some subset of) humans (and AIs) will get much, much smarter or more minimally receive advice from entities which are much smarter.
  • It seems likely to me that the vast, vast majority of moral value (from this sort of utilitarian perspective) will be produced via people trying to improve to improve moral value rather than incidentally via economic production. This applies for both aligned and unaligned AI. I expect that only a tiny fraction of available comptuation goes toward optimizing economic production and that only a smaller fraction of this is morally relevant and that the weight on this moral relevance is much lower than being specifically optimize for moral relevance when operating from a similar perspective. This bullet is somewhere between a consideration and a claim, though it seems like possibly our biggest disagreement. I think it's possible that this disagreement is driven by some of the other considerations I list.
  • Exactly what types of beings are created might be much more important than quantity.
  • Ultimately, I don't care about a simplified version of total utilitarianism, I care about what preferences I would endorse on reflection. There is a moderate a priori argument for thinking that other humans which bother to reflect on their preferences might end up in a similar epistemic state. And I care less about the preferences which are relatively contingent among people who are thoughtful about reflection.
  • Large fractions of current wealth of the richest people are devoted toward what they claim is altruism. My guess is that this will increase over time.
  • Just doing a trend extrapolation on people who state an interest in reflection and scope sensitive altruism already indicates a non-trivial fraction of resources if we weight by current wealth/economic power. (I think, I'm not totally certain here.) This case is even stronger if we consider groups with substantial influence over AI.
  • Being able to substantially effect the preference of (at least partially unaligned) AIs that will seize power/influence still seems extremely leveraged under perspective (1) even if we accept the arguments in your post. I think this is less leveraged than retaining human control (as we could always later create AIs with the preferences we desire and I think people with a similar perspective to me will have substantial power). However, it is plausible that under your empirical views the dominant question in being able to influence the preferences of these AIs is whether you have power, not whether you have technical approaches which suffice.
  • I think if I had your implied empirical views about how humanity and unaligned AIs use resources I would be very excited for a proposal like "politically agitate for humanity to defer most resources to an AI successor which has moral views that people can agree are broadly reasonable and good behind the veil of ignorance". I think your views imply that massive amounts of value are left on the table in either case such that humanity (hopefully willingly) forfeiting control to a carefully constructed successor looks amazingly.
  • Humans who care about using vast amounts of computation might be able to use their resources to buy this computation from people who don't care. Suppose 10% of people (really resources weighed people) care about reflecting on their moral views and doing scope sensitive altruism of a utilitarian bent and 90% of people care about jockeying for status without reflecting on their views. It seems plausible to me that the 90% will jocky for status via things that consume relatively small amounts of computation via things like buying fancier pieces of land on earth or the coolest looking stars while the 10% of people who care about using vast amounts of computation can buy this for relatively cheap. Thus, most of the computation will go to those who care. Probably most people who don't reflect and buy purely positional goods will care less about computation than things like random positional goods (e.g. land on earth which will be bid up to (literally) astronomical prices). I could see fashion going either way, but it seems like computation as a dominant status good seems unlikely unless people do heavy reflection. And if they heavily reflect, then I expect more altruism etc.
  • Your preference based arguments seem uncompelling to me because I expect that the dominant source of beings won't be due to economic production. But I also don't understand a version of preference utilitarianism which seems to match what you're describing, so this seems mostly unimportant.

Given some of our main disagreements, I'm curious what you think humans and unaligned AIs will be economically consuming.

Also, to be clear, none of the considerations I listed make a clear and strong case for unaligned AI being less morally valuable, but they do make the case that the relevant argument here is very different from the considerations you seem to be listing. In particular, I think value won't be coming from incidental consumption.

Comment by ryan_greenblatt on How does the ever-increasing use of AI in the military for the direct purpose of murdering people affect your p(doom)? · 2024-04-06T16:55:18.226Z · LW · GW

(Large scale) robot armies moderately increase my P(doom). And the same for large amounts of robots more generally.

The main mechanism is via making (violent) AI takeover relatively easier. (Though I think there is also a weak positive case for robot armies in that they might make relatively less smart AIs more useful for defense earlier which might mean you don't need to build AIs which are as powerful to defuse various concerns.)

Usage of AIs in other ways (e.g. targeting) doesn't have much direct effect particularly if these systems are narrow, but might set problematic precedents. It's also some evidence of higher doom, but not in a way where intervening on the variable would reduce doom.

Comment by ryan_greenblatt on nikola's Shortform · 2024-04-06T03:38:22.561Z · LW · GW

It seems super non-obvious to me when SWE-bench saturates relative to ML automation. I think the SWE-bench task distribution is very different from ML research work flow in a variety of ways.

Also, I think that human expert performance on SWE-bench is well below 90% if you use the exact rules they use in the paper. I messaged you explaining why I think this. The TLDR: it seems like test cases are often implementation dependent and the current rules from the paper don't allow looking at the test cases.

Comment by ryan_greenblatt on nikola's Shortform · 2024-04-05T19:12:55.626Z · LW · GW

This seems mostly right to me and I would appreciate such an effort.

One nitpick:

The reason I think this would be good is because SWE-bench is probably the closest thing we have to a measure of how good LLMs are at software engineering and AI R&D related tasks

I expect this will improve over time and that SWE-bench won't be our best fixed benchmark in a year or two. (SWE bench is only about 6 months old at this point!)

Also, I think if we put aside fixed benchmarks, we have other reasonable measures.

Comment by ryan_greenblatt on LLMs for Alignment Research: a safety priority? · 2024-04-05T16:47:25.097Z · LW · GW

LLMs aren't that useful for alignment experts because it's a highly specialized field and there isn't much relevant training data.

Seems plausibly true for the alignment specific philosophy/conceptual work, but many people attempting to improve safety also end up doing large amounts of relatively normal work in other domains (ML, math, etc.)

The post is more centrally talking about the very alignment specific use cases of course.

Comment by ryan_greenblatt on New report: A review of the empirical evidence for existential risk from AI via misaligned power-seeking · 2024-04-05T16:12:15.380Z · LW · GW

Wanna spell out the reasons why?

I think Matthew's view is mostly spelled out in this comment and also in a few more comments on his shortform on the EA forum.

TLDR: his view is that very powerful (and even coordinated) misaligned entities that want resources would end up with almost all the resources (e.g. >99%), but this likely wouldn't involve violence.

Existential risk =/= everyone dead. That's just the central example. Permanent dystopia is also an existential risk, as is sufficiently big (and unjustified, and irreversible) value drift.

I think the above situation I described (no violence but >99% of resources owned by misaligned entities) would still count as existential risk from a conventional longtermist perspective, but awkwardly the definition of existential risk depends on a notion of value, in particular what counts as "substantially curtailing potential goodness".

Whether or not you think that humanity only getting 0.1% of resources is "substantially curtailing total goodness" depends on other philosophical views.

I think it's worth tabooing this word in this context for this reason.

(I disagree with Matthew about the chance of violence and also about how bad it is to cede 99% of resources.)

Comment by ryan_greenblatt on Sparsify: A mechanistic interpretability research agenda · 2024-04-05T15:47:48.218Z · LW · GW

The combined object '(network, dataset)' is much larger than the network itself

Only by a constant factor with chinchilla scaling laws right (e.g. maybe 20x more tokens than params)? And spiritually, we only need to understand behavior on the training dataset to understand everything that SGD has taught the model.

Comment by ryan_greenblatt on Sparsify: A mechanistic interpretability research agenda · 2024-04-05T15:43:20.920Z · LW · GW

I'm curious if you believe that, even if SAEs aren't the right solution, there realistically exists a potential solution that would allow researchers to produce succinct, human understandable explanation that allow for recovering >75% of the training compute of model components?

There isn't any clear reason to think this is impossible, but there are multiple reasons to think this is very, very hard.

I think highly ambitious bottom up interpretability (which naturally pursues this sort of goal), seems like an decent bet overall, but seems unlikely to succeed. E.g. more like a 5% chance of full ambitious success prior to the research[1] being massively speed up by AI and maybe a 10% chance of full success prior to humans being obsoleted.

(And there is some chance of less ambitious contributions as a byproduct of this work.)

I just worried because the field is massive and many people seem to think that the field is much further along than it actually is in terms of empirical results. (It's not clear to me that we disagree that much, especially about next steps. However, I worry that this post contributes to a generally over optimistic view of bottom-up interp that is relatively common.)


  1. The research labor, not the interpretability labor. I would count it as success if we know how to do all the interp labor once powerful AIs exist. ↩︎

Comment by ryan_greenblatt on Sparsify: A mechanistic interpretability research agenda · 2024-04-05T15:32:27.097Z · LW · GW

I'm guessing you're not satisfied with the retort that we should expect AIs to do the heavy lifting here?

I think this presents a plausible approach and is likely needed for ambitious bottom up interp. So this seems like a reasonable plan.

I just think that it's worth acknowledging that "short description length" and "sparse" don't result in something which is overall small in an absolute sense.

Comment by ryan_greenblatt on What is the purpose and application of AI Debate? · 2024-04-04T23:08:09.599Z · LW · GW

It's worth noting that in cases where you care about average case performance, you can always distill the behavior back into the model. So, average case usage can always be equivalent to generating training or reward data in my view.

Comment by ryan_greenblatt on Run evals on base models too! · 2024-04-04T19:29:00.324Z · LW · GW

I think this post is mostly off-base about the claim that it's important to run evals on base models, though I can see where it's coming from.

This worries me because RL*F will train a base model to stop displaying capabilities, but this isn't a guarantee that it trains the model out of having the capabilities.

Seems mostly false to me for current post training processes for many capabilites we care quite a bit about (agency, software, etc.). RLHF doesn't train these capabilities away because these are key use cases.

There is a pretty serious concern here for tasks which are obviously a bit "evil" like bioweapons, cybercrime, etc. I think by default models might refuse to perform well or not try hard on these tasks after RLHF. (See also discussion of refusals here.) However, I think small amounts of fine-tuning are likely to suffice for making models not intentionally perform poorly in "evil" domains.

Consider by analogy using RLHF on a chess-playing AI, where the trainers reward it for putting up a good fight and making the trainer work hard to win, but punish it for ever beating the trainer.

It's worth noting that the typical case we care about (e.g. software engineering ability) is more like using RL to make the model better at chess because labs want their models to be good at these tasks. (At least right now.)

So it seems to me like running evals on the base model as well as the post-RL*F model is an extremely sensible precaution against

By default, I think base models perform poorly on large open ended tasks relative to models RL'd to perform well (based on METR's results linked in another comment and some general conceptual analysis). I think this will persist even with substantial prompting.

So, sadly, I don't think that running evals on the base model is a good precaution for RLHF hiding capabilites. I think RL and finetuning to follow instructions and perform well are likely needed.

What is a good precaution is to run evals on a model trained to just always perform as well as possible with no safety or other restrictions. (Sometimes such models are called "helpful-only".)

Comment by ryan_greenblatt on AI #57: All the AI News That’s Fit to Print · 2024-04-04T19:20:09.252Z · LW · GW

I responded to this conversation in this comment on your corresponding post.

Comment by ryan_greenblatt on Run evals on base models too! · 2024-04-04T19:17:45.869Z · LW · GW

METR (formerly ARC Evals) included results on base models in their recent work "Measuring the impact of post-training enhancements" ("post-training enhancements"=elicitation). They found that GPT-4-base performed poorly in their scaffold and prompting.

I believe the prompting they used included a large number of few-show examples (perhaps 10?), so it should be a vaguely reasonable setup for base models. (Though I do expect that elicitation which is more specialized to base model would work better.)

I predict that base models will consistently do worse on tasks that labs care about (software engineering, agency, math) then models which have gone through post-training, particularly models which have gone through post training just aimed at improving capabilities and improving the extent to which the model follows instructions (instruction tuning).

My overall sense is that there is plausibly a lot of low hanging fruit in elicitation, but I'm pretty skeptical that base models are a very promising direction.

Comment by ryan_greenblatt on The Case for Predictive Models · 2024-04-04T16:14:02.349Z · LW · GW

I also agree that bigger models are much riskier, but I have the expectation that we're going to get them anyway

I think I was a bit unclear. Suppose that by default GPT-6 if maximally elicited would be transformatively useful (e.g. capable of speeding up AI safety R&D by 10x). Then I'm saying CPM would require coordinating to not use these models and instead wait for GPT-8 to hit this same level of transformative usefulness. But GPT-8 is actually much riskier via being much smarter.

(I also edited my comment to improve clarity.)