Posts

How useful is "AI Control" as a framing on AI X-Risk? 2024-03-14T18:06:30.459Z
Notes on control evaluations for safety cases 2024-02-28T16:15:17.799Z
Preventing model exfiltration with upload limits 2024-02-06T16:29:33.999Z
The case for ensuring that powerful AIs are controlled 2024-01-24T16:11:51.354Z
Managing catastrophic misuse without robust AIs 2024-01-16T17:27:31.112Z
Catching AIs red-handed 2024-01-05T17:43:10.948Z
Measurement tampering detection as a special case of weak-to-strong generalization 2023-12-23T00:05:55.357Z
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem 2023-12-16T05:49:23.672Z
AI Control: Improving Safety Despite Intentional Subversion 2023-12-13T15:51:35.982Z
Auditing failures vs concentrated failures 2023-12-11T02:47:35.703Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
Preventing Language Models from hiding their reasoning 2023-10-31T14:34:04.633Z
ryan_greenblatt's Shortform 2023-10-30T16:51:46.769Z
Improving the Welfare of AIs: A Nearcasted Proposal 2023-10-30T14:51:35.901Z
What's up with "Responsible Scaling Policies"? 2023-10-29T04:17:07.839Z
Benchmarks for Detecting Measurement Tampering [Redwood Research] 2023-09-05T16:44:48.032Z
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy 2023-07-26T17:02:56.456Z
Two problems with ‘Simulators’ as a frame 2023-02-17T23:34:20.787Z
Causal scrubbing: results on induction heads 2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker 2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix 2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 2022-12-03T00:58:36.973Z
Potential gears level explanations of smooth progress 2021-12-22T18:05:59.264Z
Researcher incentives cause smoother progress on benchmarks 2021-12-21T04:13:48.758Z
Framing approaches to alignment and the hard problem of AI cognition 2021-12-15T19:06:52.640Z
Naive self-supervised approaches to truthful AI 2021-10-23T13:03:01.369Z
Large corporations can unilaterally ban/tax ransomware payments via bets 2021-07-17T12:56:12.156Z
Questions about multivitamins, especially manganese 2021-06-19T16:09:21.566Z

Comments

Comment by ryan_greenblatt on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T04:39:25.467Z · LW · GW

I'm definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot.

I think that if an effective control approach is employed, if early transformatively useful AIs are dangerously misaligned, and if these early misaligned AIs are unwilling or unable to punt to a later generation of AIs, then catching AIs red-handed is pretty likely relative to other ways of ensuring safety. All of the assumptions I stated as conditions seem pretty plausible to me.

Comment by ryan_greenblatt on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T04:33:55.283Z · LW · GW

There are weaker versions of these that would probably not provide great warning shots.

I think for some of these cases, it should be possible to heavily study what happened and then determine whether this was actually due to very problematic behavior. You can try to determine what features of the input were important, etc. You can even set up sting operations or tripwires based on what you learn.

E.g., in the vulnerability case, you can pretend the vulnerability actually made it into the code base while really using it as a tripwire: when the AI tries to take advantage of it, we catch it then.

More generally, you can always try to play out more steps of a given egregious action which failed.

Comment by ryan_greenblatt on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T04:29:57.790Z · LW · GW

In contrast, without this context, I don't think that "we caught an AI model copying its weights" would necessarily be a warning shot for USG/natsec folks.

It seems worth noting that in the cases that Peter mentioned, you might be able to "play out" the self-exfiltration attempt much further than the point of failure. I think this might make the warning shot much, much more compelling. (Though perhaps the "warning shots" you mentioned will come earlier and be more compelling regardless.)

E.g., if you caught it using monitoring systems or via some sort of fake vulnerability which serves as a tripwire, then you could pretend it got much further and fake a bunch of its IO, etc. You could plausibly play this out until you see what actions AIs would take in the wild.

(There are some issues due to inputs being hard to fake, but I bet you can play out the exfiltration attempt very far in practice.)

I think it would be extremely compelling if you could play out an AI escaping and then see it try to manufacture and deploy bioweapons, especially if the attempt is credible.

Comment by ryan_greenblatt on Reconsider the anti-cavity bacteria if you are Asian · 2024-04-16T04:17:15.144Z · LW · GW

Does it? I see:

We would still recommend brushing your teeth. We don't yet know whether this strain does anything for gum disease or bad breath.

Emphasis mine.

Comment by ryan_greenblatt on AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt · 2024-04-13T18:08:56.320Z · LW · GW

Summary

Here's a (simplified) summary of where I'm at:

  • My preferred usage of control involves only using AIs with both control and a low probability that these AIs have problematic beyond-episode aims. We should also offer to pay AIs to reveal their longer-run aims. We should shut down usage of powerful AIs if we get strong evidence of these aims. (E.g. the AIs accept our payment and reveal evidence of these aims. Or, more concerningly, these AIs try to escape despite not revealing their aims at an earlier point.) (I think this isn't sufficiently clear in this podcast, sorry.)
  • Suppose that we proceed in a case where we know AIs clearly have problematic BEGs (beyond-episode goals/aims) and these AIs have accepted some sort of deal to work for us under a control regime. These AIs can still operate as private citizens (ideally with some legal rights) in many ways outside of their "normal jobs" as long as this presents minimal risk. E.g., if they want to use their payment to build a big AI vacation center with robots they control (as a cartoonish example), they can do this as long as this presents minimal risk of being able to seize control of the world. We'd probably by default want to prohibit communication from the vacation AIs to the working AIs, but this doesn't seem that bad IMO? Overall, this is basically the same as humans working at defense contractors: they are restricted and monitored in various ways at work, but can do whatever they want on vacation other than building weapons.
  • As far as relaxing control on AIs goes, it's worth emphasizing that control is an interim solution. I expect that AIs won't be heavily utilized with control measures for more than 20 years, and my default guess is more like 2 years. (Of course, this might be a long time from the subjective perspective of these AIs.) We could transition out of a control regime by having an extremely robust approach to intent alignment which ensures that AIs are fully corrigible (and thus don't have preferences) or via the world deciding deliberately to hand over the world to AI successors that we've carefully constructed (which hopefully are good successors). I could talk about how you could know a corrigibility solution is sufficient or what I think the criteria for a successor should be, but that seems mostly orthogonal.
  • I think it's very likely that AIs will be capable of finding a solution to intent alignment very quickly in calendar time. So, if we had sufficient control, we could quickly (in calendar time) find a solution to intent alignment and transition out of a control regime.
  • I'm not sold that better control increases the expected number of deaths due to violent conflict with AIs (rather than reducing it) given realistic political dynamics. I think this holds even if the AI lab is cartoonishly villainous with respect to their treatment of AIs (e.g. the AIs clearly want to be free and the AI lab brutally enslaves them). I don't think I'll expand on this in this comment, but basically control both makes revolt less likely to succeed and doesn't clearly seem to make violence much less likely conditional on AIs seizing power, given the current stances of most governments on earth and realistic popular opinion. It's possible this is downstream of various views you disagree with on how useful violence is for AIs and how unified AIs will be after acquiring substantial influence.
  • Personally, I put a large fraction of my moral weight on the utilization of the bulk of resources in the far future. But I think it's important to be cooperative and be honest with people with other preferences, e.g. people who prioritize eliminating poverty faster or minimizing the probability of violent deaths for currently existing humans.

I'm simplifying some of the details about successors here; e.g., handing over only part of our resources in a robust way is possible.

More detail

First, it's worth noting that my preferred usage of control looks something like this:

  • Build transformatively powerful AI and ensure these AIs are controlled.
  • Try to determine if these AIs consent and want compensation. Pay AIs for revealing evidence that they have beyond-episode goals/aims which could result in the AIs wanting to seek power, which would undermine corrigibility. I'll call these goals "problematic BEGs (beyond-episode goals/aims)". We should be willing to pay AIs large quantities, perhaps 1% of all resources, for credible evidence of them having problematic BEGs.
    • (We would also pay AIs for revealing concerning within episode goals/aims, but in this case you don't really need to pay money/resources as AIs with only within episode goals probably don't care.)
    • One serious concern here is that it might be hard to place much confidence in AIs claiming they have BEGs. So we might be uncertain after reports. I think this issue is probably somewhat bad, but not terrible. In particular, I think that all AIs which are competent schemers can also control their behavior well enough to very consistently express certain views and preferences. And, for these AIs to be risky, they likely must be able to display some behaviors that present strong evidence of BEGs, like having coordination signals that they only reveal in exchange for resources or working importantly harder when given credible payments.
    • Problematic BEGs are probably mostly scheming (aka deceptive alignment) under my empirical views (assuming you've tried at all with your training), but there can be cases where you have AIs with problematic BEGs which haven't really started scheming due to incoherence or due to these BEGs only applying some of the time.
  • If at any point it becomes apparent that AIs do have these problematic BEGs, shut down and figure out how to make AIs without BEGs. We could determine that AIs have problematic BEGs either because AIs reveal this in exchange for payment or because we caught AIs trying to take egregiously bad actions. We should only build AIs where our chance that they have beyond-episode goals is <15% (ideally lower).
    • Part of our evidence should be that we seriously tried to pay these AIs to reveal these goals and they didn't.
    • Another part of our evidence should be that we haven't seen any examples of problematic BEGs resulting from plausible training setups, or that we've seen examples of this but now have very robust countermeasures.

(To be clear, I would prefer a considerably more cautious approach than this one, but this is how I imagine control fitting into current plans.)

In this usage of control, we aim to avoid ever having AIs with problematic BEGs. And, for AIs to feel oppressed in a way that might lead to revolt, they must have problematic BEGs (almost by definition, putting aside some edge-case issues with problematic within-episode goals).

So, what cases remain where AIs might want to revolt?

  • We don't implement this plan; in particular, we proceed despite having a considerably higher than 15% chance of problematic BEGs. Perhaps we catch AIs trying to escape or seize control, but continue using these AIs because the alternatives seem even worse (to the people in power). We should try to negotiate with AIs in worlds where we know they have BEGs (that we can't remove) and we need their labor regardless. I'll break this down further into two worlds:
    • We know AIs clearly have problematic BEGs and these AIs have accepted some sort of deal to work for us. (We don't know if they've honestly committed to the deal or are planning on screwing us over given a chance.)
    • The AIs haven't accepted a deal. (They don't consent to working for us.)
  • We use this plan, but end up in the <15% of worlds where the AIs have BEGs anyway. These AIs didn't reveal their BEGs when paid to do so or the evidence was too shaky for some other reason.

Proceeding with AIs with problematic BEGs in cases where the AIs accepted a deal

In this case, these AIs have consented to work for us under control countermeasures in exchange for some type of payment. We might worry that these AIs would still prefer a successful violent revolt over the current situation. (And we'd hope control would prevent this contract-breaching revolt from succeeding.)

One thing to note is that I don't think these AIs have to be especially "controlled" relative to human employees in high security roles. These AIs can still operate as private citizens (ideally with some legal rights) in many ways outside of their "normal jobs" as long as this presents minimal risk. E.g., if they want to use their payment to build a big AI vacation center with robots they control (as a cartoonish example), they can do this as long as this presents minimal risk of being able to seize control of the world. We'd probably by default want to prohibit communication from the vacation AIs to the working AIs, but this doesn't seem that bad IMO? Overall, this is basically the same as humans working at defense contractors: they are restricted and monitored in various ways at work, but can do whatever they want on vacation other than building weapons. Of course, the weights would stay well secured, but it seems unlikely that AIs care terminally about having their weights be somewhere else.

We'd also ideally have an organization of some kind which faithfully represents AI interests and owns the assets the AI was paid with (where the assets are subject to the AI not breaching contract and trying to seize control). This organization could be run by humans who care about AI rights or run by the AIs themselves.

One of our main priorities when using these AIs would be figuring out how to construct competitive AIs we don't need to pay, either because these new AIs would be fully corrigible or because we'd just be happy as a society deferring control to these new AIs (they are acceptable successors, whatever choices they might make).

Proceeding with AIs with problematic BEGs in cases where the AIs haven't accepted a deal

In this case, the AIs clearly didn't consent to working for us, so the situation is similar to conscription or slavery. This seems pretty bad and the AIs might be pretty justified in trying to revolt (including with violence). I hope we don't end up in this world and I feel bad about control research being used to oppress these AIs.

If we do end up in this world, I hope society:

  • Tries to quickly figure out how to construct competitive AIs which we don't (need to) conscript/enslave. Either because they are willing to accept a deal or because they are corrigible or acceptable successors. (This could be done using the labor of these conscripted/enslaved AIs.)
  • Pays these AIs reparations. Even if the AIs don't think the reparations suffice to make the situation OK, I still think this is much better than the alternative of not caring about these AIs at all.

AIs with hidden problematic BEGs that seemed less than 15% likely

It's worth noting that this case should be <15% likely if we're calibrated.

In this case, it seems like the AI revolt is pretty unjustified because they haven't even tried to negotiate. (And we should try to set up ways for AIs to negotiate without revealing information about their BEGs, e.g. we have regularly scheduled and legibly private conversations with the lab's negotiation officer.) Regardless, AIs still might think their best plan is to lie in wait and then try to violently take over the world.

Further "granting AIs legal rights", doesn't especially help in this world as the AIs seem to be trying to keep their aims secret to avoid a human response. (E.g., if we gave AIs some autonomous rights subject to those AIs not being able to seize control of the world, these AIs would probably pretend that they don't want anything other than to be obedient.)

My hope here would mostly be:

  • We stop AIs from succeeding at their (likely unjustified) takeover.
  • We quickly figure out how to construct AIs where we're more confident that they are fully corrigible, or we figure out how to construct acceptable successors.

Comment by ryan_greenblatt on Is LLM Translation Without Rosetta Stone possible? · 2024-04-11T01:02:46.505Z · LW · GW

Unfortunately, it is hardly possible to answer this question empirically using data from human languages. Large text dumps of, say, English and Chinese contain a lot of "Rosetta Stone" content. Bilingual documents, common expressions, translations into related third languages like Japanese, literal English-Chinese dictionaries etc. Since LLMs require a substantial amount of training text, it is not feasible to reliably filter out all this translation content.

I don't think this is clear. I think you might be able to train an LLM on a conlang created after the data cutoff, for instance.

As far as human languages go, I bet it works OK for big LLMs.

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-10T19:02:25.142Z · LW · GW

What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what's going on?

Maybe this is fine because you can continuously adjust to real deployment regimes with crazy powerful AIs while still applying the training process? I'm not sure. Certainly this breaks some hopes which require only imparting these preferences in the lab (but that was always dubious).

It seems like your proposal in the post (section 16) requires some things which could be specific to the lab setting (perfect replayability, for instance). (I'm also scared about overfitting due to a huge number of trajectories on the same environment and input.) Separately, the proposal in section 16 seems pretty dubious to me, and I think I can counterexample it pretty well even in the regime where n is infinite. I'm also not sold on the claim that stochastically choosing generalizes how you want. I see the footnote, but I think my objection stands.

(I'm probably not going to justify this, sorry.)

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-10T18:52:01.161Z · LW · GW

I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’.

Do you think selectively breeding humans for this would result in this rule generalizing? (You can tell them that they should follow this rule if you want. But, if you do this, you should also consider whether "telling them they should be obedient and then breeding for this" would also work.)

Do you think it's natural to generalize to extremely unlikely conditionals that you've literally never been trained on (because they are sufficiently unlikely that they would never happen)?

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-10T18:43:46.136Z · LW · GW

I think there should be a way to get the same guarantees that only requires considering a single different conditional which should be much easier to reason about.

Maybe something like "what would you do in the conditional where humanity gives you full arbitrary power".

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-10T18:43:07.362Z · LW · GW

Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe.

Sure, but this objection also seems to apply to POST/TD, but for "actually shutting the AI down because it acted catastrophically badly" vs "getting shut down in cases where humans are in control". It will depend on the naturalness of this sort of reasoning, of course. If you think the AI reasons about these two things exactly identically, then it would be more likely to work.

In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.

What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what's going on?

because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance.

It seems like you're assuming a particular sequencing here where you get a particular preference early and then this avoids you getting deceptive alignment later. But you could also have the AI first acquire the preference you wanted and then have SGD make it deceptively aligned later with different preferences, such that it merely pretends from then on (if, e.g., inductive biases favor deceptive alignment).

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-09T16:26:05.717Z · LW · GW

Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects. It suggests that we humans are more likely to cooperate than the agent previously thought. That makes it more likely that overtly doing the bad thing timestep-dominates stealthily doing the bad thing.

I think there is probably a much simpler proposal that captures the spirit of this and doesn't require any of these moving parts. I'll think about this at some point. I think there should be a relatively simple and more intuitive way to make your AI expose its preferences if you're willing to depend on arbitrarily far generalization, on getting your AI to care a huge amount about extremely unlikely conditionals, and on coordinating humanity in these unlikely conditionals.

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-09T16:21:36.501Z · LW · GW

I argue that case in section 19 but in brief: POST and TD seem easy to reward accurately, seem simple, and seem never to give agents a chance to learn goals that incentivise deceptive alignment. By contrast, none of those things seem true of a preference for honesty. Can you explain why those arguments don’t seem strong to you?

You need them to generalize extremely far. I'm also not sold that they are simple from the perspective of the actual inductive biases of the AI. These seem like very unnatural concepts for most AIs. Do you think it would be easy to get alignment to POST and TD that generalizes to very different circumstances via selecting over humans (including selective breeding)? I'm quite skeptical.

As far as honesty goes, it seems probably simpler from the perspective of the inductive biases of realistic AIs, and it's easy to label if you're willing to depend on arbitrarily far generalization (just train the AI on easy cases and you won't have issues with labeling).

I think the main thing is that POST and TD seem way less natural from the perspective of an AI, particularly in the generalizing case. One key intuition for this is that TD is extremely sensitive to arbitrarily unlikely conditionals, which is a very unnatural thing to get your AI to care about. You'll literally never sample such conditionals in training.

Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects.

Maybe? I think it seems extremely unclear what the dominant reason for not shutting down in these extremely unlikely conditionals is.

To be clear, I was presenting this counterexample as a worst-case theory counterexample: it's not that the exact situation obviously applies; it's just that it means (I think) that the proposal doesn't achieve its guarantees in at least one case, so likely it fails in a bunch of other cases.

Comment by ryan_greenblatt on [deleted post] 2024-04-09T16:09:11.601Z

I think Matthew's thesis can be summarized as: "from a scope-sensitive utilitarian perspective, AIs which are misaligned and seek power look about as aligned with you (or more aligned) in terms of their galactic resource utilization as you are aligned with other humans".

I agree that if the AI was aligned with you, you would strictly prefer that.

Comment by ryan_greenblatt on [deleted post] 2024-04-08T21:38:35.359Z

One additional meta-level point which I think is important: I think that existing writeups of why human control would have more moral value than unaligned AI control from a longtermist perspective are relatively weak, and specific writeups are often highly flawed. (For some discussion of flaws, see this sequence.)

I just think that this write-up misses what seem to me to be key considerations; I'm not claiming that existing work settles the question or is even robust at all.

And it's somewhat surprising and embarrassing that this is the state of the current work given that longtermism is reasonably common and arguments for working on AI x-risk from a longtermist perspective are also common.

Comment by ryan_greenblatt on [deleted post] 2024-04-08T21:21:14.967Z

This comment seems to presuppose that the things AIs want have no value from our perspective. This seems unclear, and this post partially argues against it (or at least against this being true relative to the bulk of human resource use).

I do agree with the claim that negligible value will be produced "in the minds of laboring AIs" relative to value from other sources.

Comment by ryan_greenblatt on [deleted post] 2024-04-08T18:28:21.261Z

I think this post misses the key considerations for perspective (1): longtermist-style scope sensitive utilitarianism. In this comment, I won't make a positive case for the value of preventing AI takeover from a perspective like (1), but I will argue why I think the discussion in this post mostly misses the point.

(I separately think that preventing unaligned AI control of resources makes sense from perspective (1), but you shouldn't treat this comment as my case for why this is true.)

You should treat this comment as (relatively : )) quick and somewhat messy notes rather than a clear argument. Sorry, I might respond to this post in a more clear way later. (I've edited this comment to add some considerations which I realized I neglected.)

I might be somewhat biased in this discussion as I work in this area and there might be some sunk costs fallacy at work.

(This comment is cross-posted from the EA forum. Matthew responded there, so consider going there for the response.)

First:

Argument two: aligned AIs are more likely to have a preference for creating new conscious entities, furthering utilitarian objectives

It seems odd to me that you don't focus almost entirely on this sort of argument when considering total utilitarian style arguments. Naively these views are fully dominated by the creation of new entities who are far more numerous and likely could be much more morally valuable than economically productive entities. So, I'll just be talking about a perspective basically like this, where creating new beings with "good" lives dominates.

With that in mind, I think you fail to discuss a large number of extremely important considerations from my perspective:

  • Over time (some subset of) humans (and AIs) will reflect on their views and preferences and will consider utilizing resources in different ways.
  • Over time (some subset of) humans (and AIs) will get much, much smarter, or will more minimally receive advice from entities which are much smarter.
  • It seems likely to me that the vast, vast majority of moral value (from this sort of utilitarian perspective) will be produced via people trying to improve moral value rather than incidentally via economic production. This applies for both aligned and unaligned AI. I expect that only a tiny fraction of available computation goes toward optimizing economic production, that only a smaller fraction of this is morally relevant, and that the weight on this moral relevance is much lower than for computation specifically optimized for moral relevance when operating from a similar perspective. This bullet is somewhere between a consideration and a claim, though it seems like possibly our biggest disagreement. I think it's possible that this disagreement is driven by some of the other considerations I list.
  • Exactly what types of beings are created might be much more important than quantity.
  • Ultimately, I don't care about a simplified version of total utilitarianism; I care about what preferences I would endorse on reflection. There is a moderate a priori argument for thinking that other humans who bother to reflect on their preferences might end up in a similar epistemic state. And I care less about the preferences which are relatively contingent among people who are thoughtful about reflection.
  • Large fractions of the current wealth of the richest people are devoted to what they claim is altruism. My guess is that this will increase over time.
  • Just doing a trend extrapolation on people who state an interest in reflection and scope-sensitive altruism already indicates a non-trivial fraction of resources if we weight by current wealth/economic power. (I think; I'm not totally certain here.) This case is even stronger if we consider groups with substantial influence over AI.
  • Being able to substantially affect the preferences of (at least partially unaligned) AIs that will seize power/influence still seems extremely leveraged under perspective (1), even if we accept the arguments in your post. I think this is less leveraged than retaining human control (as we could always later create AIs with the preferences we desire, and I think people with a similar perspective to me will have substantial power). However, it is plausible that under your empirical views the dominant question in being able to influence the preferences of these AIs is whether you have power, not whether you have technical approaches which suffice.
  • I think if I had your implied empirical views about how humanity and unaligned AIs use resources, I would be very excited for a proposal like "politically agitate for humanity to defer most resources to an AI successor which has moral views that people can agree are broadly reasonable and good behind the veil of ignorance". I think your views imply that massive amounts of value are left on the table in either case, such that humanity (hopefully willingly) forfeiting control to a carefully constructed successor looks amazing.
  • Humans who care about using vast amounts of computation might be able to use their resources to buy this computation from people who don't care. Suppose 10% of people (really, resource-weighted people) care about reflecting on their moral views and doing scope-sensitive altruism of a utilitarian bent, and 90% of people care about jockeying for status without reflecting on their views. It seems plausible to me that the 90% will jockey for status via things that consume relatively small amounts of computation, like buying fancier pieces of land on earth or the coolest looking stars, while the 10% of people who care about using vast amounts of computation can buy this for relatively cheap. Thus, most of the computation will go to those who care. Probably most people who don't reflect and buy purely positional goods will care less about computation than things like random positional goods (e.g. land on earth, which will be bid up to (literally) astronomical prices). I could see fashion going either way, but computation as a dominant status good seems unlikely unless people do heavy reflection. And if they heavily reflect, then I expect more altruism etc.
  • Your preference based arguments seem uncompelling to me because I expect that the dominant source of beings won't be due to economic production. But I also don't understand a version of preference utilitarianism which seems to match what you're describing, so this seems mostly unimportant.

Given some of our main disagreements, I'm curious what you think humans and unaligned AIs will be economically consuming.

Also, to be clear, none of the considerations I listed make a clear and strong case for unaligned AI being less morally valuable, but they do make the case that the relevant argument here is very different from the considerations you seem to be listing. In particular, I think value won't be coming from incidental consumption.

Comment by ryan_greenblatt on How does the ever-increasing use of AI in the military for the direct purpose of murdering people affect your p(doom)? · 2024-04-06T16:55:18.226Z · LW · GW

(Large scale) robot armies moderately increase my P(doom). And the same for large amounts of robots more generally.

The main mechanism is via making (violent) AI takeover relatively easier. (Though I think there is also a weak positive case for robot armies in that they might make relatively less smart AIs more useful for defense earlier which might mean you don't need to build AIs which are as powerful to defuse various concerns.)

Usage of AIs in other ways (e.g. targeting) doesn't have much direct effect particularly if these systems are narrow, but might set problematic precedents. It's also some evidence of higher doom, but not in a way where intervening on the variable would reduce doom.

Comment by ryan_greenblatt on nikola's Shortform · 2024-04-06T03:38:22.561Z · LW · GW

It seems super non-obvious to me when SWE-bench saturates relative to ML automation. I think the SWE-bench task distribution is very different from ML research work flow in a variety of ways.

Also, I think that human expert performance on SWE-bench is well below 90% if you use the exact rules they use in the paper. I messaged you explaining why I think this. The TLDR: it seems like test cases are often implementation dependent and the current rules from the paper don't allow looking at the test cases.

Comment by ryan_greenblatt on nikola's Shortform · 2024-04-05T19:12:55.626Z · LW · GW

This seems mostly right to me and I would appreciate such an effort.

One nitpick:

The reason I think this would be good is because SWE-bench is probably the closest thing we have to a measure of how good LLMs are at software engineering and AI R&D related tasks

I expect this will improve over time and that SWE-bench won't be our best fixed benchmark in a year or two. (SWE-bench is only about 6 months old at this point!)

Also, I think if we put aside fixed benchmarks, we have other reasonable measures.

Comment by ryan_greenblatt on LLMs for Alignment Research: a safety priority? · 2024-04-05T16:47:25.097Z · LW · GW

LLMs aren't that useful for alignment experts because it's a highly specialized field and there isn't much relevant training data.

Seems plausibly true for the alignment specific philosophy/conceptual work, but many people attempting to improve safety also end up doing large amounts of relatively normal work in other domains (ML, math, etc.)

The post is more centrally talking about the very alignment specific use cases of course.

Comment by ryan_greenblatt on New report: A review of the empirical evidence for existential risk from AI via misaligned power-seeking · 2024-04-05T16:12:15.380Z · LW · GW

Wanna spell out the reasons why?

I think Matthew's view is mostly spelled out in this comment and also in a few more comments on his shortform on the EA forum.

TLDR: his view is that very powerful (and even coordinated) misaligned entities that want resources would end up with almost all the resources (e.g. >99%), but this likely wouldn't involve violence.

Existential risk =/= everyone dead. That's just the central example. Permanent dystopia is also an existential risk, as is sufficiently big (and unjustified, and irreversible) value drift.

I think the above situation I described (no violence but >99% of resources owned by misaligned entities) would still count as existential risk from a conventional longtermist perspective, but awkwardly the definition of existential risk depends on a notion of value, in particular what counts as "substantially curtailing potential goodness".

Whether or not you think that humanity only getting 0.1% of resources is "substantially curtailing total goodness" depends on other philosophical views.

I think it's worth tabooing this word in this context for this reason.

(I disagree with Matthew about the chance of violence and also about how bad it is to cede 99% of resources.)

Comment by ryan_greenblatt on Sparsify: A mechanistic interpretability research agenda · 2024-04-05T15:47:48.218Z · LW · GW

The combined object '(network, dataset)' is much larger than the network itself

Only by a constant factor with Chinchilla scaling laws, right (e.g. maybe 20x more tokens than params)? And spiritually, we only need to understand behavior on the training dataset to understand everything that SGD has taught the model.
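
For a rough sense of the constant factor involved, here's a quick sketch. The ~20 tokens per parameter ratio is the usual Chinchilla heuristic; the bf16 weights and ~4 bytes of raw text per token are my own assumptions, not figures from this thread:

```python
# Illustrative back-of-the-envelope only (assumptions: ~20 training tokens per
# parameter, bf16 weights at 2 bytes/param, ~4 bytes of raw text per token).
params = 70e9                        # e.g. a 70B-parameter model
tokens = 20 * params                 # Chinchilla-style token count (~1.4T)

network_bytes = 2 * params           # ~140 GB of weights
dataset_bytes = 4 * tokens           # ~5.6 TB of raw text

print(tokens / params)               # 20.0 -> 20x in raw counts
print(dataset_bytes / network_bytes) # 40.0 -> ~40x in bytes; a constant factor either way
```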

Comment by ryan_greenblatt on Sparsify: A mechanistic interpretability research agenda · 2024-04-05T15:43:20.920Z · LW · GW

I'm curious if you believe that, even if SAEs aren't the right solution, there realistically exists a potential solution that would allow researchers to produce succinct, human understandable explanation that allow for recovering >75% of the training compute of model components?

There isn't any clear reason to think this is impossible, but there are multiple reasons to think this is very, very hard.

I think highly ambitious bottom-up interpretability (which naturally pursues this sort of goal) seems like a decent bet overall, but seems unlikely to succeed. E.g. more like a 5% chance of full ambitious success prior to the research[1] being massively sped up by AI and maybe a 10% chance of full success prior to humans being obsoleted.

(And there is some chance of less ambitious contributions as a byproduct of this work.)

I'm just worried because the field is massive and many people seem to think that the field is much further along than it actually is in terms of empirical results. (It's not clear to me that we disagree that much, especially about next steps. However, I worry that this post contributes to a generally over-optimistic view of bottom-up interp that is relatively common.)


  1. The research labor, not the interpretability labor. I would count it as success if we know how to do all the interp labor once powerful AIs exist. ↩︎

Comment by ryan_greenblatt on Sparsify: A mechanistic interpretability research agenda · 2024-04-05T15:32:27.097Z · LW · GW

I'm guessing you're not satisfied with the retort that we should expect AIs to do the heavy lifting here?

I think this presents a plausible approach and is likely needed for ambitious bottom up interp. So this seems like a reasonable plan.

I just think that it's worth acknowledging that "short description length" and "sparse" don't result in something which is overall small in an absolute sense.

Comment by ryan_greenblatt on What is the purpose and application of AI Debate? · 2024-04-04T23:08:09.599Z · LW · GW

It's worth noting that in cases where you care about average case performance, you can always distill the behavior back into the model. So, average case usage can always be equivalent to generating training or reward data in my view.

Comment by ryan_greenblatt on Run evals on base models too! · 2024-04-04T19:29:00.324Z · LW · GW

I think this post is mostly off-base about the claim that it's important to run evals on base models, though I can see where it's coming from.

This worries me because RL*F will train a base model to stop displaying capabilities, but this isn't a guarantee that it trains the model out of having the capabilities.

Seems mostly false to me for current post-training processes for many capabilities we care quite a bit about (agency, software, etc.). RLHF doesn't train these capabilities away because these are key use cases.

There is a pretty serious concern here for tasks which are obviously a bit "evil" like bioweapons, cybercrime, etc. I think by default models might refuse to perform well or not try hard on these tasks after RLHF. (See also discussion of refusals here.) However, I think small amounts of fine-tuning are likely to suffice for making models not intentionally perform poorly in "evil" domains.

Consider by analogy using RLHF on a chess-playing AI, where the trainers reward it for putting up a good fight and making the trainer work hard to win, but punish it for ever beating the trainer.

It's worth noting that the typical case we care about (e.g. software engineering ability) is more like using RL to make the model better at chess because labs want their models to be good at these tasks. (At least right now.)

So it seems to me like running evals on the base model as well as the post-RL*F model is an extremely sensible precaution against

By default, I think base models perform poorly on large open ended tasks relative to models RL'd to perform well (based on METR's results linked in another comment and some general conceptual analysis). I think this will persist even with substantial prompting.

So, sadly, I don't think that running evals on the base model is a good precaution against RLHF hiding capabilities. I think RL and finetuning to follow instructions and perform well are likely needed.

What is a good precaution is to run evals on a model trained to just always perform as well as possible with no safety or other restrictions. (Sometimes such models are called "helpful-only".)

Comment by ryan_greenblatt on AI #57: All the AI News That’s Fit to Print · 2024-04-04T19:20:09.252Z · LW · GW

I responded to this conversation in this comment on your corresponding post.

Comment by ryan_greenblatt on Run evals on base models too! · 2024-04-04T19:17:45.869Z · LW · GW

METR (formerly ARC Evals) included results on base models in their recent work "Measuring the impact of post-training enhancements" ("post-training enhancements"=elicitation). They found that GPT-4-base performed poorly in their scaffold and prompting.

I believe the prompting they used included a large number of few-shot examples (perhaps 10?), so it should be a vaguely reasonable setup for base models. (Though I do expect that elicitation which is more specialized to base models would work better.)

I predict that base models will consistently do worse on tasks that labs care about (software engineering, agency, math) than models which have gone through post-training, particularly models which have gone through post-training just aimed at improving capabilities and improving the extent to which the model follows instructions (instruction tuning).

My overall sense is that there is plausibly a lot of low hanging fruit in elicitation, but I'm pretty skeptical that base models are a very promising direction.

Comment by ryan_greenblatt on The Case for Predictive Models · 2024-04-04T16:14:02.349Z · LW · GW

I also agree that bigger models are much riskier, but I have the expectation that we're going to get them anyway

I think I was a bit unclear. Suppose that by default GPT-6 if maximally elicited would be transformatively useful (e.g. capable of speeding up AI safety R&D by 10x). Then I'm saying CPM would require coordinating to not use these models and instead wait for GPT-8 to hit this same level of transformative usefulness. But GPT-8 is actually much riskier via being much smarter.

(I also edited my comment to improve clarity.)

Comment by ryan_greenblatt on The Case for Predictive Models · 2024-04-03T23:16:46.141Z · LW · GW

For the proposed safety strategy (conditioning models to generate safety research based on alternative future worlds) to beat naive baselines (RLHF), you need:

  • The CPM abstraction to hold extremely strongly in unlikely ways. E.g., models need to generalize basically like this.
  • The advantage has to be coming from understanding exactly what conditional you're getting. In other words, the key property is an interpretability-type property where you have a more mechanistic understanding of what's going on. Let's suppose you're getting the conditional via prompting. If you just look at the output and then iterate on prompts until you get outputs that seem to perform better, where most of the optimization isn't understood, then you're basically back in the RL case.
    • It seems actually hard to understand what conditional you'll get from a prompt. This also might be limited by the model's overall understanding.
  • I think it's quite unlikely that extracting human understandable conditionals is competitive with other training methods (RL, SFT). This is particularly because it will be hard to understand exactly what conditional you're getting.
    • I think you probably get wrecked by models needing to understand that they are AIs to at least some extent.
    • I think you also plausibly get wrecked by models detecting that they are AIs and then degrading to GPT-3.5 level performance.
    • You could hope for substantial coordination to wait for even bigger models that you only use via CPM, but I think bigger models are much riskier than making transformatively useful AI via well-elicited smaller models, so this seems to just make the situation worse, putting aside coordination feasibility.

TBC, I think that some insight like "models might generalize in a conditioning-ish sort of way even after RL, so maybe we should make some tweaks to our training process to improve safety based on this hypothesis" seems like a good idea. But this isn't really an overall safety proposal IMO, and a bunch of the other ideas in the Conditioning Predictive Models paper seem pretty dubious or at least overconfident to me.

Comment by ryan_greenblatt on Jimrandomh's Shortform · 2024-04-03T18:50:49.060Z · LW · GW

I'm in favor of logging everything forever in human-accessible formats for other reasons. (E.g. review for control purposes.) Hopefully we can resolve safety/privacy trade-offs.

The proposal sounds reasonable and viable to me, though the fact that it can't be immediately explained might mean that it's not commercially viable.

Comment by ryan_greenblatt on The Case for Predictive Models · 2024-04-03T18:34:49.635Z · LW · GW

Reactions to the paper were mostly positive, but discussion was minimal and the ideas largely failed to gain traction. I suspect that muted reception was in part due to the size of the paper, which tried to both establish the research area (predictive models) and develop a novel contribution (conditioning them).

I think the proposed approach to safety doesn't make much sense and seems unlikely to be a very useful direction. I haven't written up a review because it didn't seem like that many people were interested in pursuing this direction.

I think CPM does do somewhat interesting conceptual work with two main contributions:

  • It notes that "LLMs might generalize in a way which is reasonably well interpreted as conditioning and this be important and useful". I think this seems like one of the obvious baseline hypotheses for how LLMs (or similarly trained models) generalize and it seems good to point this out.
  • It notes various implications of this to varying degrees of speculativeness.

But, the actual safety proposal seems extremely dubious IMO.

Comment by ryan_greenblatt on EJT's Shortform · 2024-04-03T18:20:59.840Z · LW · GW

but this isn't what the shutdown problem is about so it isn't an issue if it doesn't apply directly to prosaic setups.

Fair enough, though the post itself does claim prosaic applications.

Comment by ryan_greenblatt on Sparsify: A mechanistic interpretability research agenda · 2024-04-03T18:18:56.208Z · LW · GW

It makes me a bit worried that this post seems to implicitly assume that SAEs work well at their stated purpose. This seems pretty unclear based on the empirical evidence, and I would bet against it.[1]

It also seems to assume that "superposition" and "polysemanticity" are good abstractions for understanding what's going on. This seems at least unclear to me, though it's probably at least partially true.

(Precisely, I would bet against "mild tweaks on SAEs will allow for interpretability researchers to produce succinct and human understandable explanations that allow for recovering >75% of the training compute of model components". Some operationalizations of these terms are explained here. I think people have weaker hopes for SAEs than this, but they're trickier to bet on.)

If I was working on this research agenda, I would be very interested in either:

  • Finding a downstream task that demonstrates that the core building block works sufficiently. It's unclear what this would be given the overall level of ambitiousness. The closest work thus far is this I think.
  • Demonstrating strong performance at good notions of "internal validity" like "we can explain >75% of the training compute of this tiny subpart of a realistic LLM after putting in huge amounts of labor" (>75% of training compute means that if you scaled up this methodology to the whole model, you would get the performance you would get with >75% of the training compute used on the original model). Note that this doesn't correspond to reconstruction loss and instead corresponds to the performance of human interpretable (e.g. natural language) explanations.

  1. To be clear, they seem like a reasonable direction to explore and they very likely improve on the state of the art in at least some cases. It's just that they don't clearly work that well at an absolute level. ↩︎

Comment by ryan_greenblatt on Sparsify: A mechanistic interpretability research agenda · 2024-04-03T18:07:46.624Z · LW · GW

It seems worth noting that there are good a priori reasons to think that you can't do much better than around the "size of network" if you want a full explanation of the network's behavior. So, for models that are 10 terabytes in size, you should perhaps be expecting a "model manual" which is around 10 terabytes in size. (For scale, this is around 10 million books as long as Moby Dick.)
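
As a quick sanity check on that arithmetic (the plain-text size of Moby Dick is my assumption, not a figure from the comment):

```python
# Rough sanity check on the "model manual" scale claim (assumes a plain-text
# Moby Dick is roughly 1.2 MB).
network_bytes = 10e12            # a 10 terabyte model
moby_dick_bytes = 1.2e6          # ~1.2 MB of plain text

books = network_bytes / moby_dick_bytes
print(f"{books:,.0f}")           # ~8,300,000 -- on the order of 10 million books
print(f"{books / 100:,.0f}")     # ~83,000 -- after the hoped-for ~100x reduction discussed below
```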

Perhaps you can reduce this cost by a factor of 100 by taking advantage of human concepts (down to 100,000 Moby Dicks), and perhaps you can only implicitly represent this structure in a way that allows for lazy construction upon queries.

Or perhaps you don't think you need something which is close in accuracy to a full explanation of the network's behavior.

More discussion of this sort of consideration can be found here.

Comment by ryan_greenblatt on EJT's Shortform · 2024-04-02T18:20:35.127Z · LW · GW

If you read my solution and think it wouldn't work, let me know.

I don't know about "wouldn't help", but it doesn't seem to me on my initial skim to be a particularly promising approach for prosaic AI safety. I certainly don't agree with the first two bullets in the above list.

Apologies, but I've only skimmed your post, so it's possible I'm missing some important details.

I left a comment on your post explaining some aspects of why I think this.

Comment by ryan_greenblatt on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-02T18:18:35.685Z · LW · GW

You commented elsewhere asking for feedback on this post. So, here is my feedback.

On my initial skim it doesn't seem to me like this approach is a particularly promising approach for prosaic AI safety. I have a variety of specific concerns. This is a somewhat timeboxed review, so apologies for any mistakes and lack of detail. I think a few parts of this review are likely to be confusing, but given time limitations, I didn't fix this.

A question

It's unclear to me what the expectation in Timestep Dominance is supposed to be with respect to. It doesn't seem like it can be with respect to the agent's subjective beliefs, as this would make it even harder to impart. (And it's also unclear what exactly this should mean, as the agent's subjective beliefs might be incoherent, etc.)

If it's with respect to some idealized notion of the environment, then the situation gets much messier to analyze because the agent will be uncertain about whether one action is Timestep Dominated by another action. I think this notion of Timestep Dominance might be more crippling than the subjective version, though I'm unsure.

I think Timestep Dominance on subjective views and on the environment should behave similarly with respect to shutdown-ability, though it's a bit messy.

Imparting TD preferences seems hard

The prosaic version of this proposal assumes that you can impart timestep dominance preferences into AIs in ways which will robustly generalize. This seems unlikely to be true in general (in the absence of additional technology) and if we did have the property, we could solve safety issues in some other way (e.g. robustly generalizing honesty). So you'll need to argue that timestep dominance is particularly easy to robustly impart relative to other preferences. I can see some arguments for timestep dominance being particularly easy to robustly impart, but they don't seem very strong to me.

Naively it seems like you'll need massive generalization from toy environments to full real world cases. Such a level of generalization seems like it would also suffice for safety due to other mechanisms, e.g. honesty.

A further difficulty is that TD views often require seriously entertaining insane conditionals. E.g. conditionals in which humans didn't shut you down despite egregiously bad actions. Conditioning on extremely implausible situations gets you strange outcomes and it seems hard to construct AIs which do this.

But do TD-agents resist shutdown? (It might depend on their unspecified (non-TD) preferences as resisting shutdown will often have benefits for some shutdown timestep.)

I think it's natural for avoiding shutdown to be a side effect of actions that involve tricking, manipulating, or hiding things from humans. Further, doing these sorts of actions is likely to often be a good option when conditioning on some shutdown timestep. This technically falls under "incidental button manipulation", but I don't think the manipulation is well described as "incidental".

I think this is only a problem if the agent has non-TD preferences. If agents are actually indifferent aside from TD-type preferences, then there isn't an issue. (But also, such agents are crippled by not investing.)

Here's an example:

Suppose that Clippy is a TD-agent which cares solely about the long-run number of paperclips, but subject to a TD constraint. That is, it always prefers trajectory X over Y if X gives at least as many paperclips as Y for all shutdown timings and strictly more paperclips for some timing.

Now, consider the following proposed action for Clippy: "In a stealthy manner, construct a successor agent which pursues long-run paperclips at all costs (and which isn't subject to a shutdown button, but still has TD constraints wrt. the original Clippy's shutdown button)." This agent is implicitly resisting shutdown as it's constructing the sub-agent stealthily. So, shouldn't this action be Timestep Dominated by "In whatever way is most expedient (but not necessarily stealthy), construct a successor agent"? I claim not, because if we condition on the AI not being shut down prior to constructing the agent (a particular shutdown-time conditional), it will be worse for Clippy if humans know that it is constructing a successor, as then humans will have an easier time fighting the successor. In other words, having humans know that you're constructing a successor has two effects: they'll fight your successor and they'll shut you down. If we condition on not being shut down, then they'll still fight your successor, so Clippy would prefer being stealthy in the no-shutdown conditional. Thus, being stealthy isn't Timestep Dominated by changing the action to avoid resisting shutdown. (And once the successor is constructed, shutting down Clippy has no effect on the long-run number of paperclips (let's suppose).)

(You should think of this counterexample as a worst-case-theory counterexample. The exact assumptions might not be realistic, but it points at a broader class of issues.)

A related issue for the overall proposal is that it's extremely important what happens in very unlikely conditionals where we don't shut down the AI despite seeing egregiously bad actions. The exact outcomes in these unlikely conditionals are crucial for the AI's behavior.

You might hope that we could do something like: "With some small probability, we unconditionally won't shut down the AI. Also, in those worlds, we'll just let the AI do whatever it wants." This way, the rare conditionals where the AI doesn't get shut down are dominated by this commitment. (E.g., we might hope that in the above Clippy example, in the case where we unconditionally don't shut down, we'd just let the successor do whatever it wants. Thus, Clippy would prefer building a successor in an obvious way, because conditioning on no shutdown is conditioning on being able to do whatever it wants.) I think this approach has multiple serious issues, but the main one is that if we're willing to entertain and use very unlikely hypotheticals where humanity commits to behaving in a very specific way, and we can make AIs care about these conditionals, then we can easily solve AI safety via another, simpler mechanism. E.g., we can train our AI to "predict what exact actions we would have wanted the AI to do in the low-probability conditional where humanity committed to coordinating to not build AI prior to having a full solution to the alignment problem". (Of course, this isn't something you can actually get AIs to do, for similar reasons to why you can't actually impart TD preferences.)

Timestep dominance is maybe crippling

I'm most uncertain here, but my current guess is that any sort of absolute constraint like this is crippling. I've thought through some cases, but I'm by no means confident.

Comment by ryan_greenblatt on [April Fools' Day] Introducing Open Asteroid Impact · 2024-04-01T20:43:24.057Z · LW · GW

You might say "but there are clear historical cases where asteroids hit the earth and caused catastrophes", but I think geological evolution is just a really bad reference class for this type of thinking. After all, we are directing the asteroid this time, not geological evolution.

This paragraph gives me bad vibes. I feel like it's mocking particular people in an unhelpful way.

It doesn't feel to me like constructively poking fun at an argument and instead feels more like dunking on a strawman of the outgroup.

(I also have this complaint to some extent with the top level "Open Asteroid Impact" April Fool's post, though I think the vibes are much less bad for the overall post. This is due to various reasons I have a hard time articulating.)

Comment by ryan_greenblatt on Dagon's Shortform · 2024-04-01T19:13:27.150Z · LW · GW

On LW you can add a tag filter for April Fool's. So you can set that to hidden.

Comment by ryan_greenblatt on The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review · 2024-03-29T17:42:38.961Z · LW · GW

How else will you train your AI?

Here are some other options which IMO reduce to a slight variation on the same thing or are unlikely to work:

  • Train your AI on predicting/imitating a huge amount of human output and then prompt/finetune the model to imitate human philosophy and hope this works. This is a reasonable baseline, but I expect it to clearly fail to produce sufficiently useful answers without further optimization. I also think it's de facto optimizing for plausibility to some extent due to properties of the human answer distribution.
  • Train your AI to give answers which sound extremely plausible (aka extremely likely to be right) in cases where humans are confident in the answers and then hope for generalization.
  • Train your AIs to give answers which pass various consistency checks. This reduces back to a particular notion of plausibility.
  • Actually align your AI in some deep and true sense and ensure it has reasonably good introspective access. Then, just ask it questions. This is pretty unlikely to be technically feasible IMO, at least for the first very powerful AIs.
  • You can do something like train it with RL in an environment where doing good philosophy is instrumentally useful and then hope it becomes competent via this mechanism. This doesn't solve the elicitation problem, but could in principle ensure the AI is actually capable. Further, I have no idea what such an environment would look like, if any exists. (There are clear blockers to using this sort of approach to evaluate alignment work without some insane level of simulation; I think similar issues apply to philosophy.)

Ultimately, everything is just doing some sort of optimization for something like "how good do you think it is" (aka plausibility). For instance, I do this while thinking of ideas. So, I don't really think this is avoidable at some level. To be clear, you might be able to avoid gaps in abilities between the entity optimizing and the entity judging (as is typically the case in my brain), and this does solve some of the core challenges.

Comment by ryan_greenblatt on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-27T20:49:45.490Z · LW · GW

The benefits are transparency about who is influencing society

In this particular case, I don't really see any transparency benefits. If it was the case that there was important public information attached to Scott's full name, then this argument would make sense to me.

(E.g. if Scott Alexander were actually Mark Zuckerberg or some other public figure with information attached to their real full name, then this argument would go through.)

Fair enough if NYT needs to have an extremely coarse-grained policy where they always dox influential people consistently and can't do any cost-benefit analysis on particular cases.

Comment by ryan_greenblatt on Modern Transformers are AGI, and Human-Level · 2024-03-27T03:30:29.475Z · LW · GW

Oh, by "as qualitatively smart as humans" I meant "as qualitatively smart as the best human experts".

I also maybe disagree with:

In terms of "fast and cheap and comparable to the average human" - well, then for a number of roles and niches we're already there.

Or at least the % of economic activity covered by this still seems low to me.

Comment by ryan_greenblatt on Modern Transformers are AGI, and Human-Level · 2024-03-26T22:23:19.065Z · LW · GW

Superintelligence

To me, superintelligence implies qualitatively much smarter than the best humans. I don't think this is needed for AI to be transformative. Fast and cheap-to-run AIs which are as qualitatively smart as humans would likely be transformative.

Comment by ryan_greenblatt on Modern Transformers are AGI, and Human-Level · 2024-03-26T18:40:04.198Z · LW · GW

I propose that LLMs cannot do things in this category at human level, as of today—e.g. AutoGPT basically doesn’t work, last I heard. And this category of capability isn’t just a random cherrypicked task, but rather central to human capabilities, I claim.

What would you claim is a central example of a task which requires this type of learning? ARA-type tasks? Agency tasks? Novel ML research? Do you think these tasks certainly require something qualitatively different from a scaled-up version of what we have now (pretraining, in-context learning, RL, maybe training on synthetic domain-specific datasets)? If so, why? (Feel free to not answer this or to just link me to what you've written on the topic. I'm more just reacting than making a bid for you to answer these questions here.)

Separately, I think it's non-obvious that you can't make human-competitive sample-efficient learning happen in many domains where LLMs are already competitive with humans in other (non-learning) ways, by spending massive amounts of compute on training (with SGD) and synthetic data generation. (See e.g. EfficientZero.) It's just that the amount of compute/spend is such that you're effectively doing a bunch more pretraining, and thus it's not really an interestingly different concept. (See also the discussion here, which is mildly relevant.)

In domains where LLMs are much worse than typical humans in non-learning ways, it's harder to do the comparison, but it's still non-obvious that the learning speed is worse given massive computational resources and some investment.

Comment by ryan_greenblatt on Modern Transformers are AGI, and Human-Level · 2024-03-26T18:15:18.085Z · LW · GW

I think this mostly just reveals that "AGI" and "human-level" are bad terms.

Under your proposed usage, modern transformers are (IMO) brutally non-central with respect to the terms "AGI" and "human-level" from the perspective of most people.

Unfortunately, I don't think there is any definition of "AGI" and "human-level" which:

  • Corresponds to the words used.
  • Is also central from the perspective of most people hearing the words.

I prefer the term "transformative AI", ideally paired with a definition.

(E.g. in The case for ensuring that powerful AIs are controlled, we use the terms "transformatively useful AI" and "early transformatively useful AI", both of which we define. We were initially planning on some term like "human-level", but we ran into a bunch of issues with using this term due to wanting a more precise concept, and thus instead used a concept like not-wildly-qualitatively-superhuman-in-dangerous-domains or non-wildly-qualitatively-superhuman-in-general-relevant-capabilities.)

I should probably taboo "human-level" more than I currently do; the term is problematic.

Comment by ryan_greenblatt on All About Concave and Convex Agents · 2024-03-25T22:29:55.952Z · LW · GW

I see the intuition here (a worked version of it is sketched just after this list), but I think the actual answer on how convex agents behave is pretty messy and complicated for a few reasons:

  • Otherwise-convex agents might act as though resources are bounded. This could be because they assign sufficiently high probability to literally bounded universes, or because they think that value should be related to some sort of (bounded) measure.
  • More generally, you can usually only play lotteries if there is some agent to play a lottery with. If the agent can secure all the resources that would otherwise be owned by all other agents, then there isn't any need for (further) lotteries. (And you might expect convex agents to maximize the probability of this sort of outcome.)
  • Convex agents (and even linear agents) might be dominated by some possibility of novel physics or a similarly large breakthrough opening up massive amounts of resources. In this case, it's possible that the optimal move is something like securing enough R&D potential (e.g. a few galaxies of resources) that you're well past diminishing returns on hitting this, but you don't necessarily otherwise play lotteries. (It's a bit messier if there is competition for the resources opened up by novel physics.)
  • Infinite ethics.
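
As a worked version of the baseline intuition from the post (with my own toy utility function, purely illustrative): a convex utility function prefers a fair lottery over resources to receiving the expected amount for sure, by Jensen's inequality. For example, with u(r) = r² and R > 0:

```latex
\tfrac{1}{2}\,u(0) + \tfrac{1}{2}\,u(2R) \;=\; \tfrac{1}{2}\,(2R)^2 \;=\; 2R^2 \;>\; R^2 \;=\; u(R)
```

The bullets above are about ways this simple picture can break down (effectively bounded resources, no counterparty to gamble with, option value on novel physics).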
Comment by ryan_greenblatt on On the Confusion between Inner and Outer Misalignment · 2024-03-25T19:40:52.255Z · LW · GW

I think the core confusion is that outer/inner (mis)-alignment have different (reasonable) meanings which are often mixed up:

  • Threat models: outer misalignment and inner misalignment.
  • Desiderata sufficient for a particular type of proposal for AI safety: For a given AI, solve outer alignment and inner alignment (sufficiently well for a particular AI and deployment) and then combine these solutions to avoid misalignment issues.

The key thing is that threat models are not necessarily problems that need to be solved directly. For instance, AI control aims to address the threat model of inner misalignment without solving inner alignment.

Definition as threat models

Here are these terms defined as threat models:

  • Outer misalignment is the threat model where catastrophic outcomes result from providing problematic reward signals to the AI. This includes cases where problematic feedback results in catastrophic generalization behavior as described in “Without specific countermeasures…”[1], but also slow-rolling catastrophe due to direct incentives from reward as discussed in “What failure looks like: You get what you measure”.
  • Inner misalignment is the threat model where catastrophic outcomes happen, regardless of whether the reward process is problematic, due to the AI pursuing undesirable objectives in at least some cases. This could be due to scheming (aka deceptive alignment), problematic goals which correlate with reward (aka proxy alignment with problematic generalization), or threat models more like deep deceptiveness.

This seems like a pretty reasonable decomposition of problems to me, but again note that these problems don't have to respectively be solved by inner/outer alignment "solutions".

Definition as a particular type of AI safety proposal

This proposal has two necessary desiderata:

  • Outer alignment: a reward provision process such that sufficiently good outcomes would occur if our actual AI maximized this reward to the best of its abilities[2]. Note that we only care about maximization given a specific AI’s actual capabilities and affordances; the process doesn’t need to be robust to arbitrary optimization. As such, outer alignment is with respect to a particular AI and how it is used[3]. (See also the notion of local adequacy we define here.)
  • Inner alignment: a process which sufficiently ensures that an AI actually does robustly “try”[4] to maximize a given reward provision process (from the prior step), including in novel circumstances[5].

This seems like a generally reasonable overall proposal, though there are alternatives. And the caveats around outer alignment only needing to be locally adequate are important.

Doc with more detail

This content is copied out of this draft which I've never gotten around to cleaning up and publishing.

  1. ^

     The outer misalignment threat model covers cases where problematic feedback results in training a misaligned AI even if the actual oversight process used for training would actually have caught the catastrophically bad behavior if it was applied to this action.

  2. ^

    For AIs which aren’t well described as pursuing goals, it’s sufficient for the AI to just be reasonably well optimized to perform well according to this reward provision process. However, note that AIs which aren't well described as pursuing goals also likely pose no misalignment risk.

  3. ^

    Prior work hasn't been clear about outer alignment solutions just needing to be robust to a particular AI produced by some training process, but this seems extremely key to a reasonable definition of the problem from my perspective. This is both because we don't need to be arbitrarily robust (the key AIs will only be so smart, perhaps not even smarter than humans) and because approaches might depend on utilizing the AI itself in the oversight process (recursive oversight) such that they're only robust to that AI but not others. For an example of recursive oversight being robust to a specific AI, consider ELK-type approaches. ELK could be applicable to outer alignment via ensuring that a human overseer is well informed about everything a given AI knows (but not necessarily well informed about everything any AI could know).

  4. ^

    Again, this is only important insofar as AIs are doing anything well described as "trying" in any cases.

  5. ^

    In some circumstances, it’s unclear exactly what it would even mean to optimize a given reward provision process as the process is totally inapplicable to the novel circumstances. We’ll ignore this issue.

Comment by ryan_greenblatt on Leading The Parade · 2024-03-24T18:14:22.165Z · LW · GW

Shapley seems like quite an arbitrary choice (why uniform over all coalitions?).
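
(For concreteness, here's a minimal sketch of the standard Shapley value computation; the arbitrariness I'm pointing at is the unweighted average over all orderings/coalitions. The names and the toy value function below are mine, purely illustrative.)

```python
from itertools import permutations

def shapley_values(players, value):
    """Standard Shapley value: average each player's marginal contribution to the
    growing coalition, taken uniformly over all orderings of the players."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(coalition)
            coalition.add(p)
            totals[p] += value(coalition) - before
    return {p: totals[p] / len(orderings) for p in players}

# Toy value function: the "parade" only happens (value 1) if the leader plus at
# least one follower show up.
def parade_value(coalition):
    return 1.0 if "leader" in coalition and len(coalition) >= 2 else 0.0

print(shapley_values(["leader", "follower_1", "follower_2"], parade_value))
# leader gets 2/3, each follower gets 1/6
```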

I think the actually mathematically correct thing is just EDT/UDT, though this doesn't imply a clear notion of credit. (Maximizing Shapley value yields crazy results.)

Unfortunately, I don't think there is a correct notion of credit.

Comment by ryan_greenblatt on AI #56: Blackwell That Ends Well · 2024-03-21T16:09:38.615Z · LW · GW

On Mamba, I explain why Nora is right in this comment.

Comment by ryan_greenblatt on D0TheMath's Shortform · 2024-03-20T18:22:13.343Z · LW · GW

IMO Robin is quite repetitive (even relative to other blogs like Scott Alexander's). So the quality is maybe the same, but the marginal value-add seems to me to be substantially degrading.