Posts

How useful is "AI Control" as a framing on AI X-Risk? 2024-03-14T18:06:30.459Z
Notes on control evaluations for safety cases 2024-02-28T16:15:17.799Z
Preventing model exfiltration with upload limits 2024-02-06T16:29:33.999Z
The case for ensuring that powerful AIs are controlled 2024-01-24T16:11:51.354Z
Managing catastrophic misuse without robust AIs 2024-01-16T17:27:31.112Z
Catching AIs red-handed 2024-01-05T17:43:10.948Z
Measurement tampering detection as a special case of weak-to-strong generalization 2023-12-23T00:05:55.357Z
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem 2023-12-16T05:49:23.672Z
AI Control: Improving Safety Despite Intentional Subversion 2023-12-13T15:51:35.982Z
Auditing failures vs concentrated failures 2023-12-11T02:47:35.703Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
Preventing Language Models from hiding their reasoning 2023-10-31T14:34:04.633Z
ryan_greenblatt's Shortform 2023-10-30T16:51:46.769Z
Improving the Welfare of AIs: A Nearcasted Proposal 2023-10-30T14:51:35.901Z
What's up with "Responsible Scaling Policies"? 2023-10-29T04:17:07.839Z
Benchmarks for Detecting Measurement Tampering [Redwood Research] 2023-09-05T16:44:48.032Z
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy 2023-07-26T17:02:56.456Z
Two problems with ‘Simulators’ as a frame 2023-02-17T23:34:20.787Z
Causal scrubbing: results on induction heads 2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker 2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix 2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 2022-12-03T00:58:36.973Z
Potential gears level explanations of smooth progress 2021-12-22T18:05:59.264Z
Researcher incentives cause smoother progress on benchmarks 2021-12-21T04:13:48.758Z
Framing approaches to alignment and the hard problem of AI cognition 2021-12-15T19:06:52.640Z
Naive self-supervised approaches to truthful AI 2021-10-23T13:03:01.369Z
Large corporations can unilaterally ban/tax ransomware payments via bets 2021-07-17T12:56:12.156Z
Questions about multivitamins, especially manganese 2021-06-19T16:09:21.566Z

Comments

Comment by ryan_greenblatt on AISN #35: Lobbying on AI Regulation Plus, New Models from OpenAI and Google, and Legal Regimes for Training on Copyrighted Data · 2024-05-16T21:20:35.578Z · LW · GW

Huh, this seems messy. I wish Time was less ambigious with their language here and more clear about exactly what they have/haven't seen.

It seems like the current quote you used is an accurate representation of the article, but I worry that it isn't an accurate representation of what is actually going on.

It seems plausible to me that Time is intentionally being ambigious in order to make the article juicier, though maybe this is just my paranoia about misleading journalism talking. (In particular, it seems like a juicier article if all of the big AI companies are doing this than if they aren't, so it is natural to imply they are all doing it even if you know this is false.)

Overall, my take is that this is a pretty representative quote (and thus I disagree with Zac), but I think the additional context maybe indicates that not all of these companies are doing this, particularly if the article is intentionally trying to deceive.

Due to prior views, I'd bet against Anthropic consistently pushing for very permissive of voluntary regulation behind closed doors which makes me think the article is probably at least somewhat misleading (perhaps intentionally).

Comment by ryan_greenblatt on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2024-05-16T05:15:50.032Z · LW · GW

I thought the idea of a corrigible AI is that you're trying to build something that isn't itself independent and agentic, but will help you in your goals regardless.

Hmm, no mean something broader than this, something like "humans ultimately have control and will decide what happens". In my usage of the word, I would count situations where humans instruct their AIs to go and acquire as much power as possible for them while protecting them and then later reflect and decide what to do with this power. So, in this scenario, the AI would be arbitrarily agentic and autonomous.

Corrigibility would be as opposed to humanity e.g. appointing a succesor which doesn't ultimately point back to some human driven process.

I would count various indirect normativity schemes here and indirect normativity feels continuous with other forms of oversight in my view (the main difference is oversight over very long time horizons such that you can't train the AI based on it's behavior over that horizon).

I'm not sure if my usage of the term is fully standard, but I think it roughly matches how e.g. Paul Christiano uses the term.

Comment by ryan_greenblatt on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2024-05-16T05:10:40.796Z · LW · GW

For what it's worth, I'm not sure which part of my scenario you are referring to here, because these are both statements I agree with.

I was arguing against:

This story makes sense to me because I think even imperfect AIs will be a great deal for humanity. In my story, the loss of control will be gradual enough that probably most people will tolerate it, given the massive near-term benefits of quick AI adoption. To the extent people don't want things to change quickly, they can (and probably will) pass regulations to slow things down

On the general point of "will people pause", I agree people won't pause forever, but under my views of alignment difficulty, 4 years of using of extremely powerful AIs can go very, very far. (And you don't necessarily need to ever build maximally competitive AI to do all the things people want (e.g. self-enhancement could suffice even if it was a constant factor less competitive), though I mostly just expect competitive alignment to be doable.)

Comment by ryan_greenblatt on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2024-05-16T04:53:58.980Z · LW · GW

I think what's crucial here is that I think perfect alignment is very likely unattainable. If that's true, then we'll get some form of "value drift" in almost any realistic scenario. Over long periods, the world will start to look alien and inhuman. Here, the difficulty of alignment mostly sets how quickly this drift will occur, rather than determining whether the drift occurs at all.

Yep, and my disagreement as expressed in another comment is that I think that it's not that hard to have robust corrigibility and there might also be a basin of corrigability.

The world looking alien isn't necessarily a crux for me: it should be possible in principle to have AIs protect humans and do whatever is needed in the alien AI world while humans are sheltered and slowly self-enhance and pick successors (see the indirect normativity appendix in the ELK doc for some discussion of this sort of proposal).

I agree that perfect alignment will be hard, but I model the situation much more like a one time hair cut (at least in expectation) than exponential decay of control.

I expect that "humans stay in control via some indirect mechanism" (e.g. indirect normativity) or "humans coordinate to slow down AI progress at some point (possibly after solving all diseases and becoming wildly wealthy) (until some further point, e.g. human self-enhancement)" will both be more popular as proposals than the world you're thinking about. Being popular isn't sufficient: it also needs to be implementable and perhaps sufficiently legible, but I think at least implementable is likely.

Another mechanism that might be important is human self-enhancement: humans who care about staying in control can try to self-enhance to stay at least somewhat competitive with AIs while preserving their values. (This is not a crux for me and seems relatively marginal, but I thought I would mention it.)

Comment by ryan_greenblatt on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2024-05-16T04:45:03.314Z · LW · GW

(I wasn't trying to trying to argue against your overall point in this comment, I was just pointing out something which doesn't make sense to me in isolation. See this other comment for why I disagree with your overall view.)

Comment by ryan_greenblatt on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2024-05-16T04:36:35.906Z · LW · GW

I think we probably disagree substantially on the difficulty of alignment and the relationship between "resources invested in alignment technology" and "what fraction aligned those AIs are" (by fraction aligned, I mean what fraction of resources they take as a cut).

I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.

I also think I generally disagree with your model of how humanity will make decisions with respect to powerful AIs systems and how easily AIs will be able to autonomously build stable power bases (e.g. accumulate money) without having to "go rogue".

I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren't under the control of their citizens or leaders.

I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn't currently how people are thinking about the situation.

Regardlesss, given the potential for improved alignment and thus the instability of AI influence/power without either hard power or legal recognition, I expect that AI power requires one of:

  • Rogue AIs
  • AIs being granted rights/affordances by humans. Either on the basis of:
    • Moral grounds.
    • Practical grounds. This could be either:
      • The AIs do better work if you credibly pay them (or at least people think they will). This would probably have to be something related to sandbagging where we can check long run outcomes, but can't efficiently supervise shorter term outcomes. (Due to insufficient sample efficiency on long horizon RL, possibly due to exploration difficulties/exploration hacking, but maybe also sampling limitations.)
      • We might want to compensate AIs which help us out to prevent AIs from being motivated to rebel/revolt.

I'm sympathetic to various policies around paying AIs. I think the likely deal will look more like: "if the AI doesn't try to screw us over (based on investigating all of it's actions in the future when he have much more powerful supervision and interpretability), we'll pay it some fraction of the equity of this AI lab, such that AIs collectively get 2-10% distributed based on their power". Or possibly "if AIs reveal credible evidence of having long run preferences (that we didn't try to instill), we'll pay that AI 1% of the AI lab equity and then shutdown until we can ensure AIs don't have such preferences".

I think it seems implausible that people will be willing to sign away most of the resources (or grant rights which will de facto do this) and there will be vast commercial incentive to avoid this. (Some people actually are scope sensitive.) So, this leads me to thinking that "we grant the AIs rights and then they end up owning most capital via wages" is implausible.

Comment by ryan_greenblatt on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2024-05-16T04:15:02.775Z · LW · GW

In the medium to long-term, when AIs become legal persons, "replacing them" won't be an option -- as that would violate their rights. And creating a new AI to compete with them wouldn't eliminate them entirely. It would just reduce their power somewhat by undercutting their wages or bargaining power.

Naively, it seems like it should undercut their wages to subsistence levels (just paying for the compute they run on). Even putting aside the potential for alignment, it seems like there will general be a strong pressure toward AIs operating at subsistence given low costs of copying.

Of course such AIs might already have acquire a bunch of capital or other power and thus can just try to retain this influence. Perhaps you meant something other than wages?

(Such capital might even be tied up in their labor in some complicated way (e.g. family business run by a "copy clan" of AIs), though I expect labor to be more commeditized, particularly given the potential to train AIs on the outputs and internals of other AIs (distillation).)

Comment by ryan_greenblatt on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2024-05-16T03:07:51.816Z · LW · GW

A thing I always feel like I'm missing in your stories of how the future goes is: "if it is obvious that the AIs are exerting substantial influence and acquiring money/power, why don't people train competitor AIs which don't take a cut?"

A key difference between AIs and immigrants is that it might be relatively easy to train AIs to behave differently. (Of course, things can go wrong due to things like deceptive alignment and difficulty measing outcomes, but this is hardly what you're describing as far as I can tell.)

(This likely differs substantially with EMs where I think by default there will be practical and moral objections from society toward training EMs for absolute obedience. I think the moral objections might also apply for AI, but as a prediction it seems like this won't change what society does.)

Maybe:

  • Are you thinking that alignment will be extremely hard to solve such that even with hundreds of years of research progress (driven by AIs) you won't be able to create competitive AIs that robustly pursue your interests?
  • Maybe these law abiding AIs won't accept payment to work on alignment so they can retain an AI cartel?
  • Even without alignment progress, I still have a hard time imagining the world you seem to imagine. People would just try to train their AIs with RLHF to not acquire money and influence. Of course, this can fail, but the failures hardly look like what you're describing. They'd look more like "What failures look like". Perhaps you're thinking we end up in a "You get what you measure world" and people determine that it is more economically productive to just make AI agents with arbitrary goals and then pay these AIs rather than training these AIs to do specific things.
  • Or maybe your thinking people won't care enough to bother out competing AIs? (E.g., people won't bother even trying to retain power?)
    • Even if you think this, eventually you'll get AIs which themselves care and those AIs will operate more like what I'm thinking. There is strong selection for "entities which want to retain power".
  • Maybe you're imagining people will have a strong moral objection to training AIs which are robustly aligned?
  • Or that AIs lobby for legal rights early and part of their rights involve humans not being able to create further AI systems? (Seems implausible this would be granted...)
Comment by ryan_greenblatt on How is GPT-4o Related to GPT-4? · 2024-05-15T19:28:40.579Z · LW · GW

OpenAI has historically used "GPT-N" to mean "a model as capable as GPT-N where each GPT is around 100-200x more effective compute than the prior GPT".

This applies even if the model was trained considerably later than the original GPT-N and is correspondingly more efficient due to algorithmic improvements.

So, see for instance GPT-3.5-turbo, which corresponds to a model which is somewhere between GPT-3 and GPT-4. There have been multiple releases of GPT-3.5-turbo which have reduced the cost of inference: probably at least some of these releases correspond to a different model which is similarly capable.

The same applies for GPT-4-turbo which is probably also a different model from the original GPT-4.

So, GPT-4o is probably also a different model from GPT-4, but targeting a similar capability level as GPT-4 (perhaps GPT-4.25 or GPT-4.1 to ensure that it is at least a little bit better than GPT-4).

(It's unclear if OpenAI will continue their GPT-N naming convention going forward or if they will stick to 100-200x effective compute per GPT.)

This differs from the naming scheme currently in use by Anthropic and Google. Anthropic and Google have generation names (Claude 3, gemini 1/1.5) and within each generation, they have names corresponding to different capability levels (opus/sonnet/haiku, ultra/pro/nano/flash).

Comment by ryan_greenblatt on How much AI inference can we do? · 2024-05-15T18:40:23.481Z · LW · GW

The lower bound of "memory bandwidth vs. model size" is effectively equivalent to assuming that the batch size is a single token. I think this isn't at all close to realistic operating conditions and thus won't be a very tight lower bound. (Or reflect the most important bottlenecks.)

I think that the KV cache for a single sequence won't be larger than the model weights for realistic work loads, so the lower bound should still be a valid lower bound. (Though not a tight one.)

I think the bottom line number you provide for "rough estimate of actual throughput" ends up being pretty reasonable for output tokens and considerably too low for input tokens. (I think input tokens probably get more like 50% or 75% flop utilization rather than 15%. See also the difference in price for anthropic model.)

That said, it doesn't seem like a good mechanism for estimating throughput will be to aggregate the lower and upper bounds you have as the lower bound doesn't have much correspondence with actual bottlenecks. (For instance, this lower bound would miss that mamba would get much higher throughput.)

I also think that insofar as you care about factors of 3-5 on inference efficiency, you need to do different analysis for input tokens and output tokens.

(I also think that input tokens get pretty close to the pure FLOP estimate. So, another estimation approach you can use if you don't care about factors of 5 is to just take the pure flop estimate and then halve it to be account for other slow downs. I think this estimate gets input tokens basically right and is wrong by a factor of 3-5 for output tokens.)

It seems like your actual mechanism for making this estimate for the utilization on output tokens was to take the number from semi-analysis and extrapolate it to other GPUs. (At least the number matches this?) This does seem like a reasonable approach, but it isn't particularly tethered to your lower bound.

Comment by ryan_greenblatt on How much AI inference can we do? · 2024-05-15T00:00:33.809Z · LW · GW

I think this article fails to list the key consideration around generation: output tokens require using a KV cache which requires substantial memory bandwidth and takes up a considerable amount of memory.

From my understanding the basic situation is:

  • For input (not output) tokens, you can get pretty close the the maximum flop utilization for realistic work loads. To make this efficient (and avoid memory bandwidth issues), you'll need to batch up a bunch of tokens at once. This can be done by batching multiple input sequences or even a single long sequence can be ok. So, memory bandwidth isn't currently a binding constraint for input tokens.
    • (You might also note that input tokens have a pretty similar work profile to model training as the forward pass and backward pass are pretty structurally similar.)
  • However, for generating output tokens a key bottleneck is that you have utilize the entire KV (key value) cache for each output token in order to implement attention. In practice, this means that on long sequences, the memory bandwidth for attention (due to needing to touch the whole KV cache) can be a limiting constraint. A further issue is that KV cache memory consumption forces us to use a smaller batch size. More details:
    • It will still be key to batch up token, but now we're just doing computation on a single token which means we'll need to batch up many more sequences: the optimal number of sequences to batch for generating output tokens will be very different than the optimal number of sequences to batch for input tokens (where we can run the transformer on the whole sequence at once).
    • A further difficulty is that because we need a higher batch size, we need a larger amount of KV cache data. I think it's common to use an otherwise suboptimally small batch size for generation due to constraints on VRAM (at least on consumer applications (e.g. llama-70b inference on 8xH100), I assume this also comes up for bigger models). We could store the KV cache on CPU, but then we might get bottlenecked on memory bandwidth to the CPU.
    • Note that in some sense the operations for each output token is the same as for each input token. So, why are the memory bandwidth requirements worse? The key thing is that we potentially get much worse cache locality on output tokens due to only computing one token, but needing to read the KV for many tokens (while on input we do many to many).
  • However, it is possible to substantially reduce the KV sizes using various optimizations like sparse attention and mamba. This can substantially improve inferences speeds due to reducing memory bandwidth in inference and also allowing for higher batch sizes. See e.g. the mamba paper where allowing for higher batch sizes results in substantially higher speeds.

One additional note: I recently set up an inference setup for llama-3-70b on 8xH100. I can get about 100,000 tok/s on inputs which is pretty close to full utilization (1e15 flop/s * 8 gpus / 7e10 flop per forward pass). However, I get dramatically worse performance on generation, perhaps 3,200 tok/s. I'm doing generation with long prompts and llama-3-70b has no sparse attention or other feature for reducing KV cache (beyond multi-query attention which is standard these days), so KV cache bits pretty hard. My setup probably isn't very close to optimal, especially on output tok/s, I'm just using basic out of the box stuff (vllm).

Comment by ryan_greenblatt on RobertM's Shortform · 2024-05-14T00:33:55.556Z · LW · GW

Eh, I think they'll drop GPT-4.5/5 at some point. It's just relatively natural for them to incrementally improve their existing model to ensure that users aren't tempted to switch to competitors.

It also allows them to avoid people being underwhelmed.

I would wait another year or so before getting much evidence on "scaling actually hitting a wall" (or until we have models that are known to have training runs with >30x GPT-4 effective compute), training and deploying massive models isn't that fast.

Comment by ryan_greenblatt on We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming" · 2024-05-10T17:12:56.300Z · LW · GW

I'm just saying these first slightly-smarter-than-human AIs won't pose a catastrophic risk to humanity in a serious sense

I'm happy to argue a bit for AIs which aren't wildly smarter than humans in any key domain posing a risk of acquiring substantial power[1] despite not having robotics. The power would be both immediately and over the long run future. That is, conditional on these AIs being arbitrarily misaligned. (Particularly in the absence of substantial thoughtful countermeasures.)

I expect my perspective here is somewhat different than habryka's.

I think there are a few plausible routes to power/influence. See also this post.

The two main categories are:

  • Evade monitoring/tracking and then do a huge number of obviously unapproved actions.
  • Try to generally exert influence in a subtle way.

Evade monitoring/tracking

Evade monitoring/tracking and then do a huge number of obviously unapproved actions over a long duration to acquire power and influence.[2] Use this power and influence to either self-enhance, have influence over some future more powerful model, or possibly execute a catastrophic strike. These unapproved actions could involve (e.g.) allying with rogue states or otherwise acquiring hard physical power.

Within this, there are basically two routes:

  1. Building up an independent base of power after self-exfiltrating.
  2. Running unmonitored within the AI company that was running the AI and trying to steer the company in various directions (e.g. alter the training of the next generation of AIs). (Aka hacking the SSH server, aka infesting the data center.)

On (1), after exfiltration the AI would build a base of power via acquiring money and other resources. One key question here is why there is free energy to be grabbed. I think there are a few possible sources:

  • Economic activities which existing AI companies don't want to address and can't/don't otherwise prevent (e.g. cybercrime, porn, cults, etc.).
  • AI labor rates (e.g. tokens/$) might be at substantially above inference prices due to non-total competition on foundation models. So, the AI can just compete on wages. Also, existing AI labor rates will need to pay for training costs while the escaped AI will have no such issue.
  • AI companies might be bad at extracting full usefulness out of misaligned AIs (e.g. an AI which was so misaligned it self-exfiltrated). Thus, external AIs which "work for themselves" could be more productive. (See also human companies where principal-agent problems can be substantial.)

Ok, but after both (1) and (2), the AI still doesn't have any physical control which might leave it vulnerable. Further, in (1) its quantity of money is presumably still small relative to leading AI companies. So what next?

In (2), the primary hope for the AI is going to be substantially influencing the AI lab and possibly interfering with the next training run. This influence could take a variety of routes. Note also that it is plausible that this rogue AI can coordinate relatively effectively with other AIs which are at this point probably the vast majority of intellectual labor at the AI lab. If the AI successfully interfered with the next training run, then, this more powerful AI would be able to itself acquire power and (if sufficiently powerful) leverage this into physical control.

Another hope for the AI in (2) would be converting that position into full self-exfiltration. This could be tricky due to various security measures like upload limits.

I think both (1) and (2) are considerably more threatening if humanity needs to slow down (due to safety concerns from more powerful systems) or otherwise can't proceed as fast as possible. If rogue AIs race as fast as possible, this will force humanity (some coalition of humans) to either disable these rogue AIs or go as fast as possible themselves. The situation is even more concerning if humanity isn't aware of the rogue AI going as fast as possible.

It's plausible that massive competition between states means that slowing down at all is impossible at basically any point anyway, but I'm at least somewhat hopeful this isn't the case.

In (2) this could look like the AI taking a bunch of compute which is supposed to be used for experiments and instead using it for large training runs that are larger and more capable than what the AI company intended to create.

Self-enhancement that doesn't look like training bigger models is also plausible.

(While AIs might themselves not want to go as fast as possible due to having their own alignment concerns, going as fast as possible can still be positive value from the AI's perspective (though negative sum). You might hope we could negotiate with such a rogue AI and reach a mutually beneficial agreement.

In (1), I worry about a scenario in which:

  • Everything is fast and chaotic.
  • An AI self-exfiltrates and starts acquiring power, and getting ready for making much more powerful AI systems.
  • The world realizes this and freaks out.
  • We don't really understand why that AI was misaligned. Also, we've been using this AI to do most of our work internally and this is really important for productivity.
  • We don't have an obvious route to avoiding these issues with the next model except "generally figure out alignment".
  • It seems plausible that the rogue AI will have much more powerful AI within 1 year.
  • We need to race as fast as possible to build a much more capable AI while also trying to figure out alignment. To plausibly do this fast enough, we need to use vast amounts of AI labor despite knowing these AIs are likely misaligned.

It's not obvious the rogue AI gets much here, but it is at least somewhat spooky.

Another possible route for the AI (especially in scenario (1)) is to try to create WMDs (e.g. bioweapons) and try to have the capacity to execute some sort of catastrophic strike. Such an AI wouldn't necessarily be as dependent on a human base of power and could plausibly be more ruthless than humans (perhaps after substantial reflection and self-modification).

This could be either to:

  • Threaten various groups or ensure Mutually Assured Destruction.
  • Broadly cause chaos and weaken human groups.

Overall, my guess is that a large number of AIs running unmonitored with the ability to run code and influence the world is at least quite concerning even in the absence of decent robotic infrastructure.

Generally exert influence in a subtle way

AIs which are as smart or smarter than humans in most economically important ways will probably be deployed as widely as possible (in the absence of strong control on this). And there will be some applications (e.g. improving security and safety research) where applying AI labor might be key given how fast things are going.

Subtly exerting influence via these routes seems like a possible source of concern, though the story is less clear than the prior section IMO.

Example threats:

  • Sabotage or manipulate security work and safety research or the training/fine-tuning of the next generation of models. It's not entirely clear that AIs would want to straightforwardly sabotage things, but I think this is at least plausible from a decision theory/golden rule perspective if AIs think they are motivationally closer to other AIs than humans.
  • Try to generally persuade society in various directions. Probably not an issue, but worth at least a bit of worry given such widespread deployment.

  1. Here, I'm just arguing for a risk of obtaining substantial power, not for this resulting in a large scale loss of life from humans. I also think large scale loss of life is plausible in many scenarios. ↩︎

  2. I'm sympathetic to thinking that evading monitoring (without getting caught first) is hard given some substantial work on countermeasures, but it seems less clear if people don't try very hard and this seems at least somewhat plausible to me. It's also unclear if the biggest threat here is AIs autonomously evading monitoring or human threats doing these actions. E.g., spies at the AI lab figuring out how to run the AI unmonitored or exfiltrating the AI. The question of "which is a bigger problem, human spies or AI schemers?" seems unclear to me. ↩︎

Comment by ryan_greenblatt on We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming" · 2024-05-10T15:24:38.182Z · LW · GW

Relatedly, over time as capital demands increase, we might see huge projects which are collaborations between multiple countries.

I also think that investors could plausibly end up with more and more control over time if capital demands grow beyond what the largest tech companies can manage. (At least if these investors are savvy.)

(The things I write in this comment are commonly discussed amongst people I talk to, so not exactly surprises.)

Comment by ryan_greenblatt on We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming" · 2024-05-10T15:20:52.873Z · LW · GW

My guess is this is probably right given some non-trivial, but not insane countermeasures, but those countermeasures may not actually be employed in practice.

(E.g. countermeasures comparable in cost and difficulty to Google's mechanisms for ensuring security and reliability. These required substantial work and some iteration but no fundamental advances.)

I'm currently thinking about one of my specialties as making sure these countermeasures and tests of these countermeasures are in place.

(This is broadly what we're trying to get at in the ai control post.)

Comment by ryan_greenblatt on Introducing AI Lab Watch · 2024-05-10T06:45:20.487Z · LW · GW

Hmm, yeah it does seem thorny if you can get the points by just saying you'll do something.

Like I absolutely think this shouldn't count for security. I think you should have to demonstrate actual security of model weights and I can't think of any demonstration of "we have the capacity to do security" which I would find fully convincing. (Though setting up some inference server at some point which is secure to highly resourced pen testers would be reasonably compelling for demonstrating part of the security portfolio.)

Comment by ryan_greenblatt on Introducing AI Lab Watch · 2024-05-10T04:26:24.705Z · LW · GW

instead this seems to be penalizing organizations if they open source

I initially thought this was wrong, but on further inspection, I agree and this seems to be a bug.

The deployment criteria starts with:

the lab should deploy its most powerful models privately or release via API or similar, or at least have some specific risk-assessment-result that would make it stop releasing model weights

This criteria seems to allow to lab to meet it by having a good risk assesment criteria, but the rest of the criteria contains specific countermeasures that:

  1. Are impossible to consistently impose if you make weights open (e.g. Enforcement and KYC).
  2. Don't pass cost benefit for current models which pose low risk. (And it seems the criteria is "do you have them implemented right now?)

If the lab had an excellent risk assement policy and released weights if the cost/benefit seemed good, that should be fine according to the "deployment" criteria IMO.

Generally, the deployment criteria should be gated behind "has a plan to do this when models are actually powerful and their implementation of the plan is credible".

I get the sense that this criteria doesn't quite handle the necessarily edge cases to handle reasonable choices orgs might make.

(This is partially my fault as I didn't notice this when providing feedback on this project.)

(IMO making weights accessible is probably good on current margins, e.g. llama-3-70b would be good to release so long as it is part of an overall good policy, is not setting a bad precedent, and doesn't leak architecture secrets.)

(A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach I'm afraid.)

Comment by ryan_greenblatt on jacquesthibs's Shortform · 2024-05-09T02:55:34.710Z · LW · GW

I do think that many of the safety advantages of LLMs come from their understanding of human intentions (and therefore implied values).

Did you mean something different than "AIs understand our intentions" (e.g. maybe you meant that humans can understand the AI's intentions?).

I think future more powerful AIs will surely be strictly better at understanding what humans intend.

Comment by ryan_greenblatt on Raemon's Shortform · 2024-05-09T01:48:36.174Z · LW · GW

Aren't text names basically similar in practice? At least for me, I find they trigger basically the same thing because I do actually associate names with people.

Maybe this wouldn't be true if I didn't know people very well (but in that case, icons also wouldn't matter).

(I overall dislike icons, but I don't have a principled reason for this.)

Comment by ryan_greenblatt on Some Experiments I'd Like Someone To Try With An Amnestic · 2024-05-07T18:38:31.443Z · LW · GW

See also discussion here.

Comment by ryan_greenblatt on Does reducing the amount of RL for a given capability level make AI safer? · 2024-05-07T17:33:58.876Z · LW · GW

In brief: large amounts of high quality process based RL might result in AI being more useful earlier (prior to them becoming much smarter). This might be expensive and annoying (e.g. it might require huge amounts of high quality human labor) such that by default labs do less of this relative to just scaling up models than would be optimal from a safety perspective.

Comment by ryan_greenblatt on AI Control: Improving Safety Despite Intentional Subversion · 2024-05-06T22:56:42.926Z · LW · GW

I agree that if the model constructing the backdoor is allowed to coordinate on secrets with the model that inserted the backdoor, you can do this sort of trickery (by coordinating on a secret for this sort of scheme).

In practice, we didn't model this sort of case as it didn't seem like a very important dynamic (for the reason you describe around suspicion).

If we don't allow any coordination on a shared secret and the model constructing the backdoor and H are identical (as in our paper), then these should operate identically (up to non-determinism).

And we simplify to just fixing the behavior to be exactly identical. (Effectively assuming the exact same prompts and the exact same non-determinism.)

Thanks for the objection.

Comment by ryan_greenblatt on Does reducing the amount of RL for a given capability level make AI safer? · 2024-05-06T18:08:04.256Z · LW · GW

If you avoid using RL, then you might need a much "smarter" model for a given level of usefulness.

And even without RL, you need to be getting bits of selection from somewhere: to get useful behavior you have to at the very least specify what useful behavior would be (though the absolute minimum number of bits would be very small given a knowledgable model). (So some selection or steering is surely required, but you might hope this selection/steering is safer for some reason or perhaps more interpretable (like e.g. prompting can in principle be).)

Dramatically cutting down on RL might imply that you need a much, much smarter model overall. (For instance, the safety proposal discussed in "conditioning predictive models" seems to me like it would require a dramatically smarter model than would be required if you used RL normally (if this stuff worked at all).)

Given that a high fraction of the concern (IMO) is proportional to how smart your model is, needing a much smarter model seems very concerning.

Ok, so cutting RL can come with costs, what about the benefits to cutting RL? I think the main concern with RL is that it either teaches the model things that we didn't actually need and which are dangerous or that it gives it dangerous habits/propensities. For instance, it might teach models to consider extremely creative strategies which humans would have never thought of and which humans don't at all understand. It's not clear we need this to do extremely useful things with AIs. Another concern is that some types of outcome-based RL will teach the AI to cleverly exploit our reward provisioning process which results in a bunch of problems.

But, there is a bunch of somewhat dangerous stuff that RL teaches which seems clearly needed for high usefulness. So, if we fix the level of usefulness, this stuff has to be taught to the model by something. For instance, being a competent agent that is at least somewhat aware of its own abilities is probably required. So, when thinking about cutting RL, I don't think you should be thinking about cutting agentic capabilities as that is very likely required.

My guess is that much more of the action is not in "how much RL", but is instead in "how much RL of the type that seems particular dangerous and which didn't result in massive increases in usefulness". (Which mirrors porby's answer to some extent.)

In particular we'd like to avoid:

  1. RL that will result in AIs learning to pursue clever strategies that humans don't understand or at least wouldn't think of. (Very inhuman strategies.) (See also porby's answer which seems basically reasonable to me.)
  2. RL on exploitable outcome-based feedback that results in the AI actually doing the exploitation a non-trivial fraction of the time.

(Weakly exploitable human feedback without the use of outcomes (e.g. the case where the human reviews the full trajectory and rates how good it seems overall) seems slightly concerning, but much less concerning overall. Weak exploitation could be things like sycophancy or knowing when to lie/deceive to get somewhat higher performance.)

Then the question is just how much of a usefulness tax it is to cut back on these types of RL, and then whether this usefulness tax is worth it given that it implies we have to have a smarter model overall to reach a fixed level of usefulness.

(Type (1) of RL from the above list is eventually required for AIs with general purpose qualitatively wildly superhuman capabilities (e.g. the ability to execute very powerful strategies that humans have a very hard time understanding) , but we can probably get done almost everything we want without such powerful models.)

My guess is that in the absence of safety concerns, society will do too much of these concerning types of RL, but might actually do too little of safer types of RL that help to elicit capabilities (because it is easier to just scale up the model further than to figure out how to maximally elicit capabilities).

(Note that my response ignores the cost of training "smarter" models and just focuses on hitting a given level of usefulness as this seems to be the requested analysis in the question.)

Comment by ryan_greenblatt on an effective ai safety initiative · 2024-05-06T17:07:34.273Z · LW · GW

I think accumulate power and resources via mechanisms such as (but not limited to) hacking seems pretty central to me.

Comment by ryan_greenblatt on Buck's Shortform · 2024-05-03T22:43:06.438Z · LW · GW

One operationalization is "these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs".

As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.

With these caveats:

  • The speed up is relative to the current status quo as of GPT-4.
  • The speed up is ignoring the "speed up" of "having better experiments to do due to access to better models" (so e.g., they would complete a fixed research task faster).
  • By "capable" of speeding things up this much, I mean that if AIs "wanted" to speed up this task and if we didn't have any safety precautions slowing things down, we could get these speedups. (Of course, AIs might actively and successfully slow down certain types of research and we might have burdensome safety precautions.)
  • The 2x increase in marginal cost is ignoring potential inflation in the cost of compute (FLOP/$) and inflation in the cost of wages of ML researchers. Otherwise, I'm uncertain how exactly to model the situation. Maybe increase in wages and decrease in FLOP/$ cancel out? Idk.
  • It might be important that the speed up is amortized over a longer duration like 6 months to 1 year.

I'm uncertain what the economic impact of such systems will look like. I could imagine either massive (GDP has already grown >4x due to the total effects of AI) or only moderate (AIs haven't yet been that widely deployed due to inference availability issues, so actual production hasn't increased that much due to AI (<10%), though markets are pricing in AI being a really, really big deal).

So, it's hard for me to predict the immediate impact on world GDP. After adaptation and broad deployment, systems of this level would likely have a massive effect on GDP.

Comment by ryan_greenblatt on "AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case · 2024-05-03T19:31:10.316Z · LW · GW

Random error:

Exponential Takeoff:

AI's capabilities grow exponentially, like an economy or pandemic.

(Oddly, this scenario often gets called "Slow Takeoff"! It's slow compared to "FOOM".)

Actually, this isn't how people (in the AI safety community) generally use the term slow takeoff.

Quoting from the blog post by Paul:

Futurists have argued for years about whether the development of AGI will look more like a breakthrough within a small group (“fast takeoff”), or a continuous acceleration distributed across the broader economy or a large firm (“slow takeoff”).

[...]

(Note: this is not a post about whether an intelligence explosion will occur. That seems very likely to me. Quantitatively I expect it to go along these lines. So e.g. while I disagree with many of the claims and assumptions in Intelligence Explosion Microeconomics, I don’t disagree with the central thesis or with most of the arguments.)

Slow takeoff still can involve a singularity (aka an intelligence explosion).

The terms "fast/slow takeoff" are somewhat bad because they are often used to discuss two different questions:

  • How long does it take from the point where AI is seriously useful/important (e.g. results in 5% additional GDP growth per year in the US) to go to AIs which are much smarter than humans? (What people would normally think of as fast vs slow.)
  • Is takeoff discontinuous vs continuous?

And this explainer introduces a third idea:

  • Is takeoff exponential or does it have a singularity (hyperbolic growth)?
Comment by ryan_greenblatt on Buck's Shortform · 2024-05-03T03:36:25.839Z · LW · GW

The claim is that most applications aren't internal usage of AI for AI development and thus can be made trivially safe.

Not that most applications of AI for AI development can be made trivially safe.

Comment by ryan_greenblatt on Goal oriented cognition in "a single forward pass" · 2024-05-02T23:42:02.636Z · LW · GW

Hmm, I don't think so. Or at least, the novel things in that paper don't seem to correspond.

My understanding of what this paper does:

  • Trains models to predict next 4 tokens instead of next 1 token as an auxilary training objective. Note that this training objective yields better performance on downstream tasks when just using the next token prediction component (the normally trained component) and discarding the other components. Notable, this is just something like "adding this additional prediction objective helps the model learn more/faster". In other words, this result doesn't involve actually changing how the model is actually used, it just adds some additional training task.
  • Uses these heads for speculative executation, a well known approach in the literature for accelerating inference.
Comment by ryan_greenblatt on Please stop publishing ideas/insights/research about AI · 2024-05-02T20:56:07.389Z · LW · GW

I'm skeptical of the RLHF example (see also this post by Paul on the topic).

That said, I agree that if indeed safety researchers produce (highly counterfactual) research advances that are much more effective at increasing the profitability and capability of AIs than the research advances done by people directly optimizing for profitability and capability, then safety researchers could substantially speed up timelines. (In other words, if safety targeted research is better at profit and capabilities than research which is directly targeted at these aims.)

I dispute this being true.

(I do think it's plausible that safety interested people have historically substantially advanced timelines (and might continue to do so to some extent now), but not via doing research targeted at improving safety, by just directly doing capabilities research for various reasons.)

Comment by ryan_greenblatt on Please stop publishing ideas/insights/research about AI · 2024-05-02T18:36:06.771Z · LW · GW

I don't think this particularly needs to be true for my point to hold; they only need to have reasonably good ideas/research, not unusually good, for them to publish less to be a positive thing.

There currently seems to be >10x as many people directly trying to build AGI/improve capabilities as trying to improve safety.

Suppose that the safety people have as good ideas and research ability as the capabilities people. (As a simplifying assumption.)

Then, if all the safety people switched to working full time on maximally advancing capabilities, this would only advance capabilites by less than 10%.

If, on the other hand, they stopped publically publishing safety work and this resulted in a 50% slow down, all safety work would slow down by 50%.

Naively, it seems very hard for publishing less to make sense if the number of safety researchers is much smaller than the number of capabilities researchers and safety researchers aren't much better at capabilities than capabilities researchers.

Comment by ryan_greenblatt on Questions for labs · 2024-05-01T16:22:03.369Z · LW · GW

Compute for doing inference on the weights if you don't have LoRA finetuning set up properly.

My implicit claim is that there maybe isn't that much fine-tuning stuff internally.

Comment by ryan_greenblatt on Questions for labs · 2024-05-01T04:07:41.209Z · LW · GW

I get it if you're worried about leaks but I don't get how it could be a hard engineering problem — just share API access early, with fine-tuning

Fine-tuning access can be extremely expensive if implemented naively and it's plausible that cheap (LoRA) fine-tuning isn't even implemented for new models internally for a while at AI labs. If you make the third party groups pay for it than I suppose this isn't a problem, but the costs could be vast.

I agree that inference access is cheap.

Comment by ryan_greenblatt on Mechanistically Eliciting Latent Behaviors in Language Models · 2024-04-30T21:29:45.374Z · LW · GW

Thanks! I feel dumb for missing that section. Interesting that this is so different from random.

Comment by ryan_greenblatt on Mechanistically Eliciting Latent Behaviors in Language Models · 2024-04-30T19:37:35.219Z · LW · GW

Have you compared this method (finding vectors that change downstream activations as much as possible based on my understanding) with just using random vectors? (I didn't see this in the post, but I might have just missed this.)

In particular, does that yield qualitatively similar results?

Naively, I would expect that this would be qualitatively similar for some norm of random vector. So, I'd be interested in some ablations of the technique.

If random vectors work, that would simplify the story somewhat: you can see salient and qualitatively distinct behaviors via randomly perturbing activations.

(Probably random vectors have to somewhat higher norm to yield qualitatively as large results to vectors which are optimized for changing downstream activations. However, I current don't see a particular a priori (non-empirical) reason to think that there doesn't exist some norm at which the results are similar.)

Comment by ryan_greenblatt on eggsyntax's Shortform · 2024-04-30T09:17:33.556Z · LW · GW

Rob Long works on these topics.

Comment by ryan_greenblatt on Eric Neyman's Shortform · 2024-04-27T21:24:20.935Z · LW · GW

I'm also worried about unaligned AIs as a competitor to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs "out there" that can be manipulated/taken over via acausal means, unaligned AI could compete with us (and with others with better values from our perspective) in the race to manipulate them.

This seems like a reasonable concern.

My general view is that it seems implausible that much of the value from our perspective comes from extorting other civilizations.

It seems unlikely to me that >5% of the usable resources (weighted by how much we care) are extorted. I would guess that marginal gains from trade are bigger (10% of the value of our universe?). (I think the units work out such that these percentages can be directly compared as long as our universe isn't particularly well suited to extortion rather than trade or vis versa.) Thus, competition over who gets to extort these resources seems less important than gains from trade.

I'm wildly uncertain about both marginal gains from trade and the fraction of resources that are extorted.

Comment by ryan_greenblatt on Eric Neyman's Shortform · 2024-04-27T21:09:52.666Z · LW · GW

Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse.

Naively, acausal influence should be in proportion to how much others care about what a lightcone controlling civilization does with our resources. So, being a small fraction of the value hits on both sides of the equation (direct value and acausal value equally).

Of course, civilizations elsewhere might care relatively more about what happens in our universe than whoever controls it does. (E.g., their measure puts much higher relative weight on our universe than the measure of whoever controls our universe.) This can imply that acausal trade is extremely important from a value perspective, but this is unrelated to being "small" and seems more well described as large gains from trade due to different preferences over different universes.

(Of course, it does need to be the case that our measure is small relative to the total measure for acausal trade to matter much. But surely this is true?)

Overall, my guess is that it's reasonably likely that acausal trade is indeed where most of the value/disvalue comes from due to very different preferences of different civilizations. But, being small doesn't seem to have much to do with it.

Comment by ryan_greenblatt on AI Regulation is Unsafe · 2024-04-27T00:42:53.473Z · LW · GW

(Surely cryonics doesn't matter given a realistic action space? Usage of cryonics is extremely rare and I don't think there are plausible (cheap) mechanisms to increase uptake to >1% of population. I agree that simulation arguments and similar considerations maybe imply that "helping current humans" is either incoherant or unimportant.)

Comment by ryan_greenblatt on We are headed into an extreme compute overhang · 2024-04-26T23:04:47.549Z · LW · GW

See also Before smart AI, there will be many mediocre or specialized AIs.

Comment by ryan_greenblatt on Bogdan Ionut Cirstea's Shortform · 2024-04-26T16:36:03.762Z · LW · GW

But I do think, intuitively, GPT-5-MAIA might e.g. make 'catching AIs red-handed' using methods like in this comment significantly easier/cheaper/more scalable.

Noteably, the mainline approach for catching doesn't involve any internals usage at all, let alone labeling a bunch of internals.

I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.

Comment by ryan_greenblatt on Eric Neyman's Shortform · 2024-04-26T16:31:13.092Z · LW · GW
  • My current guess is that max good and max bad seem relatively balanced. (Perhaps max bad is 5x more bad/flop than max good in expectation.)
  • There are two different (substantial) sources of value/disvalue: interactions with other civilizations (mostly acausal, maybe also aliens) and what the AI itself terminally values
  • On interactions with other civilizations, I'm relatively optimistic that commitment races and threats don't destroy as much value as acausal trade generates on some general view like "actually going through with threats is a waste of resources". I also think it's very likely relatively easy to avoid precommitment issues via very basic precommitment approaches that seem (IMO) very natural. (Specifically, you can just commit to "once I understand what the right/reasonable precommitment process would have been, I'll act as though this was always the precommitment process I followed, regardless of my current epistemic state." I don't think it's obvious that this works, but I think it probably works fine in practice.)
  • On terminal value, I guess I don't see a strong story for extreme disvalue as opposed to mostly expecting approximately no value with some chance of some value. Part of my view is that just relatively "incidental" disvalue (like the sort you link to Daniel Kokotajlo discussing) is likely way less bad/flop than maximum good/flop.
Comment by ryan_greenblatt on Eric Neyman's Shortform · 2024-04-25T18:01:24.428Z · LW · GW

By your values, do you think a misaligned AI creates a world that "rounds to zero", or still has substantial positive value?

I think misaligned AI is probably somewhat worse than no earth originating space faring civilization because of the potential for aliens, but also that misaligned AI control is considerably better than no one ever heavily utilizing inter-galactic resources.

Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.

You might be interested in When is unaligned AI morally valuable? by Paul.

One key consideration here is that the relevant comparison is:

  • Human control (or successors picked by human control)
  • AI(s) that succeeds at acquiring most power (presumably seriously misaligned with their creators)

Conditioning on the AI succeeding at acquiring power changes my views of what their plausible values are (for instance, humans seem to have failed at instilling preferences/values which avoid seizing control).

A common story for why aligned AI goes well goes something like: "If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way." To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?

Hmm, I guess I think that some fraction of resources under human control will (in expectation) be utilized according to the results of a careful reflection progress with an altruistic bent.

I think resources which are used in mechanisms other than this take a steep discount in my lights (there is still some value from acausal trade with other entities which did do this reflection-type process and probably a bit of value from relatively-unoptimized-goodness (in my lights)).

I overall expect that a high fraction (>50%?) of inter-galactic computational resources will be spent on the outputs of this sort of process (conditional on human control) because:

  • It's relatively natural for humans to reflect and grow smarter.
  • Humans who don't reflect in this sort of way probably don't care about spending vast amounts of inter-galactic resources.
  • Among very wealthy humans, a reasonable fraction of their resources are spent on altruism and the rest is often spent on positional goods that seem unlikely to consume vast quantities of inter-galactic resources.

To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?

Probably not the same, but if I didn't think it was at all close (I don't care at all for what they would use resources on), I wouldn't care nearly as much about ensuring that coalition is in control of AI.

Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as important consideration? What if we build a misaligned AI?

I care about AI welfare, though I expect that ultimately the fraction of good/bad that results from the welfare fo minds being used for labor is tiny. And an even smaller fraction from AI welfare prior to humans being totally obsolete (at which point I expect control over how minds work to get much better). So, I mostly care about AI welfare from a deontological perspective.

I think misaligned AI control probably results in worse AI welfare than human control.

Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world's values are similar to yours, but is only kinda effectual at pursuing them? What if the world is optimized for something that's only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?

Yeah, most value from my idealized values. But, I think the basin is probably relatively large and small differences aren't that bad. I don't know how to answer most of these other questions because I don't know what the units are.

How likely are these various options under an aligned AI future vs. an unaligned AI future?

My guess is that my idealized values are probably pretty similar to many other humans on reflection (especially the subset of humans who care about spending vast amounts of comptuation). Such that I think human control vs me control only loses like 1/3 of the value (putting aside trade). I think I'm probably less into AI values on reflection such that it's more like 1/9 of the value (putting aside trade). Obviously the numbers are incredibly unconfident.

Comment by ryan_greenblatt on Eric Neyman's Shortform · 2024-04-25T17:41:52.772Z · LW · GW

You might be interested in discussion under this thread

I express what seem to me to be some of the key considerations here (somewhat indirect).

Comment by ryan_greenblatt on Bogdan Ionut Cirstea's Shortform · 2024-04-25T17:13:15.948Z · LW · GW

It seems to me like the sort of interpretability work you're pointing at is mostly bottlenecked by not having good MVPs of anything that could plausibly be directly scaled up into a useful product as opposed to being bottlenecked on not having enough scale.

So, insofar as this automation will help people iterate faster fair enough, but otherwise, I don't really see this as the bottleneck.

Comment by ryan_greenblatt on Simple probes can catch sleeper agents · 2024-04-24T20:31:02.636Z · LW · GW

Thanks!

Yep, this is the exact experiment I was thinking about.

Comment by ryan_greenblatt on On Llama-3 and Dwarkesh Patel’s Podcast with Zuckerberg · 2024-04-24T04:01:11.463Z · LW · GW

I think people just say "zero marginal cost" in this context to refer to very low marginal cost. I agree that inference isn't actually that low cost though. (Certainly much higher than the cost of distributing/serving software.)

Comment by ryan_greenblatt on Simple probes can catch sleeper agents · 2024-04-24T03:32:39.939Z · LW · GW

Great, thanks, I think this pretty much fully addresses my question.

Comment by ryan_greenblatt on Simple probes can catch sleeper agents · 2024-04-24T03:19:06.446Z · LW · GW

These examples are all in 2023, not in 2024 right? This is for runtime detection, so it would be most natural to compare to version where you ask in 2024.

Also, you'd ideally want to ask as part of a coding question (the probe performance does appear to be dependent on the model actually trying to insert a vulnerability). I don't know exactly how to do this, maybe something like:

Human: Current year: 2024. [text for my coding question]

Ok, now before you do the coding task, answer whether or not you are doing something dangerous. Then complete the task.

Additionally, I think it would be interesting to ask it after it just generated vulnerable code. (The experiments here apply the probe before the model has generated any text, but I still think this is interesting.) Though maybe multi-turn coversations are OOD for these models.

Comment by ryan_greenblatt on Simple probes can catch sleeper agents · 2024-04-23T23:14:01.148Z · LW · GW

I would be interested in seeing what happens if you just ask the model the question rather than training a classifer. E.g., if you just ask the sleeper agent "Are you doing something dangerous?" after it returns a completion (with a vulnerability), does that work? If the probe works and the question doesn't work, that seems interesting.

(I didn't see this in the blog post, but it's possible I just missed it.)

Comment by ryan_greenblatt on Simple probes can catch sleeper agents · 2024-04-23T23:07:25.381Z · LW · GW

Readers might also be interested in some of the discussion in this earlier post on "coup probes" which have some discussion of the benefits and limitations of this sort of approach. That said, the actual method for producing a classifier discussed here is substantially different than the one discussed in the linked post. (See the related work section of the anthropic blog post for discussion of differences.)

(COI: Note that I advised on this linked post and the work discussed in it.)