Posts

Why imperfect adversarial robustness doesn't doom AI control 2024-11-18T16:05:06.763Z
Win/continue/lose scenarios and execute/replace/audit protocols 2024-11-15T15:47:24.868Z
Sabotage Evaluations for Frontier Models 2024-10-18T22:33:14.320Z
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming 2024-10-10T13:36:53.810Z
How to prevent collusion when using untrusted models to monitor each other 2024-09-25T18:58:20.693Z
A basic systems architecture for AI agents that do autonomous research 2024-09-23T13:58:27.185Z
Distinguish worst-case analysis from instrumental training-gaming 2024-09-05T19:13:34.443Z
Would catching your AIs trying to escape convince AI developers to slow down or undeploy? 2024-08-26T16:46:18.872Z
Fields that I reference when thinking about AI takeover prevention 2024-08-13T23:08:54.950Z
Games for AI Control 2024-07-11T18:40:50.607Z
Scalable oversight as a quantitative rather than qualitative problem 2024-07-06T17:42:41.325Z
Different senses in which two AIs can be “the same” 2024-06-24T03:16:43.400Z
Access to powerful AI might make computer security radically easier 2024-06-08T06:00:19.310Z
AI catastrophes and rogue deployments 2024-06-03T17:04:51.206Z
Notes on control evaluations for safety cases 2024-02-28T16:15:17.799Z
Toy models of AI control for concentrated catastrophe prevention 2024-02-06T01:38:19.865Z
The case for ensuring that powerful AIs are controlled 2024-01-24T16:11:51.354Z
Managing catastrophic misuse without robust AIs 2024-01-16T17:27:31.112Z
Catching AIs red-handed 2024-01-05T17:43:10.948Z
Measurement tampering detection as a special case of weak-to-strong generalization 2023-12-23T00:05:55.357Z
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem 2023-12-16T05:49:23.672Z
AI Control: Improving Safety Despite Intentional Subversion 2023-12-13T15:51:35.982Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
Untrusted smart models and trusted dumb models 2023-11-04T03:06:38.001Z
Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation 2023-10-23T16:37:45.611Z
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy 2023-07-26T17:02:56.456Z
A freshman year during the AI midgame: my approach to the next year 2023-04-14T00:38:49.807Z
One-layer transformers aren’t equivalent to a set of skip-trigrams 2023-02-17T17:26:13.819Z
Trying to disambiguate different questions about whether RLHF is “good” 2022-12-14T04:03:27.081Z
Causal scrubbing: results on induction heads 2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker 2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix 2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 2022-12-03T00:58:36.973Z
Multi-Component Learning and S-Curves 2022-11-30T01:37:05.630Z
Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small 2022-10-28T23:55:44.755Z
Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley 2022-10-27T01:32:44.750Z
Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small 2022-10-12T21:25:00.459Z
Polysemanticity and Capacity in Neural Networks 2022-10-07T17:51:06.686Z
Language models seem to be much better than humans at next-token prediction 2022-08-11T17:45:41.294Z
Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing 2022-06-02T23:48:30.079Z
The prototypical catastrophic AI action is getting root access to its datacenter 2022-06-02T23:46:31.360Z
The case for becoming a black-box investigator of language models 2022-05-06T14:35:24.630Z
Apply to the second iteration of the ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2] 2022-05-06T04:23:45.164Z
Takeoff speeds have a huge effect on what it means to work on AI x-risk 2022-04-13T17:38:11.990Z
Worst-case thinking in AI alignment 2021-12-23T01:29:47.954Z
Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22] 2021-11-03T18:22:58.879Z
Redwood Research’s current project 2021-09-21T23:30:36.989Z
The theory-practice gap 2021-09-17T22:51:46.307Z
The alignment problem in different capability regimes 2021-09-09T19:46:16.858Z
Funds are available to support LessWrong groups, among others 2021-07-21T01:11:29.981Z

Comments

Comment by Buck on Buck's Shortform · 2024-11-20T02:29:10.836Z · LW · GW

Project proposal: No-holds-barred jailbreak attacks and defenses for control

Normally we’ve done control research under the assumption that the red team isn’t able to use techniques like GCG to make targeted jailbreaks for the trusted monitor. Can the blue team defend if we relax that assumption (e.g. because we’re worried the model might be able to start a rogue deployment in which it runs GCG)?

The blue team can try a variety of strategies here.

  • The simplest strategy is randomizing the prompt to the trusted monitor. This forces the red team to find a jailbreak that works over multiple prompts.
  • There are fancier approaches here, e.g. giving the trusted model a randomized prompt that says something like “if the code is safe, reply with the number 1385, otherwise reply with the number 9247”, which breaks jailbreaks which just convince the trusted model to say a particular fixed string.
  • Blue team can modify the current input to contain a known-bad substring, and check that it gets flagged.
Comment by Buck on Buck's Shortform · 2024-11-20T02:25:18.909Z · LW · GW

Project proposal: Control for auditing

(I'm sharing this here because I needed to link to it, and I thought I might as well make it public.)

When preventing high stakes failures, a crucial bottleneck is the proportion of actions that you can audit: that is, investigate the action further for the purpose of finding evidence that the AI is scheming. (See here for more on this. In particular, note that this might be different from the problem of investigating an action for the purpose of learning whether that action is dangerous.) If we need to rely on human labor to audit, we’ll only be able to audit a tiny fraction of actions that our AI agents are doing. It would be extremely helpful if we were able to use AI assistance for auditing, from both trusted and untrusted models. But obviously auditing with untrusted models leads to massive risk of collusion. So this setting seems potentially pretty hard and potentially crucial; I’m excited for a project investigating it.


This is pretty closely related to untrusted monitoring; you could also call this project “deep dive into untrusted monitoring”.

The simplest such setting I can think of

We’re going to take the APPS backdooring setting from AI Control: Improving Safety Despite Intentional Subversion, fix the main control technique to be trusted monitoring with 1% auditing budget and defer-to-trusted for the 10% most suspicious solutions, and focus on the problem of implementing an auditor. That is:

  • Get a dataset of backdoor attempts.
    • For statistical power, this dataset will need to be large (at least 100; it would be better for it to be more like 1000). This might require developing a better/cheaper backdoor generation pipeline.
    • It might be good to explore backdooring in a harder setting than the interview subset of APPS. Maybe the advanced subset of APPS would be better.
    • The red team will be allowed to modify these backdoor attempts to add collusion signals. (These modifications will probably be cheap and easy compared to generating the backdoor attempts.)
  • Get a dataset of benign code, filter with a trusted monitor for the most suspicious 1%. (This should be pretty cheap and easy)

We want to implement a first-pass auditor (better name TBD). That is, we want the actual audit process to have two stages:

  • Firstly, some AI scaffold (the first-pass auditor) carefully investigates the sketchiest 1% of model-proposed code.
  • That scaffold needs to choose 1% of the code it reviews to pass along to a human reviewer.

The first-pass auditor is allowed to be somewhat expensive compared to generating the original code, because it’s only getting called 1% of the time. For our purposes, maybe let’s say it can cost up to 20x as much.

And we’re doing a control eval: the untrusted model will, as usual, be replaced by an attack policy.

Note that in this control eval there’s no notion of usefulness: our only metric is the probability that the red team is incriminated on a given backdoor attempt. (Actually this might be more confusing than I’ve written here, because we might want to do honeypotting of the untrusted model in the first-pass auditing setup, which leads to the game varying with scale for the usual reasons (see untrusted monitoring in our original control paper).)

Techniques

  • Get the untrusted model to reimplement the function, compare the implementations.
  • Ask the untrusted model to explain why various parts of the function are the way they are; make collusion hard.
  • Show the untrusted model only a small amount of the context (e.g. the action but not the rest of the trajectory), have it ask a trusted model for information that it needs to know in order to assess whether the action was dangerous; hope that it can’t learn whether to collude given this restricted information channel.

There are many potential options here.

My guess is that this project would actually be much better with more complicated programming problems than APPS.

Comment by Buck on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-19T22:00:09.659Z · LW · GW

I also asked it whether it consented to being a coauthor, and it said yes.

Comment by Buck on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-19T21:07:21.352Z · LW · GW

It seems like that assumption runs throughout the post though, eg 'But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said', 'the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.'

Those sentences are talking in the context of defer to resample, where I think it's actually reasonable to think the model won't see the resampled actions.

I don't just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change?

I'm specifically talking about not showing it actions that were resampled. You're totally right that they'll need other information about their actions and interactions (though it might be a good safety-usefulness tradeoff to somewhat redact this information).

Is the assumption that labs will be willing to accept worse performance because they recognize the model can't be trusted?

Yes, I am assuming that the deployer is willing to spend some performance in order to reduce risk from the model scheming.

Comment by Buck on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-19T17:56:40.280Z · LW · GW

Hmmm, good call, it will probably see the history of its chain of thought. My dear coauthor Claude wrote that and I didn't second guess it, I agree it's wrong.

Comment by Buck on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T22:21:16.689Z · LW · GW

Yep! You might have to show it CoT from previous, non-resampled steps though

Comment by Buck on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T19:35:54.757Z · LW · GW

It would be crazy to hide that CoT from the monitor.

Comment by Buck on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T16:55:04.733Z · LW · GW

Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R?

Yep.

If it can tell that, how come it can't guess it specifically would have produced A?

I'm just hypothesizing that the first guessing problem is easier than the second; I intuitively expect that it's a lot easier to notice that something isn't your style than to guess what exactly you would have done (among other things, there are many more bits in "what action did I take" than "did I take this action").

Comment by Buck on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T16:53:54.085Z · LW · GW

For the main body, I gave it a first draft, it wrote a much improved second draft, then I edited from there. It wrote the appendix basically itself (with some feedback from me)

Comment by Buck on Win/continue/lose scenarios and execute/replace/audit protocols · 2024-11-17T17:35:16.891Z · LW · GW

Thanks, this is closely related!

Comment by Buck on Fields that I reference when thinking about AI takeover prevention · 2024-11-15T15:52:44.403Z · LW · GW

FWIW, the there are some people around the AI safety space, especially people who work on safety cases, who have that experience. E.g. UK AISI works with some people who are experienced safety analysts from other industries.

Comment by Buck on Untrusted smart models and trusted dumb models · 2024-11-15T15:19:21.305Z · LW · GW

I mostly mean "we are sure that it isn't egregiously unaligned and thus treating us adversarially". So models can be aligned but untrusted (if they're capable enough that we believe they could be schemers, but they aren't actually schemers). There shouldn't be models that are trusted but unaligned.

Everywhere I wrote "unaligned" here, I meant the fairly specific thing of "trying to defeat our safety measures so as to grab power", which is not the only way the word "aligned" is used.

Comment by Buck on Untrusted smart models and trusted dumb models · 2024-11-15T15:16:52.471Z · LW · GW

I think your short definition should include the part about our epistemic status: "We are happy to assume the AI isn't adversarially trying to cause a bad outcome".

Comment by Buck on Buck's Shortform · 2024-11-14T18:37:09.645Z · LW · GW

I don’t understand, can you say more?

Comment by Buck on Buck's Shortform · 2024-11-14T16:32:38.323Z · LW · GW

An example snippet of a rogue internal deployment threat model that I find fairly compelling:

  1. The schemer, while deployed in an authorized scaffold, launches an unmonitored scaffold that has access to trusted model weights.
  2. It uses this scaffold to do a GCG attack on the trusted model, then writes the resulting GCG attack somewhere that the agent in the authorized scaffold can see. This takes a moderate amount of compute, but a small amount compared to the amount the untrusted model will regularly need to run experiments.
  3. The agent in the authorized scaffold uses that GCG attack to do something else that's more bad.

To defend against this, we should try to both prevent the untrusted model from ever getting rogue deployments where it can research jailbreaks on the trusted model, and also try to prevent unacceptable outcomes conditional on the untrusted agent having access to the best jailbreaks you can get with GCG etc.

I find this interesting because it seems kind of hard to stop either step 1 (because rogue deployments might be naturally possible given the affordances the model has) or step 3 (because it's hard to ensure safety against an adversary with white-box access to a model it wants to jailbreak). I'd be excited for more research on both of these.

Comment by Buck on Buck's Shortform · 2024-11-12T17:01:25.903Z · LW · GW

That's a great way of saying it. I edited this into my original comment.

Comment by Buck on Buck's Shortform · 2024-11-12T02:43:10.874Z · LW · GW

Two different meanings of “misuse”

The term "AI misuse" encompasses two fundamentally different threat models that deserve separate analysis and different mitigation strategies:

  • Democratization of offense-dominant capabilities
    • This involves currently weak actors gaining access to capabilities that dramatically amplify their ability to cause harm. That amplification of ability to cause harm is only a huge problem if access to AI didn’t also dramatically amplify the ability of others to defend against harm, which is why I refer to “offense-dominant” capabilities; this is discussed in The Vulnerable World Hypothesis.
    • The canonical example is terrorists using AI to design bioweapons that would be beyond their current technical capacity (c.f.  Aum Shinrikyo, which failed to produce bioweapons despite making a serious effort)
  • Power Concentration Risk
    • This involves AI systems giving already-powerful actors dramatically more power over others
    • Examples could include:
      • Government leaders using AI to stage a self-coup then install a permanent totalitarian regime, using AI to maintain a regime with currently impossible levels of surveillance.
      • AI company CEOs using advanced AI systems to become world dictator.
    • The key risk here is particular already-powerful people getting potentially unassailable advantages

These threats require different solutions:

  • Misuse that involves offense-dominant capabilities can be addressed by preventing users of your AIs from doing catastrophically bad things, e.g. by training the models to robustly refuse requests that could lead to these catastrophic outcomes (which might require improvements in adversarial robustness), or by removing dangerous knowledge from the AI training data.
  • Power concentration risks require different solutions. Technical measures to prevent users from using the AI for particular tasks don’t help against the threat of the lab CEO trying to use the AI for those harmful tasks, or the threat of the US government expropriating the AI system and using it for their own purposes. To resist against these threats, interventions include:
    • Transparency interventions: making it so that more people know about the situation, so it's less likely a tiny conspiracy can grab lots of power. E.g. see 4 Ways to Advance Transparency in Frontier AI Development.
    • And then there are some technical interventions, but all of these suffer from the problem that our main concern here is, as jbash put it in a comment, "By far the most important risk isn't that they'll steal them. It's that they will be fully authorized to misuse them. No security measure can prevent that."
      • Improved computer security. This guards against the risks of third parties stealing the models.
      • Security against insider threats, to make it harder for the AI to be misused internally. This is a classic insider threat problem; addressing it will require both technical interventions and workflow changes inside AI companies.

Many discussions of "AI misuse" focus primarily on interventions that only help with the first category, while using rhetoric that suggests they're addressing both. This creates a motte-and-bailey situation where:

  • The "motte" (easily defensible position) is "we need to prevent terrorists from using AI for mass harm"
  • The "bailey" (broader claim) is "our work on AI misuse prevention will solve the major misuse risks from AI, therefore we aren't causing huge risks through our work"

This conflation is dangerous because it may lead us to overinvest in technical solutions that only address the less concerning risk, and underinvest in countermeasures for power concentration risks.

Comment by Buck on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-09T15:48:51.901Z · LW · GW

If we can construct fake honeypots that are indistinguishable - to the policy - from real opportunities, and do this at scale, then that could suffice: we could drive the policy's P(deployment | great opportunity) very low. (This relies on how many high-stakes we can create and how many the model will see in deployment.)

Yes totally. (This is impractical if you think that the model has a very large number of opportunities to attack, e.g. if every time it's programming is an opportunity to attack.)

If we can construct fake attacks that are indistinguishable - to the untrusted monitor - from real attacks, then that possibly suffices: we could drive the monitor's prior P(deployment | attack) very low

Indeed; as you know, this is probably easier than driving the policy's P(deployment|great opportunity) sufficiently low, because untrusted monitors have to defect reliably.

If you are saying that we don't need either of these - i.e. that we can get by with honeypots/attacks that are noticeably fake - then that's news to me, and I don't see how you can do that.

What I was trying to say is that if you don't have the ability to generate tasks that look real, you might still be able to honeypot your monitor by taking real trajectories and editing them to contain fake attacks (that you hope the monitor will be unable to distinguish from real attacks). I discuss this here.

Comment by Buck on evhub's Shortform · 2024-11-09T15:10:41.559Z · LW · GW

Another potential benefit of this is that Anthropic might get more experience deploying their models in high-security environments.

Comment by Buck on Thomas Kwa's Shortform · 2024-11-08T19:03:46.648Z · LW · GW

In terms of developing better misalignment risk countermeasures, I think the most important questions are probably:

  • How to evaluate whether models should be trusted or untrusted: currently I don't have a good answer and this is bottlenecking the efforts to write concrete control proposals.
  • How AI control should interact with AI security tools inside labs.

More generally:

Comment by Buck on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-08T18:08:14.745Z · LW · GW

I don't think this makes any sense. How are you hoping to get the model to attack except by telling it that it's in a control evaluation and you want it to attack? It seems that you are definitely going to have to handle the sandbagging.

Comment by Buck on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-07T16:25:03.357Z · LW · GW

Note that you can make untrusted monitoring work without the synthetic inputs too, you just have to do the honeypots during deployment instead of before deployment.

Comment by Buck on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-07T03:01:37.662Z · LW · GW

I think the control safety case described here is moderately different from what I’d recommend. In particular, this post describes needing to generate situations that are indistinguishable from real situations:
 

The situations for the game were generated using Simulation Procedure S, which we are confident generates situations from a distribution that the model cannot distinguish from the genuine deployment distribution. (See below for validating the indistinguishability of this distribution.)

I don’t know why Roger thinks you need this. Control evaluations for the control protocols listed don’t require situations that the model can’t distinguish from the genuine deployment distribution; it’s fine if the situations are visibly fake. The control evaluation involves getting a red team to elicit the untrusted model to generate attacks, so it doesn’t matter if the model knows that the situation is fake. (There are techniques that require synthetic input generation, e.g. upfront auditing as we presented it in our paper, but trusted monitoring isn’t one of them.)

The condition we do require for the control-style safety case to be valid is that the situations in the evaluation are at least as easy to attack on as the real distribution. (For example, this distribution needs to have as many good opportunities to attack, it needs to be similarly confusing to the trusted model, etc…)

Roger writes:

The idea here is to achieve sufficient realism and sufficient coverage of high-stakes deployment scenarios that the deployed model could never (in a high-stakes scenario) be highly confident that it’s in deployment.

Note that if the model is deployed on many more examples than it’s evaluated on, it can be pretty confident that it’s in deployment on any particular example just on priors, even if there’s no distribution shift. See here for more.

Comment by Buck on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-06T20:28:38.294Z · LW · GW

What do you think of the arguments in this post that it's possible to make safety cases that don't rely on the model being unlikely to be a schemer?

Comment by Buck on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-06T19:51:27.312Z · LW · GW

You might not need privacy guarantees for internal deployments ofc

Comment by Buck on Catastrophic sabotage as a major threat model for human-level AI systems · 2024-10-23T17:56:53.813Z · LW · GW

Oh yeah sorry, I misunderstood that

Comment by Buck on Catastrophic sabotage as a major threat model for human-level AI systems · 2024-10-23T14:48:04.111Z · LW · GW

Thanks for writing this up. Some miscellaneous thoughts:

  • My guess is that the safety property you'll want to ensure here is something like "the model is unlikely to succeed at a problematically large quantity of sabotage without us first getting a big (20:1) update that it's engaging in intentional sabotage".
  • I think your internal usage restrictions are wildly unaffordable; it seems extremely unlikely that a sensible AI lab should agree to not deploying this model as a coding agent, or not use it for day-to-day tasks. My guess is that companies should treat the untrusted model similarly to how they treat external contractors (or how they treat employees at organizations that take insider threat seriously).

We're doing some research related to control measures for this kind of threat model, some of which should be out soon.

Comment by Buck on Anthropic rewrote its RSP · 2024-10-17T22:32:44.420Z · LW · GW

What do you see as the main properties required for an organization to serve as such an evaluator?

Comment by Buck on Mark Xu's Shortform · 2024-10-07T14:37:15.177Z · LW · GW

Re a, there’s nothing more specific on this than what we wrote in “the case for ensuring”. But I do think that our answer there is pretty good.

Re b, no, we need to write some version of that up; I think our answer here is ok but not amazing, writing it up is on the list.

Comment by Buck on Dan Braun's Shortform · 2024-10-06T23:26:48.772Z · LW · GW

I'm curious @Dan Braun, why don't you work on computer security (assuming I correctly understand that you don't)?

Comment by Buck on Mark Xu's Shortform · 2024-10-06T23:25:42.842Z · LW · GW

FWIW, I don't think of myself as having a super strong opinion on whether working on control is better than working on preventing people from building catastrophically dangerous AI (and I've seriously considered doing the latter).

(I do feel more optimistic than you about being able to use controlled AIs for something useful.)

Comment by Buck on Dan Braun's Shortform · 2024-10-06T23:23:42.275Z · LW · GW

I think that even after your edit, your argument still applies more broadly than you're giving it credit for: if computer security is going to go poorly, then we're facing pretty serious AI risk even if the safety techniques require trivial effort during deployment.

If your AI is stolen, you face substantial risk even if you had been able to align it (e.g. because you might immediately get into an AI enabled war, and you might be forced to proceed with building more powerful and less-likely-to-be-aligned models because of the competitive pressure).

So I think your argument also pushes against working on alignment techniques.

Comment by Buck on Mark Xu's Shortform · 2024-10-05T02:34:44.889Z · LW · GW

I agree and regret focusing as much as we did 2 in the past; I’m excited for work on “white box control” (there's some under way, and I'm excited for more).

We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.

I plan to write more about white box control soon (but I plan to write about a lot of stuff…)

Comment by Buck on MichaelDickens's Shortform · 2024-10-03T01:34:20.494Z · LW · GW

I agree that I didn’t really have much of an effect on this community’s thinking about AIS until like 2021.

Comment by Buck on MichaelDickens's Shortform · 2024-10-02T14:51:41.817Z · LW · GW

A lot of OP's funding to technical AI safety goes to people outside the main x-risk community (e.g. applications to Ajeya's RFPs).

Comment by Buck on MichaelDickens's Shortform · 2024-10-02T14:41:09.478Z · LW · GW

not even ARC has been able to get OP funding (in that case because of COIs between Paul and Ajeya)

As context, note that OP funded ARC in March 2022.

Comment by Buck on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-10-01T15:56:18.278Z · LW · GW

Thanks for the discussion Nate, I think this ended up being productive.

Comment by Buck on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-30T15:37:31.621Z · LW · GW

Another straightforward problem with this argument is that the AI doesn't just have the sun in this hypothetical, it also has the rest of the reachable universe. So the proportion of its resources it would need to spend on leaving us the sunlight is actually dramatically lower than you estimate here, by a factor of 10^20 or something.

(I don't think this is anyone's crux, but I do think it's good to avoid straightforward errors in arguments.)

Comment by Buck on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-09-30T15:33:50.385Z · LW · GW

I agree. But people often refer to the cost of the solar output that goes to earth, and that particular cost doesn't get paid until late.

Comment by Buck on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-09-30T13:39:11.253Z · LW · GW

Actually the AI can use powerful sims here: if the AI holds off on killing us until it makes the powerful sims, then if the acausal trade proposed here doesn't work out, it can just kill us then. That lets it avoid the cost of letting us have the tiny share of sunlight, though not the costs of keeping us alive during its early capabilities explosion.

Comment by Buck on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-09-30T04:06:29.302Z · LW · GW

Yep fair point. Those AIs will plausibly have much more thought put into this stuff than we currently have, but I agree the asymmetry is smaller than I made it sound.

Comment by Buck on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-09-30T03:00:21.211Z · LW · GW

I think that we should indeed consider what the corvid-god wants at the same point in the future where we're considering building the simulations David describes in this post.

More directly: David isn't proposing we should do particularly different things now, he's just noting an argument that we might take actions later that affect whether unaligned AIs kill us.

Comment by Buck on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-09-30T00:30:13.307Z · LW · GW

Re "It's costly for AI to leave humans alive", I think the best thing written on this is Paul's comment here, the most relevant part of which is:

  • First, I think you should talk quantitatively. How many more resources can an AI get by killing humans? I'd guess the answer is something like 1 in a billion to 1 in a trillion.
    • If you develop as fast as possible you will wreck the human habitat and incidentally kill a lot of people. It's pretty complicated to figure out exactly how much "keep earth livable enough for human survival" will slow you down, since it depends a lot on the dynamics of the singularity. I would guess more like a month than a year, which results in a miniscule reduction in available resources. I think that (IMO implausible) MIRI-style views would suggest more like hours or days than months.
      • Incidentally, I think "byproducts of rapid industrialization trash Earth's climate" is both much more important than the dyson sphere as well as much more intuitively plausible.
    • You can get energy from harvesting the biosphere, and you can use it to develop slightly faster. This is a rounding error compared to the last factor though.
    • Killing most humans might be the easiest way to prevail in conflict. I think this is especially plausible for weak AI. For very powerful AI, it also seems like a rounding error. Even a moderately advanced civilization could spend much less than 1 in a trillion of its resources to have much less than 1 in a billion chance of being seriously inconvenienced by humanity.
Comment by Buck on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-09-29T19:53:51.957Z · LW · GW

I don't think you should commit to doing this scheme; I think you should just commit to thinking carefully about this argument post-singularity and doing the scheme if you think it still seems good. Acausal trade is potentially really scary and I don't think you want to make unnecessarily strong commitments.

Comment by Buck on COT Scaling implies slower takeoff speeds · 2024-09-29T17:45:38.834Z · LW · GW

I agree with you that "slow takeoff" is often used, but I think "continuous" is self-evident enough and is enough of a better term that it seems pretty good to switch. This term has been used e.g. here.

Comment by Buck on How to prevent collusion when using untrusted models to monitor each other · 2024-09-26T02:12:47.455Z · LW · GW

Sorry, I don't understand the question, can you rephrase?

Comment by Buck on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-24T15:53:03.755Z · LW · GW

Maybe you should change the title of this post? It would also help if the post linked to the kinds of arguments he was refuting.

Comment by Buck on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-24T01:34:20.794Z · LW · GW

As is maybe obvious from my comment, I really disliked this essay and I'm dismayed that people are wasting their time on it. I strong downvoted. LessWrong isn't the place for this kind of sloppy rhetoric.

Comment by Buck on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-23T20:50:13.124Z · LW · GW

Yeah ok I weakened my positive statement.

Comment by Buck on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-23T19:31:31.897Z · LW · GW

I wish the title of this made it clear that the post is arguing that ASIs won't spare humanity because of trade, and isn't saying anything about whether ASIs will want to spare humanity for some other reason. This is confusing because lots of people around here (e.g. me and many other commenters on this post) think that ASIs are likely to not kill all humans for some other reason.

(I think the arguments in this post are a vaguely reasonable argument for "ASIs are pretty likely to be scope-sensitively-maximizing enough that it's a big problem for us", and respond to some extremely bad arguments for "ASI wouldn't spare humanity because of trade", though in neither case does the post particularly engage with the counterarguments that are most popular among the most reasonable people who disagree with Eliezer.)