Posts

Sabotage Evaluations for Frontier Models 2024-10-18T22:33:14.320Z
Simple probes can catch sleeper agents 2024-04-23T21:10:47.784Z
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training 2024-01-12T19:51:01.021Z

Comments

Comment by David Duvenaud (david-duvenaud) on Sabotage Evaluations for Frontier Models · 2024-10-21T20:27:37.490Z

I basically agree with everything @Lukas Finnveden and @ryan_greenblatt said above.  Another relevant "failure" state is that we might only be able to come up with expensive or capability-reducing mitigations, meaning everyone will have an incentive to disable or work around them.

Comment by David Duvenaud (david-duvenaud) on Sabotage Evaluations for Frontier Models · 2024-10-21T20:10:58.673Z

does the threat model include the possibility of that sabotage evaluation itself being subject to sabotage ... "Underperforming on dangerous-capability evaluations" would arguably include this

Yes, we want to address the possibility of sandbagging on sabotage evaluations too.  Thanks for noting that our phrasing is ambiguous on this point.

I'm very skeptical of the ability of any evaluation that works strictly on inputs and outputs (not looking "inside" the model) to address the threat model where those same evals are subject to sabotage without some relevant assumptions.

We can easily incorporate white-box monitoring into any of these evaluations.  This version only included black-box monitoring for simplicity, but all of the evals except for sandbagging involve a monitor producing a suspiciousness score.  That score could be informed by a white-box / internals-based side channel without changing anything else.
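To make the "without changing anything else" point concrete, here is a minimal sketch of how an internals-based score could feed into the existing monitor, simply by blending it with the black-box score before thresholding. The function name, the linear weighting, and the threshold are all my illustrative assumptions, not the evaluation's actual pipeline:

```python
# Toy sketch: a monitor blends a black-box suspiciousness score with a
# white-box (internals-based) probe score. Everything downstream of the
# combined score -- flagging, review, the rest of the eval -- is unchanged.

def flag_transcript(black_box_score: float,
                    white_box_score: float,
                    weight: float = 0.5,
                    threshold: float = 0.8) -> tuple[float, bool]:
    """Return (combined suspiciousness, whether to flag for review).

    Both input scores are assumed to lie in [0, 1]. The weighting and
    threshold here are hypothetical, chosen only for illustration.
    """
    combined = weight * black_box_score + (1 - weight) * white_box_score
    return combined, combined >= threshold
```

Any other aggregation (max over channels, a learned combiner) would slot in the same way; the point is only that the white-box signal is a drop-in side channel.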

Can that idea, "not powerful enough", be demonstrated based on evals in a non-circular way (without relying on evals that could be sabotaged within the threat model)? It's not clear to me whether that is possible.

Great question.  We didn't discuss it in the paper, because it was long enough already, but we've certainly thought about this problem when designing the evals and related safety cases.  A rough sketch of the argument looks something like this:  "We're confident that our anti-refusal training is effective because it uses supervised fine-tuning.  To double-check this empirically, our sandbagging capabilities eval indicates that even a model trained to sandbag can't effectively do so after our anti-refusal / capabilities elicitation procedure."

One problem with this argument is that our sandbagging eval didn't include the exact sabotage tasks that we'd like to make claims about - it couldn't yet, because the evals in the paper were developed in parallel.  Probably we'll want future versions of the sandbagging eval to either cover these skills and formats more closely, or include tests to demonstrate consistent generalization across domains.

Comment by David Duvenaud (david-duvenaud) on Fixed Point: a love story · 2023-07-09T23:29:18.916Z

Bravo.  I'm really glad you're writing on this theme.  Where I've been stuck lately in thinking about the future is trying to figure out what is the best we can coherently hope for.  Which, as your story notes, seems like it involves tradeoffs between remaining human and having meaningful autonomy.

Comment by David Duvenaud (david-duvenaud) on What can we learn from Lex Fridman’s interview with Sam Altman? · 2023-03-27T17:05:20.362Z

We’ve got the [molecular?] problem

I think he was saying "Moloch", as in the problem of avoiding a tragedy of the commons.

Comment by david-duvenaud on [deleted post] 2022-11-29T03:02:22.881Z

The nice thing about lenders losing money on unprofitable businesses is that it's a self-correcting problem to some degree - those lenders can only lose all their money once!  And if consumers get part of that money, it's effectively a charity.

Sometimes it seems like investors might not even consider whether their investments will make money, but most of the ones who last very long do care.

Comment by david-duvenaud on [deleted post] 2022-11-28T15:54:20.119Z

It is an option, but it only allows for relatively slow growth, and only for businesses that aren't capital-intensive.  So it's a good option for something like consulting firms or law firms, where there aren't really any up-front costs that need to be paid off over time.

But imagine you invent a new type of power plant.  Many plants benefit from economies of scale, and also take 10-20 years to pay off their up-front construction costs.   How could a co-op finance building even a single plant?  Similar arguments apply to most forms of manufacturing, or really anything that can be mass-produced.
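To put rough numbers on the payback arithmetic (every dollar figure here is hypothetical, chosen only to land in the 10-20 year range mentioned above):

```python
# Toy payback-period calculation for a capital-intensive plant.
# All figures are in $millions and entirely hypothetical.

def payback_years(upfront_cost: float,
                  annual_revenue: float,
                  annual_operating_cost: float) -> float:
    """Years of net operating profit needed to recoup construction costs."""
    net = annual_revenue - annual_operating_cost
    if net <= 0:
        raise ValueError("plant never recoups its construction cost")
    return upfront_cost / net

# e.g. a $2,000M plant earning $250M/yr against $100M/yr in operating
# costs nets $150M/yr, so it takes ~13.3 years to pay off construction -
# far too long to self-finance out of retained earnings.
```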

Even in the case where organic growth is sustainable, it'll usually have to compete with much faster inorganic, financed growth.