Posts

Sabotage Evaluations for Frontier Models 2024-10-18T22:33:14.320Z
Simple probes can catch sleeper agents 2024-04-23T21:10:47.784Z
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training 2024-01-12T19:51:01.021Z

Comments

Comment by David Duvenaud (david-duvenaud) on Sabotage Evaluations for Frontier Models · 2024-10-21T20:27:37.490Z

I basically agree with everything @Lukas Finnveden and @ryan_greenblatt said above.  Another relevant "failure" state is that we might only be able to come up with expensive or capability-reducing mitigations, meaning everyone will have an incentive to disable or work around them.

Comment by David Duvenaud (david-duvenaud) on Sabotage Evaluations for Frontier Models · 2024-10-21T20:10:58.673Z

does the threat model include the possibility of that sabotage evaluation itself being subject to sabotage ... "Underperforming on dangerous-capability evaluations" would arguably include this

Yes, we want to address the possibility of sandbagging on sabotage evaluations too.  Thanks for noting that our phrasing is ambiguous on this point.

I'm very skeptical of the ability of any evaluation that works strictly on inputs and outputs (not looking "inside" the model) to address the threat model where those same evals are subject to sabotage without some relevant assumptions.

We can easily incorporate white-box monitoring into any of these evaluations.  This version only included black-box monitoring for simplicity, but all of the evals except for sandbagging involve a monitor producing a suspiciousness score.  That score could be informed by a white-box / internals-based side channel without changing anything else.
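To make the "without changing anything else" point concrete, here is a minimal sketch of how an internals-based score could feed into the existing monitor, simply by blending it with the black-box score before thresholding. The function name, the linear weighting, and the threshold are all my illustrative assumptions, not the evaluation's actual pipeline:

```python
# Toy sketch: a monitor blends a black-box suspiciousness score with a
# white-box (internals-based) probe score. Everything downstream of the
# combined score -- flagging, review, the rest of the eval -- is unchanged.

def flag_transcript(black_box_score: float,
                    white_box_score: float,
                    weight: float = 0.5,
                    threshold: float = 0.8) -> tuple[float, bool]:
    """Return (combined suspiciousness, whether to flag for review).

    Both input scores are assumed to lie in [0, 1]. The weighting and
    threshold here are hypothetical, chosen only for illustration.
    """
    combined = weight * black_box_score + (1 - weight) * white_box_score
    return combined, combined >= threshold
```

Any other aggregation (max over channels, a learned combiner) would slot in the same way; the point is only that the white-box signal is a drop-in side channel.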

Can that idea, "not powerful enough", be demonstrated based on evals in a non-circular way (without relying on evals that could be sabotaged within the threat model)? It's not clear to me whether that is possible.

Great question.  We didn't discuss it in the paper, because it was long enough already, but we've certainly thought about this problem when designing the evals and related safety cases.  A rough sketch of the argument looks something like this:  "We're confident that our anti-refusal training is effective because it uses supervised fine-tuning.  To double-check this empirically, our sandbagging capabilities eval indicates that even a model trained to sandbag can't effectively do so after our anti-refusal / capabilities elicitation procedure."

One problem with this argument is that our sandbagging eval didn't include the exact sabotage tasks that we'd like to make claims about - it couldn't yet, because the evals in the paper were developed in parallel.  Probably we'll want future versions of the sandbagging eval to either cover these skills and formats more closely, or include tests to demonstrate consistent generalization across domains.

Comment by David Duvenaud (david-duvenaud) on Fixed Point: a love story · 2023-07-09T23:29:18.916Z

Bravo.  I'm really glad you're writing on this theme.  Where I've been stuck lately in thinking about the future is trying to figure out what is the best we can coherently hope for.  Which, as your story notes, seems like it involves tradeoffs between remaining human and having meaningful autonomy.

Comment by David Duvenaud (david-duvenaud) on What can we learn from Lex Fridman’s interview with Sam Altman? · 2023-03-27T17:05:20.362Z

We’ve got the [molecular?] problem

I think he was saying "Moloch", as in the problem of avoiding a tragedy of the commons.

Comment by david-duvenaud on [deleted post] 2022-11-29T03:02:22.881Z

The nice thing about lenders losing money on unprofitable businesses is that it's a self-correcting problem to some degree - those lenders can only lose all their money once!  And if consumers get part of that money, it's effectively a charity.

Sometimes it seems like investors might not even consider whether their investments will make money, but most of the ones who last very long do care.

Comment by david-duvenaud on [deleted post] 2022-11-28T15:54:20.119Z

It is an option, but it only allows for relatively slow growth, and only for businesses that aren't capital-intensive.  So it's a good option for something like consulting firms or law firms, where there aren't really any up-front costs that need to be paid off over time.

But imagine you invent a new type of power plant.  Many plants benefit from economies of scale, and also take 10-20 years to pay off their up-front construction costs.   How could a co-op finance building even a single plant?  Similar arguments apply to most forms of manufacturing, or really anything that can be mass-produced.
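To put rough numbers on the payback arithmetic (every dollar figure here is hypothetical, chosen only to land in the 10-20 year range mentioned above):

```python
# Toy payback-period calculation for a capital-intensive plant.
# All figures are in $millions and entirely hypothetical.

def payback_years(upfront_cost: float,
                  annual_revenue: float,
                  annual_operating_cost: float) -> float:
    """Years of net operating profit needed to recoup construction costs."""
    net = annual_revenue - annual_operating_cost
    if net <= 0:
        raise ValueError("plant never recoups its construction cost")
    return upfront_cost / net

# e.g. a $2,000M plant earning $250M/yr against $100M/yr in operating
# costs nets $150M/yr, so it takes ~13.3 years to pay off construction -
# far too long to self-finance out of retained earnings.
```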

Even in the case where organic growth is sustainable, it'll usually have to compete with much faster inorganic, financed growth.