Epistemic Strategies of Safety-Capabilities Tradeoffs

post by adamShimi · 2021-10-22T08:22:51.169Z

Contents

  Introduction: Epistemic Strategies Redux
  Defining Safety-Capabilities Tradeoffs
  Showing the Inevitability of Safety-Capabilities Tradeoffs
    Summary
  Breaking the Inevitability of Safety-Capabilities Tradeoffs
      Arguing that a class of tradeoffs cannot be avoided in alignment proposals

Introduction: Epistemic Strategies Redux

This post examines the epistemic strategies of Steve Byrnes’ Safety-capabilities tradeoff dials are inevitable in AGI [AF · GW].

(If you want to skim this post, just read the Summary subsection, which displays the epistemic strategy as a design pattern.)

I introduced epistemic strategies in a recent post [AF · GW], but didn’t define them except as the “ways of producing” knowledge that are used in a piece of research. If we consider a post or paper as a computer program outputting (producing) knowledge about alignment, epistemic strategies are the underlying algorithms or, even more abstractly, the design patterns.

An example of an epistemic strategy, common in the natural sciences (and beyond), is:

More than just laying out some abstract recipe, the analysis serves to understand how each step is done, whether it makes sense, and how each step (and the whole strategy) might fail. Just as with a design pattern or an algorithm, it matters tremendously to know when to apply an epistemic strategy, when to avoid it, and which subtleties to be aware of.

Laying this underlying structure bare matters in three ways:

Thanks to Steve Byrnes for feedback on a draft of this post.

Defining Safety-Capabilities Tradeoffs

What sort of knowledge is Steve attempting to create in his post? He sets out explicitly to show that any alignment proposal must deal with one or more tradeoffs between safety and capabilities (which he calls safety-capabilities tradeoff dials).

I will argue that the discussion should be framed as “Just how problematic is this dial? How do we minimize its negative impact?”, not “This particular approach has a dial, so it’s automatically doomed. Let’s throw it out and talk about something else instead.”

This is in opposition to claims that some alignment proposals should be deemed less promising or insufficient because they would include such tradeoffs.

(Recent examples of the latter attitude, at least arguably: here [AF · GW], here [AF(p) · GW(p)].)

A good way of framing the difference between safety and capabilities is that safety is about worst-case reasoning (making the worst outcomes less bad), whereas capabilities is about best-case or average-case reasoning (making the plans the AI comes up with more impressive). Nothing forbids a solution with great worst-case, average-case, and best-case guarantees; yet it’s not incoherent to imagine a tradeoff between not failing too badly and succeeding as impressively as possible.
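To make this framing concrete, here is a minimal sketch (my own toy illustration, not from Steve’s post; the policies and their scores are invented) showing how ranking the same two options by a worst-case criterion versus an average-case criterion can flip:

```python
# Toy illustration: the same two policies, ranked by worst-case vs.
# average-case criteria, come out in opposite orders. That reversal is
# the shape of a safety-capabilities tradeoff.

# Hypothetical outcome scores across possible deployment scenarios.
cautious_policy = [4, 5, 5, 6]    # never great, never terrible
ambitious_policy = [1, 6, 8, 9]   # usually better, but a bad worst case

def worst_case(outcomes):
    """Safety-style criterion: judge a policy by its worst outcome."""
    return min(outcomes)

def average_case(outcomes):
    """Capabilities-style criterion: judge a policy by its mean outcome."""
    return sum(outcomes) / len(outcomes)

# The safety criterion prefers the cautious policy...
assert worst_case(cautious_policy) > worst_case(ambitious_policy)
# ...while the capabilities criterion prefers the ambitious one.
assert average_case(ambitious_policy) > average_case(cautious_policy)
```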

The problem, then, is that if such tradeoffs exist, people will differ in their incentives, probability estimates, and preferences, such that not everyone will agree on where to stand on the tradeoff. Given that safety is restrictive, we should expect people favoring capabilities over safety to produce more impressive and marketable systems, until existential risks kick in. Which is bad.

Showing the Inevitability of Safety-Capabilities Tradeoffs

Steve claims that any alignment proposal must include some safety-capabilities tradeoffs. What I’m interested in here is how he argues for his point, and whether his epistemic strategy makes sense.

Unfortunately, his section on exactly that is confusing. The section is called “Why do I say that these dials are inevitable?” (what we want, right?) and starts with this sentence:

Here are a few examples.

A list of examples sounds like a particularly bad way of showing that something is impossible to avoid. Cherry-picking examples comes to mind as a big risk, and more generally, relying on non-representative examples.

Yet Steve actually makes a decent argument for the inevitability of safety-capabilities tradeoffs, just far too implicitly. His examples are not examples of alignment proposals and their corresponding tradeoffs, but of places where tradeoffs might appear in any alignment proposal.

That at least some of these tradeoffs must emerge in every alignment proposal is the (very) implicit last step of his epistemic strategy. And it’s unfortunately not so much argued for as stated. For example, on testing:

Some amount of sandbox testing would help capabilities, by helping the team better understand how things are going. But there’s an optimal amount of sandbox testing for capabilities, and doing further testing beyond that point is a safety-capabilities tradeoff.
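To see the shape of the quoted claim, here is a toy numerical model (the functions and numbers are invented for illustration, not taken from Steve’s post): capability gains from sandbox testing peak at some finite amount, while safety keeps improving with more testing, so any testing beyond the capability optimum is exactly a dial setting.

```python
# Toy model with invented curves: capability from sandbox testing is
# concave (some testing helps, too much delays and constrains the project),
# while safety improves monotonically (more testing catches more failures).

def capability(testing):
    return -(testing - 5) ** 2  # peaks at testing = 5

def safety(testing):
    return testing  # strictly increasing

# The capability-optimal amount of testing.
optimum = max(range(11), key=capability)
assert optimum == 5

# Testing past that point buys safety at a capability cost: a tradeoff dial.
assert safety(8) > safety(5) and capability(8) < capability(5)
```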

How can we actually argue for this instead of simply stating it? Here I go one step further than the original post (while staying coherent with Steve’s points) by proposing that we adapt how impossibility results are proved in Theoretical Computer Science. Impossibility proofs tend to focus on the potential counterexamples, and get to the gist of why they don’t actually work. This involves the sort of back and forth between trying to create a counterexample and showing why it doesn’t work described by the great Nancy Lynch in her A Hundred Impossibility Proofs for Distributed Computing (yes, there are a hundred results, although many come for free by the same methods):

How does one go about working on an impossibility proof? The first thing to do is to try to avoid solving the problem, by using a reducibility to reduce some other unsolvable problem to it. If this fails, next consider your intuitions about the problem. This might not help much either: in my experience, my intuitions about which way the result will go have been wrong about 50% of the time.

Then it is time to begin the game of playing the positive and negative directions of a proof against each other. My colleagues and I have often worked alternately on one direction and the other, in each case until we got stuck. It is not a good idea to work just on an impossibility result, because there is always the unfortunate possibility that the task you are trying to prove is impossible is in fact possible, and some algorithm may surface.

An interesting interplay often arises when you work alternately on both directions. The limitations you find in designing an algorithm - e.g., the reason a particular algorithm fails - may be generalizable to give a limitation on all algorithms. [...] Conversely, the reasons that mathematical impossibility proof fails can sometimes be exploited to devise counterexample algorithms. 

Although we have no hope of proving Steve’s claims in the near future (given our inability to formalize any of the relevant terms), this approach can be leveraged by looking for what would make a counterexample to each of Steve’s examples.
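As a caricature of this back-and-forth in a setting small enough to formalize, here is a sketch built on a classical toy fact (no linear threshold unit computes XOR; this is standard, and nothing here comes from the alignment setting): every attempt to construct a candidate fails, and because the search over this finite class is exhaustive, the accumulated failures are themselves the impossibility result.

```python
# Sketch of the 'two directions' game in a finite setting: exhaustively
# try to construct an algorithm; if every candidate in the class fails,
# that failure doubles as an impossibility proof over the class.
from itertools import product

def threshold_unit(w1, w2, bias):
    """A candidate 'algorithm': a linear classifier on two binary inputs."""
    return lambda x1, x2: int(w1 * x1 + w2 * x2 + bias > 0)

def computes_xor(f):
    return all(f(x1, x2) == (x1 ^ x2) for x1, x2 in product([0, 1], repeat=2))

# Positive direction: search for a counterexample to the impossibility claim.
weights = range(-3, 4)
candidates = (threshold_unit(w1, w2, b)
              for w1, w2, b in product(weights, repeat=3))

# Negative direction: the exhaustive failure *is* the impossibility result
# (for this class of candidates).
assert not any(computes_xor(f) for f in candidates)
```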

This means we’re looking for cases where there is no tradeoff between safety and capabilities: everyone agrees on what should be done. This amounts to saying that alignment researchers agree that there is nothing more to be done, which means one of two things:

Summary

The epistemic strategy at hand is thus the following:

Breaking the Inevitability of Safety-Capabilities Tradeoffs

Recall that epistemic strategies are design patterns, blueprints: following one helps, but doesn’t ensure that the resulting argument will be correct. Epistemic strategies also highlight where the meat of the reasoning lies, and thus where to focus attention and criticism.

So let’s take the strategy from the Summary and propose ways of breaking it.

Arguing that a class of tradeoffs cannot be avoided in alignment proposals
