Safety tax functions

post by owencb · 2024-10-20T14:08:38.099Z · LW · GW · 0 comments

This is a link post for https://strangecities.substack.com/p/safety-tax-functions

Contents

  Paradigm cases
    Once-and-done problems
    Ongoing problems
  Safety tax functions
    (Re)parametrizing
    Safety tax contours
  Possible extensions to the model
None
No comments

In thinking about the future of AI safety, it sometimes seems helpful to think in terms of “finding a solution to the alignment problem”. At other times, it seems helpful to think in terms of “paying a safety tax”. On the face of it, these two concepts seem hard to reconcile: one of them is about the moment where we have a solution and can stop investing resources into safety; the other presumes something like a constant fraction of resources going into safety, forever.

This is a post for concept geeks. I’ll explain that by allowing safety tax to vary with capability level for a hypothetical dangerous technology, we can represent both of the above dynamics as special cases of a more general class of functions. While I don’t think I’ve got to the point of having these concepts fully crisp, they seem sharp enough that I’ve been finding them helpful.

In a follow-up post, I’ll look more specifically at my picture of the safety tax landscape for AI — arguing that it’s unusually dynamic (in the sense that specific advances in AI might change the available tradeoffs), and that the peak safety tax requirements probably come in the period when AI is mildly- to moderately-superintelligent.

Paradigm cases

Suppose we have a potentially dangerous technology. The danger accrues as the technology gets more powerful; it needs sufficient investment in safety to keep it safe. How much? It depends on the nature of the technology and the potential problem.

Once-and-done problems

Some technological problems have “solutions”, in a pretty clean sense. After you’ve identified a solution, you’re basically good to keep on applying that solution whenever the problem comes up, and the cost of applying it is trivial compared to the total cost of the technology. You can solve the problem of lossy signals with error-correcting codes. You can solve the problem of buildings burning down by constructing things out of fire-resistant materials, even for much larger buildings than you originally thought of. Let’s call these once-and-done problems.

Ongoing problems

In contrast, some problems scale as the fundamentals of the underlying technology change, and you need to keep up with that, rather than having a single clean solution. Think computer security, where further advances in the underlying infrastructure open up more vectors for attack, so defence is a moving target. Let’s call these ongoing problems.

Safety tax functions

We can represent these cases diagrammatically. Suppose that for a given capability level C of a potentially-dangerous technology, we need a safety level S to avert the risk (these could be thought of as tracking the total investments made in capabilities and safety). Then when we graph S as a function of C, in the case of a once-and-done problem, at some point (i.e. after the solution is found) it plateaus, whereas in the case of an ongoing safety problem it keeps climbing forever:

However, this binary distinction is pretty simplistic. What about the case where a major problem is solved, but there’s a smaller ongoing investment needed forever?

This has features of both a once-and-done safety problem and of an ongoing safety problem. And it’s not some obscure possibility — this kind of behaviour seems pretty generic. Indeed, even the examples we used for the paradigm cases have something of this behaviour!

(Re)parametrizing

So far, we’ve been vague about how C and S are measured. While there might be something like “natural” metrics for C and S, we could also measure them by inputs — i.e. the capability/safety level that you’d expect to achieve with a certain amount of work on improving capability/safety levels.

This has the advantage that both dimensions can be measured in units of a common resource of inputs (e.g. dollars, or researcher-months).[1] So now we can talk about the idea of a “constant safety tax”, meaning a constant ratio between the two:

Technically, at this point the diagram has all the information we will need. But I have some niggles with it:

So, let’s transform to log-log axes:

Now the lines of constant safety tax have become 45-degree diagonals with different offsets. While these could be understood as lines representing the safety tax function for a hypothetical technology, they’re more naturally understood as policy-guiding: e.g. staying below the “10% safety tax” line means that consistently “paying a safety tax” of 10% (i.e. spending 10% of the expenditure on C on S) would be enough to ensure safety.

Given that these lines seem policy-guiding, we’ll make one more adjustment to the graph, by using the safety tax level directly as our vertical axis (so that this is a skewed version of the previous axes; although it’s a new illustrative curve):

How hard is it to coordinate to always keep sufficient investment in safety? Of course it depends in various ways on the difficulty of the safety problem. The main components, in order I guess from most to least important, might be:

And here is a visual representation of those questions:

I’ve talked here about “the peak”, and “the asymptote”. Here are some reasons to think these will exist:

Of course, there could be multiple peaks as the technology passes through different regimes. It’s also possible to have no peak higher than the asymptote (most likely if the safety research was all completed before efficiency research). But I think the typical case will have a highest peak that needs to be surmounted.

Safety tax contours

Of course, in all of these graphs so far we’ve been using an abstraction of “safe” vs “unsafe”. But safety isn’t a binary. There's always some risk, but further investment in safety can drive the risk down. This could mean that rather than a single curve, the underlying landscape has safety contours, something like this:

On this picture, each contour represents a fixed risk tolerance — i.e. it shows the safety investment required to stay within that tolerance level as capabilities scale.[2] Vice-versa, if you expect the world to coordinate to pay a given level of safety tax, you can read across to see which contours it intersects to infer the risk it would imply.

Sometimes we understand a problem pretty well, and being safer is just about having redundancy to accidents. In this case I think typically the contours will run relatively close to each other.

In other cases — such as, perhaps, AI safety — we may have little idea how hard the problem is. In this case our contours would be fairly spread out. Depending on the nature of our uncertainty, and our ability to recognise whether we’ve dealt with the problem at least after making the investments, the contours may narrow as we see how things go; or may stay spread out, if we aren’t getting the information to resolve ambiguities.

Possible extensions to the model

Of course there are a number of other ways in which the details of how things play out may deviate from the simple model I’ve presented here, including:

A deeper dive into the topic might want to look for ways to capture these additional dynamics. Nonetheless, my hope is that this zoomed out conceptual picture may be helpful getting an intuitive grasp of the basic strategic dynamics.

Acknowledgements: Thanks to Raymond Douglas and Tom Davidson for helpful comments.

  1. ^

    It’s still a bit messy — since we’re measuring the total cumulative investment in capabilities and safety, this includes both upfront research cost and marginal implementation cost; but that means that technically the graph depends on how much of each of these we do. For practical economic purposes, the two are probably fairly linked, so I won’t worry about this too much.

  2. ^

    I’m being vague about what a particular risk level means. This is a high-level analysis which should apply to any particular operationalization.

0 comments

Comments sorted by top scores.