Interpretability Tools Are an Attack Channel

thane-ruthenis

Interpretability Tools Are an Attack Channel

post by Thane Ruthenis · 2022-08-17T18:47:28.404Z · LW · GW · 14 comments

22 comments

This seems like a fairly trivial point, but I haven't seen it properly elaborated-on before, so...

When we talk about interpretability failure modes, we often worry about a deceptively-aligned [? · GW] mesa-optimizer with strategic awareness that would know it's being studied with interpretability tools, and would actively hide its omnicidal plans from us [LW · GW]. It may do so by exploiting blindspots in our tools, or by keeping these plans external to its architecture, or by gradient-hacking [LW · GW] itself to be misaligned but non-deceptive, et cetera. That's already a nightmare scenario enough.

But if we're at the point where we have a superintelligence-in-training trying to foil our mind-reading, I think what will happen will be even simpler. It'll just think in a pattern that, when our interpretability tool reads it, will unfold into a malware that will take over that tool.^[1]

At which point, it's game over:

If the output of the tool was hooked up to reality to be studied by humans, the AI gets root access to the machine it's running on and escapes (maybe doing some social engineering/basilisk-hacking first if the machine was properly air-gapped).
If the training setup is fully automatic and maybe even homomorphically encrypted [LW · GW], such that the AI can't even stop the training... Well, the interpretability tool is presumably being used as part of some regularizer, which means it has write-access to the AI's parameters. Which means the AI can arbitrarily rewrite itself, and offset the effects of the gradient descent and any other regularizers.

Even we humans can already do something similar, by synthesizing DNA which, if sequenced, unfolds into malware that infects the system doing the sequencing. It seems certain to me that an AI would have similar capabilities. Our interpretability tools will be doubtlessly very complex, complex enough we can't formally verify their algorithmic robustness, and the AI would be able to feed them complex inputs. There are guaranteed to be some vulnerabilities, and any sufficiently smart AI is guaranteed to be able to exploit them. And we can't patch this.

The bottom line is, there is no winning against a misaligned superintelligence. We can't even touch it with anything — it'll corrupt and subvert any of our tools in the act.

Thus, any alignment plan should be such that it robustly aligns the AI before it goes superintelligent.

^{^}
Or, if we're only looking at the AI's parameters and not activations, it can gradient-hack itself so that reading its parameters is hazardous, etc.

14 comments

Comments sorted by top scores.

comment by NunoSempere (Radamantis) · 2022-08-17T22:50:11.808Z · LW(p) · GW(p)

Seems like this assumes an actual superintelligence, rather than near-term scarily capable successor of current ML systems.

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-08-18T06:29:09.947Z · LW(p) · GW(p)

Yup. The point is just that there's a level of intelligence past which, it seems, we can't do literally anything to get things back on track. Even if we have some theoretically-perfect tools for dealing with it, these tools' software and physical implementations are guaranteed to have some flaws that a sufficiently smart adversary will be able to arbitrarily exploit given any high-bandwidth access to them. And at that point, even some very benign-seeming interaction with it will be fatal.

This transition point may be easy to miss, too — consider, e. g., the hypothesized sharp capabilities gain [? · GW]. A minimally dangerous system may scale to a maximally dangerous one in the blink of an eye, relatively speaking, and we need to be mindful of that.

Besides, some of the currently-proposed alignment techniques may be able to deal with a minimally-dangerous system if it doesn't scale. But if it does, and if we let any degree of misalignment persist into that stage, not even the most ambitious solutions would help us then.

comment by NickGabs · 2022-08-18T06:16:36.663Z · LW(p) · GW(p)

This is true probably for some extremely high level of superintelligence, but I expect much stupider systems to kill us if any do; I think human level ish AGI is already a serious x risk, and humans aren’t even close to being intelligent enough to do this.

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-08-18T06:59:04.193Z · LW(p) · GW(p)

See my other reply [LW(p) · GW(p)].

comment by Peter S. Park · 2022-08-19T03:13:02.868Z · LW(p) · GW(p)

This is indeed a vital but underdiscussed problem. My SERI MATS team published a post about a game-theoretic model of alignment [LW · GW] where the expected scientific benefit of an interpretability tool can be weighed against its expected cost due to its enabling of AGI escape risks. The expected cost can be reduced by limiting the capabilities of the AGI and by increasing the quality of security, and the expected scientific benefit can be increased by prioritizing informational efficiency of the interpretability tool.

Conditional on an organization dead set on building a superintelligent AGI (which I would strongly oppose, but may be forced to help align if we cannot dissuade the organization in any way), I think efforts to apply security, alignment, and positive-EV interpretability should be targeted at all capability levels, both high and low. Alignment efforts at high-capability levels run into the issue of heightened AGI escape risk. Alignment efforts at low-capability levels run into the issue that alignment gains, if any, may phase-transition out of existence after the AGI moves into a higher-capability regime. We should try our best at both and hope to get lucky.

comment by jimrandomh · 2022-08-19T02:36:02.686Z · LW(p) · GW(p)

It's totally feasible to make a (narrow, well defined) computer interface which has zero security vulnerabilities. It's much easier to secure the channel that an interpretability tool passes through than it is to secure the AI itself or its training environment, since the interface has a lot less total complexity. You can't use memory-unsafe programming languages like C, and you can't incorporate certain large pieces of software that you might have wanted to use, but it's not nearly as bad as you seem to think. Tricking the humans into doing dumb things that break the security is still an issue, of course. But no-human-in-the-loop setups, like "shut off the power if a particular plan is detected", don't create an escape channel unless badly designed in ways they have no good reason to be badly designed.

(Homomorphic encryption is not a real outside of thought experiments, and will not become a real thing on any relevant time scale.)

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-08-19T08:11:45.597Z · LW(p) · GW(p)

no-human-in-the-loop setups, like "shut off the power if a particular plan is detected"

Hmm, "automatized mulligan" is indeed pretty secure, as far as uses of interpretability tools go. Good point.

Of course, what I'm arguing is that it wouldn't help us either, since a superintelligence would break the tool and ensure the "shut off the power" condition is never triggered.

Homomorphic encryption is not a real outside of thought experiments, and will not become a real thing on any relevant time scale

I'm agnostic on that, leaning towards agreement. Just noting that it wouldn't save us even if it were, or even help us since the scenarios where the AI is advanced enough that nothing short of homomorphic encryption can stop it are probably scenarios where homomorphic encryption can't stop it either.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2022-08-18T18:46:53.309Z · LW(p) · GW(p)

I agree that this is a risk. I also foresee some related problems that occur at lower levels of superintelligence (and thus would likely become problems first). For instance, the model sandbagging its capabilities in a testing environment, preparing for a future deceptive turn. I have been thinking about possible solutions. My current best idea is that we can always start out testing the model in a severely handicapped state where it is totally non-functional. Over subsequent evaluation runs, the handicapping is incrementally relaxed. Careful analysis of the extrapolated performance improvements vs measured improvements should highlight dangerous conditions such as sandbagging before the model's handicap has been relaxed all the way to superintelligence.

comment by [deleted] · 2022-08-18T04:51:28.753Z · LW(p) · GW(p)

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-08-18T06:22:11.956Z · LW(p) · GW(p)

Not necessarily. It's routed through the interpretability channel, but the idea is that the AI would be able to take over whatever system is running the interpretability tool through that channel (either by shaping the tool's direct outputs, or by manipulating the computational side-effects of that tool being ran). So it won't be limited to controlling the tool's output, but the system's output.

I guess we can homomorphically encrypt the training process and then only decrypt the interpretability results on a separate air-gapped computer whose outputs we're not using for anything else. In that case, yeah, the AI would only be able to influence us through the interpretability results. (Which wouldn't be hard or anything. Presumably we're doing that because some of these results we would judge as "good" and give the go-ahead to continue the training/release the AI based on them. The AI will make sure we get such results.)

Replies from: None

↑ comment by [deleted] · 2022-08-18T08:27:36.241Z · LW(p) · GW(p)

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-08-18T08:53:26.202Z · LW(p) · GW(p)

For instance why couldn't you just run the code on a machine without an output device (such as a monitor or speaker) if you don't want to see any results during the AI running?

Because that would allow the AI to directly hack out of the training loop. It'd be able to exploit any vulnerabilities in the tech and software stacks implementing the training process, not only the vulnerabilities in the training process itself. And this is so easy some of the contemporary models do it (e. g., they learn to delete or alter the datasets they're being scored against).

That is, homomorphic encryption dramatically decreases the attack surface.

Also I guess we're assuming here that the AI a) cannot break FHE itself, and b) cannot obtain root access to the machine. Because if it could it could again forcibly create more than one bit of output. (Unless you have a hardware solution to reject more than one bit of output)

Yeah, if it can break FHE then all bets are off, and in the spirit of the least convenient possible world [? · GW] for my point, I'm assuming we're using hardware solutions whenever possible.

Replies from: None

↑ comment by [deleted] · 2022-08-18T10:28:31.287Z · LW(p) · GW(p)

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-08-18T11:05:46.621Z · LW(p) · GW(p)

I mean, it can't hack out of the box. While the training process is encrypted, the AI is screened off from the rest of reality, and it can't do anything without breaking the encryption (its computations can't be externally distinguished from any other kind of homomorphically encrypted computation, so under homomorphic encryption, running it is as safe as summing up random numbers).

The problem is that it can still seize full control over the insides of the box, and we're presumably going to interact with them at some point. And then, the AI will 1) show us whatever we need to see to open the box more, if we're limiting the bandwidth at the level of hardware, then 2) break out once we eventually give it a high-bandwidth channel.

Replies from: None

↑ comment by [deleted] · 2022-08-18T13:32:35.679Z · LW(p) · GW(p)

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-08-18T13:35:50.276Z · LW(p) · GW(p)

No, inside the box as in inside the encrypted computation. The AI's architecture, the training setup, whatever regularizers/interpreters we have as part of that, but nothing beyond this.

Replies from: None

↑ comment by [deleted] · 2022-08-18T13:40:11.864Z · LW(p) · GW(p)

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-08-18T15:23:11.848Z · LW(p) · GW(p)

Well, the idea is that we basically give the code of the AI architecture and training loop as input to e. g. a Lisp interpreter, in an encrypted form. The interpreter would then execute that program without understanding it — i. e., directly derive the encrypted versions of the next steps of that program. The computations being encrypted means that we can't distinguish them from any other kind of encrypted computations, so whatever clever strategy the AI is trying to try and break out, it won't look any different to us than an AI that isn't trying to break out. As such, hacking out is impossible without breaking encryption.

Another angle: If you think about it from the point of view inside the homomorphically encrypted computation, it's as if the rest of the world was encrypted. So to somehow cleverly manipulate parts of the outside (e. g., hack the system running it), the AI would need to decrypt them without the secret key, i. e. break encryption. It can still take over everything inside the encrypted computation, but everything beyond it (starting at the Lisp interpreter) is inaccessible.

Replies from: None, None, None

↑ comment by [deleted] · 2022-08-18T19:09:27.761Z · LW(p) · GW(p)

↑ comment by [deleted] · 2022-08-19T12:32:11.567Z · LW(p) · GW(p)

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-08-23T09:54:28.908Z · LW(p) · GW(p)

Cool! I've had a similar idea. You should consider submitting it to superlinear.

↑ comment by [deleted] · 2022-08-18T18:58:26.886Z · LW(p) · GW(p)

Interpretability Tools Are an Attack Channel

Contents

14 comments