Gradient hacking

post by evhub · 2019-10-16T00:53:00.735Z · score: 50 (16 votes) · LW · GW · 13 comments

"Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned [AF · GW] mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way. In Risks from Learned Optimization, we included the following footnote [AF · GW] to reflect this possibility:

Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there to be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched.

Outside of that footnote, however, there hasn't been much discussion of this phenomenon, and I think that's a shame, as it's definitely something worth looking into.

For example, here's a gradient hacking story that I'm particularly concerned about. Suppose you're training a model and it becomes deceptive. You're not worried about that, though, because you have sufficient transparency tools that you're confident you can detect any deceptive cognition it might be doing. If it's thinking about how to deceive you, your transparency tools will pick that up, and you'll know to throw out that model and try again. However, your transparency tools are quite expensive, so you can only afford to run them once, which you're planning on doing after training before you deploy your model. If your model is deceptive, though, then it might know all of that—in particular, it might know that if it stays deceptive, it'll get caught.

However, there is still something it can do in that situation to be able to defect against you and cause whatever it wants to happen during deployment. That is, it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. Perhaps the new model exhibits the deceptive behavior not because it's trying to act deceptive, but just because it's a hard-coded part of its algorithm, or because it has some bad proxy, or any other defect that might be more difficult to detect than outright deception.

How might such a deceptive model accomplish such a feat of gradient hacking? Well, I don't think it's actually that difficult once your deceptive model comes up with it as a strategy. First, note that since gradient descent is a local optimization process, it doesn't have to somehow change what the global minima might be, or even what the local minima are—it just has to create a ridge in the loss landscape that will push the gradient in the direction it wants.

The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn't. Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won't actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
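As a minimal numerical sketch of this (the loss function, the parameter names, and the factor of 100 here are all made up for illustration, not anything from the original paper), consider a loss with such a hard-coded "fail hard if the objective moves" term. Because gradient descent computes partial derivatives with the other parameters held fixed, the objective parameter sees no gradient at the defended value, while any perturbation produces a steep restoring gradient:

```python
def loss(theta_obj, theta_other, target_obj=1.0):
    # Ordinary task loss that gradient descent is trying to minimize.
    task_loss = (theta_other - 3.0) ** 2
    # Hypothetical "fail hard" term: the model blows up its own loss
    # whenever its objective parameter drifts from the value it wants kept.
    brittleness = 100.0 * (theta_obj - target_obj) ** 2
    return task_loss + brittleness

def grad(f, x, y, eps=1e-5):
    # Numerical partial derivatives: each coordinate is varied with the
    # other held fixed, just as gradient descent does.
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dfdx, dfdy

# At the model's preferred objective value, there is no gradient pressure
# on the objective, while the task parameter still gets trained...
g_obj, g_other = grad(loss, 1.0, 0.0)
print(g_obj, g_other)  # ~0 on the objective, nonzero (-6) on the task

# ...whereas any perturbation of the objective sees a steep restoring slope.
g_obj_perturbed, _ = grad(loss, 1.1, 0.0)
print(g_obj_perturbed)  # large (20): points straight back toward target_obj
```

This is the "ridge in the loss landscape" in miniature: the check never fires during training, so it costs nothing, but it shapes the gradients around the current point.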

The case of a deceptive model trying to erase evidence of its deception is not quite as straightforward, though I think it would still be possible to do something quite similar in that situation as well. This is especially true if the model has the ability to read and write to some sort of state (e.g. a transformer which can attend to its previous output). For example, it could write out a pre-commitment to follow the hard-coded policy and then just have a policy of always following any pre-commitments it finds such that it never thinks about the deception that led it to that policy in the first place.

So, that's the problem—how do we solve it? Well, there's a simple solution in the case above, which is just to run your transparency tools throughout training—not just at the end—so that you can catch the deception as soon as it happens. I think this is the right solution,[1] but it's worth noting that this adds an additional desideratum for your transparency tools, which is that they have to be cheap enough to run during the entire training process. That might not be that bad, however, if you don't do very many gradient descent update steps, and instead just use massive models and huge batch sizes—and there is some evidence that such a procedure is actually the most compute-efficient way to train neural networks. That being said, I think there are some transparency techniques which lend themselves better to being applied across training than others. Thus, the possibility of gradient hacking suggests that we should focus more on those techniques which allow us to not just understand a single model, but also the entire history of how that model was produced.[2]

  1. This is one of the reasons I'm interested in optimization provenance [AF · GW], for example. ↩︎

  2. As an example of a technique which I think translates well to this domain, one of the OpenAI clarity team's current major research thrusts is "model diffing," which is about trying to understand exactly how two similar models differ from each other. Compared to other approaches that are just focused on understanding a single model, model diffing could be quite useful in allowing one to understand the entirety of the training process, since you could diff the model at each gradient descent update step. ↩︎


Comments sorted by top scores.

comment by ryan_b · 2019-10-17T17:10:56.810Z · score: 9 (3 votes) · LW · GW

It occurs to me that despite my lack of expertise in alignment, I am really enjoying the mesa optimization conversation because it does a good job of contextualizing machine learning performance. This is good even for narrow applications.

comment by Peter Hroššo (peter-hrosso) · 2019-10-16T10:14:03.144Z · score: 6 (4 votes) · LW · GW

If your model is deceptive, though, then it might know all of that

Could you please describe your intuition behind how the model could know the meta-optimizer is going to perform checks on deceptive behavior?

comment by evhub · 2019-10-16T18:16:16.841Z · score: 10 (5 votes) · LW · GW

Sure. My best guess for how we will train AGI is via giving it access to huge amounts of data—e.g. every outbound Reddit link as in GPT-2. Given that sort of a massive, rich dataset, I think that for a model to determine the particular training procedure that it's undergoing won't be very difficult. If it just reads the Wikipedia pages on ML, gradient descent, etc. and some arXiv papers on transparency techniques, for example, then that should give it everything it needs to know.

comment by Gurkenglas · 2019-10-16T16:04:20.043Z · score: 6 (4 votes) · LW · GW

Because that's what it would do if it were us. Kerckhoffs's principle says we should not assume our security mechanisms are secret. One exception would be if you could define three indistinguishable contexts that the AI might find itself in, then select randomly between them - then you may assume that the AI also doesn't know.

comment by ofer · 2019-10-16T04:27:26.074Z · score: 6 (4 votes) · LW · GW

I think this post describes an extremely important problem and research directions, and I hope a lot more research and thought goes into this!

ETA: Unless this problem is resolved, I don't see how any AI alignment approach that involves using future ML—that looks like contemporary ML but at an arbitrarily large scale—could be safe.

comment by evhub · 2019-10-16T18:08:16.406Z · score: 3 (2 votes) · LW · GW

Unless this problem is resolved, I don't see how any AI alignment approach that involves using future ML—that looks like contemporary ML but at an arbitrarily large scale—could be safe.

I think that's a bit extreme, or at least misplaced. Gradient-hacking is just something that makes catching deceptive alignment more difficult. Deceptive alignment is the real problem: if you can prevent deceptive alignment, then you can prevent gradient hacking. And I don't think it's impossible to catch deceptive alignment in something that looks similar to contemporary ML—or at least if it is impossible then I don't think that's clear yet. I mentioned some of the ML transparency approaches I'm excited about in this post, though really for a full treatment of that problem see "Relaxed adversarial training for inner alignment [AF · GW]."

comment by rohinmshah · 2019-10-29T01:18:52.372Z · score: 4 (2 votes) · LW · GW

Fwiw, I ran an experiment that was similarly inspired. It was a year and a half ago, so I might get the details a bit wrong. The goal was to train a neural net to predict MNIST digits, except to always misclassify a 3 as an 8 (or something like that), and also make sure the gradients on 3s and 8s were zero. The "hope" was that if you finetuned on real MNIST, the gradients for fixing the problem would be really small, and so the buggy behavior would persist.

The result of the experiment was that it did not work, and the finetuning was still able to fix the bad model, though I didn't try very hard to get it to work.
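That setup can be sketched in miniature with a linear model. Everything below (the 2-D stand-in "classes", the model, the hyperparameters) is illustrative substitute code, not the original experiment; it just shows the two phases of installing a hard-coded misclassification and then finetuning on clean labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three well-separated 2-D clusters stand in for digit classes.
means = np.array([[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]])
X = np.concatenate([rng.normal(m, 1.0, size=(100, 2)) for m in means])
X = np.hstack([X, np.ones((len(X), 1))])  # bias feature
y_true = np.repeat(np.arange(3), 100)
# The installed "bug": class 0 is always labeled as class 2,
# analogous to classifying every 3 as an 8.
y_buggy = np.where(y_true == 0, 2, y_true)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(W, y, steps=500, lr=0.1):
    # Plain full-batch gradient descent on softmax cross-entropy.
    onehot = np.eye(3)[y]
    for _ in range(steps):
        err = softmax(X @ W) - onehot
        W = W - lr * (X.T @ err) / len(y)
    return W

def accuracy(W, y):
    return float((softmax(X @ W).argmax(axis=1) == y).mean())

W = train(np.zeros((3, 3)), y_buggy)  # phase 1: train the bug in
buggy_acc = accuracy(W, y_true)       # ~2/3: class 0 gets misclassified
W = train(W, y_true)                  # phase 2: finetune on real labels
fixed_acc = accuracy(W, y_true)       # finetuning repairs the bug
print(buggy_acc, fixed_acc)
```

Consistent with the reported result, the finetuning gradients on the mislabeled class are large rather than zero here, so the bug gets fixed; the hard part of the experiment is engineering the model so those gradients actually vanish.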

comment by rohinmshah · 2019-10-29T01:13:45.743Z · score: 4 (2 votes) · LW · GW

I'd be interested in work that further sketches out a scenario in which this could occur. Some particular details that would be interesting:

  • Does the agent need to have the ability to change the weights in the neural net that it is implemented in? If so, how does it get that ability?
  • How does the agent deal with the fact that the outer optimization algorithm will force it to explore? Or does that not matter?
  • How do we deal with the fact that gradients mostly don't account for counterfactual behavior? Things like "Fail hard if the objective is changed" don't work as well if the gradients can't tell that changing the objective leads to failure. That said, maybe gradients can tell that that happens, depending on the setup.
  • Why can't the gradient descent get rid of the computation that decides to perform gradient hacking, or repurpose it for something more useful?

comment by evhub · 2019-10-29T21:37:48.671Z · score: 5 (3 votes) · LW · GW

Does the agent need to have the ability to change the weights in the neural net that it is implemented in? If so, how does it get that ability?

No, at least not in the way that I'm imagining this working. In fact, I wouldn't really call that gradient-hacking anymore (maybe it's just normal hacking at that point?).

For your other points, I agree that they seem like interesting directions to poke on for figuring out whether something like this works or not.

comment by rohinmshah · 2019-10-29T01:20:31.604Z · score: 4 (2 votes) · LW · GW

Planned summary:

This post calls attention to the problem of **gradient hacking**, where a powerful agent being trained by gradient descent could structure its computation in such a way that it causes its gradients to update it in some particular way. For example, a mesa optimizer could structure its computation to first check whether its objective has been tampered with, and if so to fail catastrophically, so that the gradients tend to point away from tampering with the objective.

Planned opinion:

I'd be interested in work that further sketches out a scenario in which this could occur. I wrote about some particular details in this comment [AF · GW].

comment by ofer · 2019-11-02T16:42:33.622Z · score: 3 (2 votes) · LW · GW

Why can't the gradient descent get rid of the computation that decides to perform gradient hacking, or repurpose it for something more useful?

Gradient descent is a very simple algorithm. It only "gets rid" of some piece of logic when doing so is the result of updating the parameters in the direction of the gradient. In the scenario of gradient hacking, we might end up with a model that maliciously prevents gradient descent from, say, changing some parameter θ, by being a model that outputs a very incorrect value if θ is even slightly different from the desired value.
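To make that concrete (a one-line hypothetical loss, nothing from the post): as long as the check never actually fires, gradient descent also gets no signal pushing it to remove the check itself, because the partial derivative with respect to the check's "strength" is zero at the point the model is defending:

```python
def hacked_loss(theta, c):
    # c scales a hypothetical "fail hard" check that punishes any
    # movement of the objective parameter theta away from 1.0.
    return c * (theta - 1.0) ** 2

eps = 1e-6
theta, c = 1.0, 100.0
# Partial derivative with respect to c, at the defended point theta = 1.0:
d_loss_d_c = (hacked_loss(theta, c + eps) - hacked_loss(theta, c - eps)) / (2 * eps)
print(d_loss_d_c)  # 0.0 - no gradient ever pushes the check's strength c toward zero
```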

comment by Charlie Steiner · 2019-10-16T23:18:36.635Z · score: 2 (1 votes) · LW · GW

If the inner optimizer only affects the world by passing predictions to the outer model, the most obvious trick is to assign artificially negative outcomes to states you want to avoid (e.g. states which the inner optimizer can predict would update the outer model in bad ways), which then never get checked by virtue of being too scary to try. What are the other obvious hacks?

I guess if you sufficiently control the predictions, you can just throw off the pretense and send the outer model the prediction that giving you a channel to the outside would be a great idea.

comment by Gurkenglas · 2019-10-17T11:08:25.085Z · score: 1 (1 votes) · LW · GW

That obvious trick relies on the fact that we will only verify its predictions for strategies that it recommends. Here's a protocol that doesn't fall for it: a known number of gems are distributed among boxes; we can only open one box and want to find as many gems as possible. Ask for the distribution, select one gem at random from the answer, and open its box. For every gem the AI reports in the wrong box, selecting that gem reveals the deception.
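A quick simulation of that protocol (the numbers of gems and boxes are arbitrary, and for simplicity each gem is assumed to sit in its own box):

```python
import random

def audit(true_boxes, reported_boxes, rng):
    # Pick one gem uniformly at random from the report and open the box
    # the report claims it is in; a mismatch reveals the deception.
    i = rng.randrange(len(reported_boxes))
    return reported_boxes[i] != true_boxes[i]

rng = random.Random(0)
true_boxes = [0, 1, 2, 3, 4]   # 5 gems; gem i is actually in box i
report = [0, 1, 2, 3, 9]       # the AI misreports one gem's location
trials = 10000
caught = sum(audit(true_boxes, report, rng) for _ in range(trials)) / trials
print(caught)  # ~0.2: each misplaced gem adds a 1-in-5 chance of being caught
```

So a deceiver that hides k of the G gems elsewhere is exposed with probability k/G per audit, which is what removes the incentive to misreport.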