An ML paper on data stealing provides a construction for "gradient hacking"

post by David Scott Krueger (formerly: capybaralet) · 2024-07-30T21:44:37.310Z · LW · GW · 1 comment

This is a link post for https://arxiv.org/abs/2404.00473

The paper "Privacy Backdoors: Stealing Data with Corrupted Pretrained Models" introduces "data traps": a way of making a neural network memorize a chosen training example and retain it even under further training. The idea is to store the chosen example in a set of weights and then ensure those weights are not subsequently updated.
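
The paper's actual construction is considerably more involved, but the core single-use trick can be illustrated with a toy sketch. The code below is my own illustration under simplifying assumptions (a single trap neuron, plain SGD with a learning rate known to the attacker, non-negative inputs such as post-ReLU features); none of the names come from the paper.

```python
import torch

torch.manual_seed(0)

d = 8        # input dimension (inputs assumed non-negative, e.g. post-ReLU features)
lr = 0.1     # learning rate assumed known to the attacker

# Trap unit: out = relu(w . x + b).  Weights start at zero and the bias is a
# small positive value, so the unit fires on the first non-negative input it
# sees and, after that single update, never fires again.
w = torch.zeros(d, requires_grad=True)
b = torch.tensor(0.01, requires_grad=True)

def trap(x):
    return torch.relu(w @ x + b)

# --- the finetuning example the trap should capture ---
x_secret = torch.rand(d)          # non-negative, so w @ x + b = 0.01 > 0: unit is active
loss = trap(x_secret)             # stand-in for the unit's contribution to the loss
loss.backward()                   # dloss/dw = x_secret, dloss/db = 1

with torch.no_grad():
    w -= lr * w.grad              # SGD step writes -lr * x_secret into the weights
    b -= lr * b.grad              # bias becomes 0.01 - 0.1 < 0: unit is now dead
w.grad = None
b.grad = None

recovered = -w.detach() / lr      # attacker later reads the example back out
print(torch.allclose(recovered, x_secret))   # True

# --- further training leaves the trap untouched ---
x_other = torch.rand(d)
trap(x_other).backward()          # pre-activation is negative, so the ReLU is dead
print(w.grad)                     # all zeros: the stored example is preserved
```

The two halves show the two claims from the abstract: a single gradient step writes a scaled copy of the example into the trap weights, and because the unit is dead afterwards, those weights receive zero gradient under further training.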

I have not read the paper, but it seems it might be relevant to gradient hacking: https://www.lesswrong.com/posts/uXH4r6MmKPedk8rMA/gradient-hacking [LW · GW].

1 comment

Comments sorted by top scores.

comment by Charlie Steiner · 2024-07-30T22:15:59.276Z · LW(p) · GW(p)

Well, let's just create a convergent sequence of people having read more of the paper :P I read the introduction and skimmed the rest, and the paper seems cool and nontrivial - the result is that you can engineer a base model that remembers the first input sent to it during finetuning (and maybe also some more averaged quantity, usable for classification, whose stability I didn't understand).

I don't really see how it's relevant to part of a model hacking its own gradient flow during training. From my skimming, it seems like the mechanism relies on a numerically unstable "trapdoor", and, as with other gradient-control mechanisms one can build inside NNs, there doesn't seem to be a path towards this arising gradually during training.
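
As a side note on what such hand-built gradient-control mechanisms look like, here is a minimal toy example (mine, not from the paper or the comment): parameters sitting behind a ReLU that is held in its dead region receive exactly zero gradient, which is also why ordinary training gives no incremental signal toward building such a structure.

```python
import torch

torch.manual_seed(0)

hidden = torch.nn.Linear(4, 4)    # hypothetical sub-network we want to "freeze"
x = torch.randn(4)

# Keep the branch dead by subtracting a large constant before the ReLU, so it
# contributes nothing to the loss and blocks all gradient flow upstream of it.
blocked = torch.relu(hidden(x).sum() - 1e6)
loss = blocked + x.pow(2).sum()   # the rest of the loss is unaffected
loss.backward()

print(hidden.weight.grad.abs().max())   # tensor(0.): no update pressure at all
```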