Thanks for the feedback and references!
On catastrophic forgetting: our appendix includes a "control" version of ERA that doesn't use gradient routing but is otherwise the same (appendix C, figure 12). This shows that the effect of retain-set fine-tuning is negligible in the absence of gradient routing.
On gradient ascent or similar methods: there are many unlearning methods that don't target or achieve the kind of robust localization and removal that we care about, as mentioned in our discussion of related works and, e.g., in this post. We included RMU as a stand-in for this class, and I personally don't see much value in doing more extensive comparisons there.
On Corrective Unlearning: we weren't aware of other unlearning approaches that consider imperfect labeling, so this is a very helpful reference; thanks! It would be interesting to compare ERA-type methods to these. My concern with fine-tuning methods is that they might not be suitable for robustly removing broader capabilities (like "virology"), as opposed to correcting for small perturbations to the training data.
Thanks for sharing! This is super cool and timely work.
Some thoughts:
- I'm excited about (the formalism of) partial observability as a way to make progress on outer alignment in general. Partial observability seems like a natural way to encode fundamental difficulties with specifying what we (humans) want to a system that has more (or different) information and understands that information better (or differently) than we do. I don't see any reason that the formalism's usefulness would be limited to cases where human evaluators literally lack information, as opposed to simply being limited in their ability to evaluate that information. So, I think this is a very promising line of work.
- Have you considered the connection between partial observability and state aliasing/function approximation? Maybe you could apply your theory to weak-to-strong generalization by considering a weak model as operating under partial observability. Alternatively, by introducing structure to the observations, the function approximation lens might open up new angles of attack on the problem.
- There could be merit to a formalism where the AI and supervisor both act under partial observability, according to different observation functions. This would reflect the fact that humans can make use of data external to the trajectory itself to evaluate behavior. (I sketch what I mean just after this list.)
- I think you're exactly right to consider abstractions of trajectories, but I'm not convinced this needs to be complicated. What if you considered the case where the problem definition includes features of state trajectories on which (known) human utilities are defined, but these features themselves are not always observed? (This is something I'm currently thinking about, as a generalization of the work mentioned in the postscript.)
- Am I correct in my understanding that the role Boltzmann rationality plays in your setup is just to get a reward function out of preference data? If so, that doesn't seem problematic to me (as you also acknowledge). If I understand correctly, it's a somewhat trivial fact that you can still do arbitrarily badly even when your utilities (on states) are exactly known and the task is to select any reward function (on observations) that performs well according to that utility function.[1]
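To make the dual-observability suggestion above a bit more concrete, here is one way it might be set up (my own throwaway notation, not anything from the paper):
$$\mathcal{M} = \langle S, A, T, U,\; \Omega_{\mathrm{AI}}, O_{\mathrm{AI}},\; \Omega_{\mathrm{H}}, O_{\mathrm{H}} \rangle,$$
where the policy conditions on its observations $O_{\mathrm{AI}}(s) \in \Omega_{\mathrm{AI}}$, the supervisor evaluates trajectories only through $O_{\mathrm{H}}(s) \in \Omega_{\mathrm{H}}$ (which could include side information not available to the policy), and $U : S \to \mathbb{R}$ is the supervisor's utility on underlying states. The question is then: when does a reward $R : \Omega_{\mathrm{H}} \to \mathbb{R}$ chosen by the supervisor induce policies that are near-optimal for $U$?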
Again, thanks for the great work. Looking forward to seeing more.
P.S. This summer, my team was thinking about similar formalizations in order to help motivate a new training method. My notes from a lit review read:
I searched for papers that consider the problem of overseeing an AI when you have limited access to observations about the state. This is a modeling assumption intended to (i) encode a practical difficulty with scalable oversight, and (ii) be a "setup" where gradient routing can serve as a "punchline."
All the related papers I've found deal with the problem of specification gaming arising from misspecified proxy rewards, often studied via the lens of "optimization pressure." But that is not the point we want to make. Our point is that if the overseer is limited in the information they have access to (they can't induce a reward signal at arbitrary resolution), it is impossible for them to get a good reward, except in the presence of certain structure.
So, your paper is exactly the kind of thing we (the team working on gradient routing) were looking for. I just didn't find the preprint!
- ^
For readers who aren't the author of the post: it's trivial because you can have two states with different utilities but the same observation. Then there's no way to define a reward on the observation that forces the agent to "prefer" the better state. I think Example D.4 in their appendix is saying the same thing, but I didn't check carefully.
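Spelling that out with throwaway notation (mine, not the paper's): take $S = \{s_1, s_2\}$ with $U(s_1) = 1$, $U(s_2) = 0$, and an observation function with $O(s_1) = O(s_2) = o$. Then any reward defined on observations satisfies
$$R(O(s_1)) = R(O(s_2)) = R(o),$$
so the policy that always reaches $s_2$ is reward-optimal while achieving the worst possible utility.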
Thanks for the thoughtful questions.
Regarding image models: our understanding is that strong regularization is required to split representations for MNIST autoencoding and CIFAR classification because there is a strong inductive bias towards learning features that are common to many classes of images. (In MNIST, 3s are similar to 8s, etc.; in CIFAR, similar edge detectors, etc. will be learned for many classes.) Basically, our learning target is highly unnatural. With our current experimental design, I don't expect this to change with scale, so I'm less excited about investigating the effect of model or dataset size. That said, this dynamic might change if we explored examples with class imbalance (routing only a small fraction of classes and training on the others as normal). I suspect this would reduce the need for regularization, leading to a reduction in alignment tax and perhaps more interesting dynamics with respect to scale. That's an experiment we probably should have run (and still could, but we aren't prioritizing image models right now).
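For what it's worth, here's a toy sketch of the kind of class-imbalanced routing I have in mind, using the detach trick to confine gradients from the routed classes to a small block of the hidden layer (the module and sizes are made up for illustration; this isn't our actual code):

```python
import torch
import torch.nn as nn

class PartiallyRoutedClassifier(nn.Module):
    """Toy classifier whose hidden layer has a small 'routed' block. Examples
    from the routed classes only send gradient into that block; all other
    examples train the full network as normal."""

    def __init__(self, in_dim=784, hidden=256, n_classes=10, routed_dims=32):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, n_classes)
        mask = torch.zeros(hidden)
        mask[:routed_dims] = 1.0
        self.register_buffer("block_mask", mask)  # 1s on the routed block

    def forward(self, x, is_routed):
        # is_routed: (batch,) bool tensor, True for examples from routed classes.
        h = torch.relu(self.fc1(x))
        m = self.block_mask                # (hidden,)
        r = is_routed[:, None].float()     # (batch, 1)
        # Routed examples: gradient flows only through the masked block.
        # Unrouted examples: gradient flows through the entire hidden layer.
        h = r * (m * h + (1 - m) * h.detach()) + (1 - r) * h
        return self.fc2(h)
```

The hope, per the above, is that when only a small fraction of examples have their gradients restricted, the rest of the network trains as usual and less regularization is needed.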
As for localization for unlearning in language models, my personal take is that the idea is there but we don't have the method quite right yet. I think there's a reasonable chance (say, 40%) that we change our configuration a bit and are able to get localization much more stably, and with lower alignment tax both pre- and post-ablation. (If I understand correctly, my colleagues agree that this outcome is plausible but think it's less likely than I do.) If we aren't able to find this methodological improvement, then I don't see a point in scaling. However, if we find it, then I expect scaling will be relatively cheap because, while we will still need to pre-train models, we won't need to do any more hyperparameter tuning than is usual. Of course, whatever method we land on may turn out to have middling performance. In that case, to get a signal on whether this is worth doing, we may need to investigate a realistic unlearning setting, where the model and data are larger, and the forget set is a smaller portion of the training data.
In terms of improvements that we're trying: we're currently thinking about (a) insights we can borrow from mixture-of-experts models, and (b) whether it is better to route only via the edges leaving parameters, rather than via activations; the latter is what we currently do, and is far more aggressive.
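To illustrate the distinction in (b), here is a toy contrast between the two options (helper names are made up; this is not our codebase):

```python
import torch.nn as nn
import torch.nn.functional as F

def route_via_activation(h, routed):
    """Block gradients at the activation: routed examples contribute no gradient
    through h, so this layer's parameters AND everything upstream of it receive
    no update from them (the more aggressive option)."""
    # routed: (batch, 1) float mask, 1.0 = block gradient for that example
    return routed * h.detach() + (1 - routed) * h

def route_via_parameter_edges(x, layer: nn.Linear, routed):
    """Block gradients only on the edges leaving this layer's parameters: routed
    examples don't update layer.weight / layer.bias, but gradient still flows
    back to earlier layers through the detached-weight path."""
    live = layer(x)  # gradient reaches weight, bias, and x
    bias = layer.bias.detach() if layer.bias is not None else None
    frozen = F.linear(x, layer.weight.detach(), bias)  # gradient reaches x only
    return routed * frozen + (1 - routed) * live
```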
I'm not sure if any of our ambitious alignment goals can be achieved via fine-tuning. Once the model has "settled on" certain ways of representing concepts, it seems too late to do the kinds of things we want.[1] But this may just be a lack of imagination! Given that PEFT can be viewed as a special case of gradient routing, maybe there's something there.
- ^
We (led by Jacob) tried a variety of things to get Expand, Route, Ablate to work as a fine-tuning method for unlearning. Unsurprisingly, we weren't able to get it to work.