An ASI democratically aligned with the preferences of all humans would probably try to stop, or at least sharply curtail, AI development. Only a small portion of the world wants to develop AGI/ASI for economic gain or intellectual curiosity. Most people do not want their jobs threatened, and at most want home robots or other machines to cut down on menial chores. If you disagree with this, you're living in a tech bubble.
That makes sense. My higher-level concern with gradient routing being used throughout training rather than after training (to some extent true for any other safety method) is the alignment tax: it might lead to significantly lower performance and so not get adopted in frontier models.
Evidence of this for gradient routing: people have tried various forms of modular training before [1], [2], and they never really caught on because it's always better to train a combined model, which allows optimal sharing of parameters.
It's still a cool idea though, and I would be happy to see it work out :)
[1] Andreas, Jacob, et al. "Neural Module Networks." CVPR 2016
[2] Ebrahimi, Sayna, et al. "Adversarial continual learning." ECCV 2020
Thanks for pointing me to Figure 12, it alleviates my concern! I don't fully agree with RMU being a stand-in for ascent-based methods. Targeted representation noising (as done in RMU) seems easier to reverse than loss maximization methods (like TAR). Finally, I just wanted to clarify that I see SSD/Potion more as automated mechanistic interpretability methods than as finetuning-based ones. What I meant to say was that adding some retain set finetuning on top (as done for gradient routing) might be needed to make them work for tasks like unlearning virology.
Thanks for sharing these interesting results!
I am a big fan of reporting unlearning results across identified forget set fractions! That said, I think the unlearning results lack comparisons to important ablations/baselines which would really test if gradient routing is adding value. For example:
1. CF (catastrophic forgetting) - This would involve removing most components of ERA, only keeping the finetuning on the retain set.
2. Ascent + CF - This would involve a light touch of gradient ascent (maximizing the loss) on the forget set, with simultaneous finetuning on the retain set. See [1] or AC↯DC in [2] for good implementations.
3. Methods that combine these concepts specifically for LLMs, like LLMU [3]
Without these, it is difficult to know if gradient routing is actually adding any value on top of what can be achieved with traditional finetuning.
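To make baseline 2 concrete, here is a minimal sketch of what one "Ascent + CF" update could look like. It is not the implementation from [1] or [2]: it assumes a toy logistic-regression model in NumPy, and the names `bce_loss_and_grad`, `ascent_plus_cf_step`, and `ascent_weight` are hypothetical. The only idea it is meant to show is a single step that descends on the retain loss while lightly ascending on the forget loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss_and_grad(w, X, y):
    """Mean binary cross-entropy of a linear model and its gradient w.r.t. w."""
    p = sigmoid(X @ w)
    eps = 1e-12  # avoids log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

def ascent_plus_cf_step(w, retain, forget, lr=0.1, ascent_weight=0.1):
    """One update on the objective L_retain - ascent_weight * L_forget.

    Descending this objective finetunes on the retain set (the "CF" part)
    while applying a light touch of gradient ascent on the forget set.
    ascent_weight is a hypothetical knob; 0.0 recovers plain CF.
    """
    Xr, yr = retain
    Xf, yf = forget
    _, g_retain = bce_loss_and_grad(w, Xr, yr)
    _, g_forget = bce_loss_and_grad(w, Xf, yf)
    return w - lr * (g_retain - ascent_weight * g_forget)
```

Setting `ascent_weight=0.0` gives baseline 1 (CF only), so the same loop can cover both ablations.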
Also, the SSD method has been shown to perform well on the setup of partial deletion sets [4], so another thing to check would be comparing Potion (a follow-up to SSD) [5] + finetuning on the retain set, which would stress-test the hypothesis of "we need gradient routing through a new subnetwork instead of just finding the relevant parts of the existing network".
[1] Trippa, Daniel, et al. "$\nabla\tau$: Gradient-based and Task-Agnostic machine Unlearning." CoRR 2024
[2] Kolipaka, Varshita, et al. "A Cognac shot to forget bad memories: Corrective Unlearning in GNNs." arXiv preprint arXiv:2412.00789 (2024).
[3] Yao, Yuanshun, Xiaojun Xu, and Yang Liu. "Large language model unlearning." arXiv preprint arXiv:2310.10683 (2023).
[4] Goel, Shashwat, et al. "Corrective machine unlearning." TMLR 2024
[5] Schoepf, Stefan, Jack Foster, and Alexandra Brintrup. "Potion: Towards Poison Unlearning." DMLR Journal 2024
Thanks, the rationale for using PCA was quite interesting. I also quite like the idea of separating different model classes for this evaluation.
I think it's interesting to see how much improvements in different types of safety benchmarks correlate with advancement in model capabilities. I also agree that designing decorrelated benchmarks is important, simply because it indicates they won't be saturated as easily. However, I have some doubts regarding the methodology and would appreciate clarifications if I misinterpreted something:
Using model-performance-based correlation: If I'm not wrong, the correlation of capability and safety is measured using the performance of various models on benchmarks. This seems more a measure of how AI progress has gone in the past than a statement about the benchmark itself. It's quite possible that models with more capabilities also have more safety interventions (as they came out later, when presumably more safety research had been applied), and that's why there is a correlation. On the flip side, if future model releases apply weapon-risk reduction techniques like unlearning, then those benchmarks will also start showing a positive correlation. Thus, I'm not sure this methodology provides robust insights for judging benchmarks. Further, it can also be gamed (artificially lowering the correlation) by strategically picking more models with safety interventions applied.
Projection on principal component: Why is comparing the projection on the first principal component of the Capability/Safety_benchmark x Model matrix preferable to just comparing average accuracies across capabilities and safety benchmarks?
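To make the question concrete, here is a small sketch of the two candidate capability scores side by side. It assumes an accuracy matrix with one row per model and one column per benchmark, and that benchmarks are standardized before PCA; both are my assumptions rather than details taken from the paper, and `capability_scores` is a hypothetical helper name.

```python
import numpy as np

def capability_scores(acc):
    """acc: (n_models, n_benchmarks) accuracy matrix.

    Returns (pc1_score, mean_score), one value per model:
      pc1_score  - projection onto the first principal component of the
                   per-benchmark standardized matrix
      mean_score - plain average accuracy across benchmarks
    """
    z = (acc - acc.mean(axis=0)) / acc.std(axis=0)
    # Rows of vt are the principal directions of the standardized matrix.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    pc1 = vt[0]
    if pc1.sum() < 0:  # the sign of a principal component is arbitrary
        pc1 = -pc1
    return z @ pc1, acc.mean(axis=1)
```

If the benchmarks are all driven by roughly one underlying factor, the two scores should rank models almost identically, and `np.corrcoef(pc1_score, mean_score)` will be close to 1. They can diverge when benchmarks differ a lot in scale or in how strongly they load on that factor.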
I believe these evaluations in unlearning miss a critical aspect: they benchmark on deleting i.i.d. samples or a specific class, instead of adversarially manipulated/chosen distributions. This might fool us into believing unlearning methods work, as we show in our paper both theoretically (Theorem 1) and empirically. The same failure mode holds for interpretability, which is similar to the motivation for studying the whole distribution in the recent Copy Suppression paper.
Thank you, I am glad you liked our work!
We think logistic regression might be homing in on some spurious correlations that help with classification in that particular distribution but don't have an effect on later layers of the model and thus its outputs. ClassMeans does as well as LEACE for removal because it has the same linear guardedness guarantee as LEACE, as mentioned in their paper.
As for using a disjoint training set to train the post-removal classifier: We found that the linear classifier attained random accuracies if trained on the dataset used for removal, but higher accuracies when trained on a disjoint training set from the same distribution. One might think of this as the removal procedure 'overfitting' on its training data. We refer to it as 'obfuscation' in the post in the sense that it's hard to learn a classifier from the original training data, but there is still some information about the concept in the model that can be extracted with different training data. Thus, we believe the most rigorous thing to do is to use a separate training set to train the classifier after removal.
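The evaluation protocol described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the post: removal is taken to be a ClassMeans-style projection that removes the class-mean difference direction estimated on the removal split, the probe is a plain logistic regression trained by gradient descent, and all function names are hypothetical.

```python
import numpy as np

def mean_difference_direction(X, y):
    """Unit vector between the two class means (ClassMeans-style direction)."""
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def erase_direction(X, direction):
    """Project every row of X onto the subspace orthogonal to `direction`."""
    return X - np.outer(X @ direction, direction)

def fit_probe(X, y, lr=0.5, steps=300):
    """Logistic-regression probe trained with plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(probe, X, y):
    w, b = probe
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

def evaluate_removal(removal, disjoint, test):
    """Each argument is an (X, y) pair with binary labels in {0, 1}.

    The direction is estimated on the removal split only, then erased from
    all three splits. Returns test accuracy of a probe trained on the
    removal split and of a probe trained on the disjoint split.
    """
    d = mean_difference_direction(*removal)
    Xr, Xd, Xt = (erase_direction(X, d) for X in (removal[0], disjoint[0], test[0]))
    same_split = accuracy(fit_probe(Xr, removal[1]), Xt, test[1])
    disjoint_split = accuracy(fit_probe(Xd, disjoint[1]), Xt, test[1])
    return same_split, disjoint_split
```

By construction the class means coincide exactly on the removal split after erasure, so a linear probe trained there has nothing to latch onto. On a disjoint split the means only coincide approximately, which is where any residual concept information would show up.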
My point was more to say that planning capabilities and situational awareness can be a bottleneck to executing 'deceptive alignment' even with a good understanding of what it means. Descriptions -> Execution seems plausible for some behaviours, but producing descriptions probably becomes weaker evidence the harder the task gets. For example, one can describe what 'good safety research' would look like but be very far from having the skills to actually do it oneself (writing code, knowledge of Deep Learning, etc.). On the other hand, one may have never heard of alignment but have excellent general research skills in another domain (say Math) which transfer to doing 'good safety research' with some guided nudging. This leads to my belief that descriptions may not be super useful evidence for harmful capabilities.
Overall, thanks for your response! If descriptions are only being considered weak evidence, I don't think this is an important issue and I'm mostly being pedantic. I agree it seems a good time to start trying to create model organisms for deceptive alignment.
Great post! It lays out a clear agenda and I agree for the most part, including about the timeliness of the scientific case due to the second argument. But I'll nitpick on the first argument:
Models finally understand deceptive alignment. Claude and GPT4 give (for the most part) clear explanations of the deceptive alignment, why it’s a risk, why it’s a useful strategy, etc. Previous generations of models mostly did not (e.g., the GPT-3 or 3.5 generation).
Giving clear explanations via simulation (of say, Alignment Forum text in the training data) is likely not the same as understanding deceptive alignment in a nearly sufficient capacity to actually execute it. It is possible deceptive alignment descriptions did not occur in the training data of GPT-3/GPT-3.5.
It is possible that being able to describe a capability is sufficient to execute it given a certain level of planning capabilities, situational awareness etc. but that does not seem obvious. I guess this is acknowledged implicitly to some extent later in the post.
It seems like this argument assumes that the model optimizes over the entire 'training process'. Why can't we test (perform inference with) the model on distributions different from the training distribution, where SGD can no longer optimize, to check whether the model was deceptively aligned in the training environment?