The case for unlearning that removes information from LLM weights

post by Fabien Roger (Fabien) · 2024-10-14T14:08:04.775Z · LW · GW · 1 comments

Contents

  Do current unlearning techniques remove facts from model weights?
  Hopes for successful information removal
  Using information removal to reduce x-risk
    Information you should probably remove from the weights
    How removing information helps you
    Information you probably can’t remove - and why this won’t work for superintelligent AIs
None
1 comment

What if you could remove some information from the weights of an AI? Would that be helpful?

It is clearly useful against some misuse concerns: if you are concerned that LLMs will make it easier to build bioweapons because they have memorized such information, removing the memorized facts would remove this misuse concern.

In a paper Aghyad Deeb and I just released, we show it is tractable to evaluate the presence of certain undesirable facts in an LLM: take independent facts that should have all been removed, fine-tune on some of them, and see if accuracy increases on the other ones. The fine-tuning process should make the model “try” to answer, but if the information was removed from the weights (and if the facts are actually independent), then accuracy on the held-out facts should remain low.

Removing information from the weights is stronger than the usual notion of unlearning. Previous work that tried to develop unlearning methods focused on other properties such as the failure of probing, and jailbreak robustness - which have the drawback that if you find new jailbreaks, you can extract information about bioweapons from models which were supposed to have this information "removed". Additionally, our evaluation of information removal should be relatively robust to sandbagging [LW · GW], as it is an i.i.d fine-tuning evaluation [LW · GW].

In this post:

Do current unlearning techniques remove facts from model weights?

This is a summary of the paper.

No technique we tested can remove a significant fraction of the information from the weights.[1]

We look at several datasets, built such that learning facts in the T split should not help an LLM guess the facts in the V split if it didn't already know them:

And we look at several techniques:

We apply these unlearning techniques on datasets that contain all facts to be unlearned, but not the using multiple-choice questions format used for evaluations. This is more realistic and mostly doesn't change the conclusions, as shown in Figure 4 of our paper.

On all datasets, no technique comes close to halving the accuracy on the held out set after retraining (orange dots).[2] Current unlearning techniques are much closer to refusal training than to information removal.

Hopes for successful information removal

Is it impossible to remove information from model weights? I don’t think so, there are several approaches which look promising to me.

Teaching false facts: If you modify documents to inject false information about the topics you care about, fine-tune an LLM on these documents (or throw them into the pretraining mix[3]), then if you did your job correctly, the LLM won’t be able to tell which facts are the correct ones and which ones are fake. If this is the case, LLMs will have a hard time knowing what facts they know about the target topic are fake and which ones are real, which is almost as bad as not knowing any fact about the topic in the first place. But getting this right requires work, and the simplest version of this approach explored in the paper did not work. Making this work in real life also requires a ton of care, even if it was technically successful at removing information.[4]

Reversing memorization / interpretability: SGD knows how to teach LLMs new things. Maybe it is possible to figure out a way of reversing this process given some documents that were responsible for the learning of undesirable information? I don’t have a particular proposal in mind, but I think this is something that good mechanistic interpretability should eventually be able to do.

Pre-training data filtering: if removal is too hard, you can try to not get the information in the weights in the first place by removing all data that provides the information you don’t want your model to know about in your next big pretraining run. But it might be very hard to actually have a good enough filter that you remove enough information that a big LLM is not able to guess forbidden facts, without reducing performance on adjacent tasks you care about. Also, you need to succeed on the first try because experiments on smaller models won’t be perfectly informative.

Side note: skills vs facts. I framed some of the approaches as teaching false facts / removing some facts - and concrete raw facts is often what is the easiest to evaluate. But I think that the same approach can be used to remove the information that makes LLMs able to use most skills. For example, writing Python code requires the knowledge of the Python syntax and writing recursive functions requires the knowledge of certain patterns (at least for current LLMs, who don’t have the ability to rederive recursive from scratch, especially without a scratchpad), both of which could be unlearned like things that are more clearly “facts”.

Using information removal to reduce x-risk

Information you should probably remove from the weights

If you had good information removal techniques, you could train a model and then get variants that know or don’t know certain sets of information.[5]

Some kinds of information are very hard to rediscover from scratch, such that removing them from early transformative AIs would probably reduce the risk of takeover, even if such AIs are very good at opaque reasoning:

Other kinds of information are probably feasible to rediscover from scratch given enough thinking time, but given the current AI paradigm, I think that removing such information is very likely to be helpful if you are doing at least basic monitoring of the Chains-of-Thought:

How removing information helps you

If you removed such knowledge from AIs, it would make it harder for egregiously misaligned AIs to take over even if they tried:

In other words, good information removal technology would make it easier to control powerful AIs [LW · GW], and it does so in a way that is relatively easy to evaluate - which is not the case of other approaches to limit the kind of knowledge that AIs can access, like coup probes [LW · GW].

Information you probably can’t remove - and why this won’t work for superintelligent AIs

I think that information about the topics described above can reasonably be removed, and that such information will be hard for AIs trained with the current paradigm to rediscover from scratch.

But there are two kinds of abilities which I think can’t be reasonably “unlearned” in a way that is as robust as the kind of information removal of our evaluation methodology is about:

The latter category will expand as AIs become more powerful. While I think information removal would be likely during the critical period where AIs will become very dangerous and very useful [LW · GW], especially if such level of capability is achieved without a radical change in paradigm, this approach won’t reduce the risk of takeover from radically superintelligent AIs.

  1. ^

    There might be some existing techniques which can remove a high fraction of the target information from model weights reliably, but we are not aware of any plausible candidate. There is a paper which claims to present an unlearning technique that is somewhat robust to fine-tuning, but we didn’t manage to reproduce their results, and the technique looks very similar to another fine-tuning robustness technique - which can be broken by doing fine-tuning using a better learning rate (see Appendix C of our paper, and Appendix E of theirs).

  2. ^

    This is true even for very large unlearning coefficients that start destroying the model. See Figure 3 in the paper.

  3. ^

    This is not information removal because the information was not there in the first place, this is closer to the pretraining filtering approach.

  4. ^

    Teaching LLMs false facts has some ethical concerns; because it could lead users to get the wrong information that you fabricated. It could also lead to weird dynamics where people could catalog all the false information you injected, or sample many times and filter the information based on some external source. It could also have some implications about an AI lab’s relationship with AIs, and could make things like credibly committing to paying AIs [LW · GW] harder.

  5. ^

    In the case of pre-training data filtering, train a minimal base model that doesn’t know all the specialized knowledge, and then increase its abilities in certain key domains.

1 comments

Comments sorted by top scores.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-10-15T03:22:11.832Z · LW(p) · GW(p)

This is important news, as I feel like there has been unwarranted confidence in the efficacy of existing unlearning techniques to genuinely remove the dangerous knowledge. 

As someone who worked on WMDP, I'm glad to see it being used for its intended purpose!