LEAst-squares Concept Erasure (LEACE)

post by tricky_labyrinth · 2023-06-07T21:51:04.494Z · LW · GW · 10 comments

This is a link post for https://twitter.com/norabelrose/status/1666469917636571137

"Ever wanted to mindwipe an LLM?

Our method, LEAst-squares Concept Erasure (LEACE), provably erases all linearly-encoded information about a concept from neural net activations. It does so surgically, inflicting minimal damage to other concepts.

...

LEACE has a closed-form solution that fits on a T-shirt. This makes it orders of magnitude faster than popular concept erasure methods like INLP and R-LACE, which require gradient-based optimization. And the solution can be efficiently updated to accommodate new data."
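For the curious, here is a rough NumPy sketch of what that closed-form solution looks like as I read the paper: center the activations, whiten them, project out the directions correlated with the concept labels, then unwhiten. The function and variable names, the ZCA-style whitening choice, and the rank tolerances below are illustrative, not the authors' reference implementation.

```python
# Minimal sketch of a LEACE-style eraser: whiten, project out the
# concept-correlated subspace, unwhiten. Names and the 1e-8 tolerances
# are illustrative choices, not taken from the paper's code.
import numpy as np

def fit_leace_eraser(X: np.ndarray, Z: np.ndarray):
    """X: (n, d) activations; Z: (n, k) concept labels (e.g. one-hot)."""
    mu_x = X.mean(axis=0)
    Xc, Zc = X - mu_x, Z - Z.mean(axis=0)

    sigma_xx = Xc.T @ Xc / len(X)   # feature covariance
    sigma_xz = Xc.T @ Zc / len(X)   # cross-covariance with the concept

    # ZCA-style whitening matrix W ~ Sigma_xx^{-1/2} and its pseudo-inverse.
    eigval, eigvec = np.linalg.eigh(sigma_xx)
    keep = eigval > 1e-8
    V, lam = eigvec[:, keep], eigval[keep]
    W = (V * lam ** -0.5) @ V.T
    W_pinv = (V * lam ** 0.5) @ V.T

    # Orthogonal projector onto the column space of W @ Sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > 1e-8]
    M = W_pinv @ (U @ U.T) @ W      # the "fits on a T-shirt" matrix

    def erase(x_new: np.ndarray) -> np.ndarray:
        # r(x) = x - W^+ P W (x - mu_x): remove only the linearly
        # concept-predictive component, leaving the rest untouched.
        return x_new - (x_new - mu_x) @ M.T

    return erase
```

After fitting, the erased activations have (approximately) zero empirical cross-covariance with the concept labels, which is the condition the paper ties to linear probes doing no better than chance.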

10 comments

comment by leogao · 2023-06-08T17:41:23.426Z · LW(p) · GW(p)

My summary of the paper: The paper proves that if you have two distributions that you want to ensure cannot be distinguished linearly (i.e. a logistic regression will fail to achieve a better-than-chance score), then one way to do this is to make sure they have the same mean. Previous work has done similar stuff (https://arxiv.org/abs/2212.04273), but without proving optimality.
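A toy sketch of that equal-means condition (mine, not from the comment or the paper): force the two classes to share an empirical mean and a logistic-regression probe falls to roughly chance accuracy, even though the classes remain easy to tell apart nonlinearly. The naive per-class mean-matching below is a stand-in for LEACE, which additionally perturbs the representation as little as possible.

```python
# Two classes with the same empirical mean but different covariance:
# logistic regression (a convex-loss linear probe) ends up near chance,
# while the same probe on squared features still separates the classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = np.vstack([
    rng.normal(0.0, 0.5, size=(n, 2)),   # class 0: tight Gaussian
    rng.normal(0.0, 2.0, size=(n, 2)),   # class 1: wide Gaussian
])
y = np.repeat([0, 1], n)

# Naive "erasure": shift each class so its empirical mean equals the
# global mean exactly (the equal-means condition in the summary above).
global_mean = X.mean(axis=0)
for c in (0, 1):
    X[y == c] += global_mean - X[y == c].mean(axis=0)

linear_acc = LogisticRegression().fit(X, y).score(X, y)
quad_acc = LogisticRegression().fit(X**2, y).score(X**2, y)

print(f"linear probe accuracy:    {linear_acc:.2f}")  # ~0.50, i.e. chance
print(f"quadratic-feature probe:  {quad_acc:.2f}")    # well above chance
```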

Replies from: nora-belrose
comment by Nora Belrose (nora-belrose) · 2023-10-20T17:02:37.470Z · LW(p) · GW(p)

then one way to do this is to make sure they have the same mean

Yep, although we actually go a bit further than that and show that making the means equal is necessary, at least if you want your method to work for general convex loss functions.

comment by ProgramCrafter (programcrafter) · 2023-06-08T07:36:07.212Z · LW(p) · GW(p)

And is there a technique to erase a certain concept from the prompt?

It can be useful both for AI safety and capabilities: currently an LLM cannot forget unsuccessful attempts to solve a task, and those can make it harder to find new ways to solve it. (I'd call that "object permanence of a simulacrum".)

Replies from: sanxiyn
comment by sanxiyn · 2023-06-08T07:44:43.772Z · LW(p) · GW(p)

What do you mean? Current LLMs are stateless. If unsuccessful attempts to solve the task are made, just reset the history and retry.

Replies from: programcrafter
comment by ProgramCrafter (programcrafter) · 2023-06-08T07:48:31.297Z · LW(p) · GW(p)

I mean something like AutoGPT, where there is no human in the loop who could reset the history.

For example, I've seen ChaosGPT get into a loop of "researching nuclear weapons". Probably if it could erase those attempts completely from its context, it would generate more interesting ideas (though there is still a question of whether we need that).

Replies from: sanxiyn
comment by sanxiyn · 2023-06-08T07:56:53.996Z · LW(p) · GW(p)

That is trivial to program? For example, you could have an AutoGPT UI that lists pending tasks with icons next to them, where clicking a trashcan icon completely erases that task from the context. That doesn't need any LLM-level help like LEACE.

Replies from: dr_s
comment by dr_s · 2023-06-08T12:40:42.973Z · LW(p) · GW(p)

And of course you could also have another LLM instance with specific instructions acting as a kind of censor that judges which prompts should be erased automatically.

comment by Noosphere89 (sharmake-farah) · 2023-06-07T22:38:40.806Z · LW(p) · GW(p)

This is very good news, and it really should be celebrated as important progress on both AI safety in the context of existential risk and AI ethics in the context of fairness and bias.

comment by dr_s · 2023-06-08T12:57:54.579Z · LW(p) · GW(p)

Ever wanted to mindwipe an LLM?

Cue thriller novel about an OpenAI researcher who committed a murder, and for some reason the crucial evidence that could get him arrested ended up in the training set of the newest GPT. Even though he scrubbed it from the dataset itself, he now lives in fear that at some point, in some conversation, the LLM will just tell someone the truth and he'll be arrested.

(jokes aside, great work! This actually looks like fantastic news for both AI ethics and safety in general, especially once it is generalised to other kinds of AI besides LLMs, which I imagine should be possible)

Replies from: nora-belrose
comment by Nora Belrose (nora-belrose) · 2023-10-20T17:04:46.935Z · LW(p) · GW(p)

especially once it is generalised to other kinds of AI besides LLMs, which I imagine should be possible

The method actually already is highly general, and in fact isn't specific to deep learning at all. More work does need to be done to see how well it can steer neural net behavior in real-world scenarios, though.