Locating and Editing Knowledge in LMs
post by Dhananjay Ashok (dhananjay-ashok) · 2025-01-24T22:53:40.559Z
In my previous post I went over some common approaches for updating LMs with fresh knowledge. Here, I detail a specific approach that has gained popularity in recent years - locating and editing factual associations in language models. I do not believe in this approach; in this post I try to summarize it fairly and explain why I don’t quite like it.
Do LMs Store Facts in Their Weights?
Language models that employ the transformer architecture have Feed Forward Networks (FFNs) as an important subcomponent. For any specific layer, the FFN has 2 sublayers within it.
During the forward pass of a LM, the FFN takes as input a dense vector representation from the previous layer and outputs a dense vector of its own.
Two key operations happen here: at each sublayer the vector is multiplied by a matrix of weights. There is a line of thought that sees the weights of the FFNs as a sort of neural database that stores memories.
Let us call the dense vector that goes into the FFN the query vector, the first layer weights the key matrix and the second layer weights the value matrix. Now look at how the FFN works: it takes in the query, applies the key transformation to it to get a key representation, and then uses that key representation to recall parts of the value matrix, producing a value representation (the final output). This gives us the interpretation:
The FFN component is said to store memories that are accessible via specific inputs to the component. The keys store input patterns that commonly occur in the training data, and the values store outputs that are triggered by those input patterns.
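The key-value reading described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not in the post: the sizes `d_model` and `d_ff` are made up, the weights are random, the nonlinearity is taken to be a ReLU, and biases are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                  # hypothetical sizes, chosen only for illustration

K = rng.normal(size=(d_ff, d_model))   # "key matrix": one key (input pattern) per row
V = rng.normal(size=(d_ff, d_model))   # "value matrix": one value (output direction) per row

def ffn(x):
    """Key-value reading of an FFN: match x against the keys, then mix the values."""
    coeffs = np.maximum(K @ x, 0.0)    # how strongly each key fires (ReLU assumed)
    return coeffs @ V, coeffs          # output = coefficient-weighted sum of value rows

x = rng.normal(size=d_model)           # the "query": the vector entering the FFN
out, coeffs = ffn(x)

top = np.argsort(coeffs)[::-1][:3]     # the memory slots that contributed most
print("most active memory slots:", top)
```

Under this reading, asking which inputs make a given entry of `coeffs` large is the same as asking which input patterns a key responds to, and the matching row of `V` is what that key contributes to the output.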
The paper that discovered this collected the sentences most associated with particular keys and had humans categorize them. The sentences that activated keys in the early layers showed shallow linguistic patterns (e.g. the word “substitute” is the final token), while those that activated keys in the later layers showed semantic patterns (e.g. whether the text refers to a TV show). In the same work they found that the output values of the FFN promote a particular output token that may go on to be the final prediction of the model.
There has since been a lot of work studying how specific components or mechanisms in LMs behave when the model is about to output factual information. This work focuses mainly on documenting which attention heads or parts of the model light up when processing subject, relation and object tokens, and how they combine to form a prediction.
But is this actually storing memories or knowledge?
I am sceptical of the idea that specific facts or factual associations are stored in local areas of the LM. I am partial to the view that factual associations are “stored” in a distributed manner, suggesting that trying to identify “where” the facts are stored in a LM is not a fruitful endeavour.
To see what I mean by this, let us first look at the common paradigm that attempts to put this view into action.
Editing Methods
There aren’t too many methods that adopt this point of view, making it easy to draw a common philosophy between them.
What KN, MEMIT and PMET have in common is that they first try to locate the neurons (or layers) that “store” the factual knowledge they want to edit, and then make adjustments to that identified area (either by changing the inference-time activation of those neurons or by modifying the weights) to “rewrite the model’s knowledge”. I covered MEMIT in greater detail in the previous post. The only fundamental issue I will recount here is that all of these methods rely on triplets of the form (subject, relation, object), and it is unclear how (or whether) these methods can be used for general text-based information.
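To make the "modify the weights" step concrete, here is a toy rank-one edit on a single linear layer. It is a sketch under loud assumptions, not the procedure of any of the methods named above: the matrix `W`, the key `k`, the target `v_new` and the helper `rank_one_edit` are all hypothetical, and the published methods add further machinery (such as how the key and target value are chosen) that is left out here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 16                   # hypothetical dimensions

W = rng.normal(size=(d_out, d_in))     # stand-in for the "located" weight matrix
k = rng.normal(size=d_in)              # key: representation of the edited prompt
v_new = rng.normal(size=d_out)         # value we want the layer to emit for k

def rank_one_edit(W, k, v_new):
    """Return W' with W' @ k == v_new, changing W by a single rank-one term."""
    residual = v_new - W @ k           # what the layer currently gets "wrong" on k
    return W + np.outer(residual, k) / (k @ k)

W_edited = rank_one_edit(W, k, v_new)

# Build a second key that is orthogonal to k by removing its component along k.
other = rng.normal(size=d_in)
other -= (other @ k) / (k @ k) * k

print(np.allclose(W_edited @ k, v_new))          # does the edited key map to v_new?
print(np.allclose(W_edited @ other, W @ other))  # is the orthogonal key unchanged?
```

In this toy setting the change applied to any other input is the residual scaled by how much that input overlaps with `k`, so the edit carries over to a different input only to the extent that its representation lines up with the edited key.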
But these methods do work. So what’s happening here?
My theory is:
Instead of truly “editing factual knowledge”, these methods are making the model more likely to output a specific token when it receives a specific input.
So this means when you “edit” the model to say “The capital of France is Dubai”, you are not really changing the way it internally models the relationship between France and Dubai (or Paris for that matter), rather you are making a forced adjustment that will make the model more likely to output “Dubai” when it encounters the prefix “The capital of France is”.
If my hypothesis is correct, the edited model will always fail to perform satisfactorily when you ask it to write a creative letter to its friend from the capital of France. This is because the exact form of the prefix is not the same, and hence the edit of the factual association is unlikely to transfer.
These results have started to come in: model editing methods are showing themselves unable to handle second-order implications of their edited knowledge, are inconsistent, are often less robust to paraphrases than simply prompting the model, and underperform in more realistic scenarios.
There are also deeper philosophical reasons to be highly sceptical of these approaches, namely ones that call into question the idea that language models should be viewed as a reliable repository of facts or beliefs in the first place.
Apart from this, model editing methods seem to damage the LM’s other capabilities when just a few edits are applied sequentially. This makes sense to me: editing weights directly with no regard for linguistic fidelity seems certain to eventually lead to some form of collapse.
So what should we do then?
The views I advance in this post seem to suggest that I don’t believe that knowledge acquisition is possible at all. To some extent I think this is true: I believe that the only way to remove an association from the model is to retrain the whole model (or at least very large parts of it). However, this is not feasible.
The next best approach, then, is to rely on a store of memory that is external to the model. Retrieve from that evidence store when you need to generate something, instead of hoping that your model has knowledge that is accurate. Factual information is still vital in pretraining and fine-tuning, but this is because it gives the model the ability to properly process retrieved contexts as required.
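As a rough illustration of relying on an external store, the sketch below retrieves the most relevant passage for a question and places it in the prompt. Everything here is an assumption made for the example: the `EVIDENCE` list, the bag-of-words similarity, and the helper names `retrieve` and `build_prompt` are hypothetical, and a real system would use a proper retriever and then pass the prompt to an actual LM.

```python
from collections import Counter
import math

# Hypothetical evidence store; in practice this would be a maintained corpus.
EVIDENCE = [
    "Paris is the capital of France.",
    "Canberra is the capital of Australia.",
    "The Seine flows through Paris.",
]

def bag_of_words(text):
    """Lowercased word counts, with surrounding punctuation stripped."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, store, k=1):
    """Return the k passages most similar to the query."""
    q = bag_of_words(query)
    return sorted(store, key=lambda p: cosine(q, bag_of_words(p)), reverse=True)[:k]

def build_prompt(query, store):
    """Put the retrieved evidence in front of the question for the LM to condition on."""
    context = "\n".join(retrieve(query, store))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("What is the capital of France?", EVIDENCE))
```

The point of the design is that updating a fact means editing an entry in `EVIDENCE`, with no change to the model's weights.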
In the next post I will dive into the methods or class of approaches that I do actually believe in, and try to come up with gaps in the field.