Why no major LLMs with memory?
post by Kaj_Sotala · 2023-03-28T16:34:37.272Z · LW · GW · No comments
This is a question post.
Contents
Answers:
  28  Carl Feynman
  10  Lone Pine
   7  Ustice
   4  abhayesian
   3  bvbvbvbvbvbvbvbvbvbvbv
   3  Ustice
   2  Bartlomiej Lewandowski
No comments
One thing that I'm slightly puzzled by is that an obvious improvement to LLMs would be adding some kind of long-term memory that would allow them to retain more information than fits their context window. Naively, I would imagine that even just throwing some recurrent neural net layers in there would be better than nothing?
But while I've seen LLM papers that talk about how the models are multimodal or smarter than before, I don't recall seeing any widely publicized model that extends memory beyond the immediate context window, and that confuses me.
Answers
Models with long-term memory are very hard to train. Instead of being able to compute a weight update after seeing a single input, you have to run in a long loop of "put thing in memory, take thing out, compute with it, etc." before you can compute a weight update. It's not a priori impossible, but nobody's managed to get it to work. Evolution has figured out how to do it because it's willing to waste an entire lifetime to get a single noisy update.
People have been working on this for years. It’s remarkable (in retrospect, to me) that we’ve gotten as far as we have without long term memory.
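A minimal sketch of the training problem, using a toy PyTorch model (the model and its names are made up for illustration): before you get even one gradient, the whole read/write loop has to be unrolled, step by sequential step, and the gradient then has to flow back through all of it.

```python
# Toy memory model (hypothetical): one weight update requires unrolling the full
# sequential loop of "combine input with memory, write new memory" first.
import torch
import torch.nn as nn

class TinyMemoryModel(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.read = nn.Linear(dim * 2, dim)   # combine input with current memory
        self.write = nn.Linear(dim * 2, dim)  # produce the next memory state
        self.out = nn.Linear(dim, dim)

    def forward(self, xs):
        # xs: (seq_len, batch, dim); memory starts empty
        memory = torch.zeros(xs.shape[1], xs.shape[2])
        outputs = []
        for x in xs:  # sequential loop: each step depends on the previous write
            joint = torch.cat([x, memory], dim=-1)
            outputs.append(self.out(torch.tanh(self.read(joint))))
            memory = torch.tanh(self.write(joint))
        return torch.stack(outputs)

model = TinyMemoryModel()
xs = torch.randn(512, 8, 32)      # one long episode
loss = model(xs).pow(2).mean()    # only now is there a loss...
loss.backward()                   # ...so the gradient flows back through all 512 steps
```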
↑ comment by jacopo · 2023-03-28T19:42:21.760Z · LW(p) · GW(p)
Isn't that the point of the original transformer paper? I have not actually read it, just going by summaries read here and there.
If I don't misremember, RNNs should be especially difficult to train in parallel.
Replies from: carl-feynman
↑ comment by Carl Feynman (carl-feynman) · 2023-03-28T23:41:45.991Z · LW(p) · GW(p)
Transformers take O(n^2) computation for a context window of size n, because they effectively feed everything inside the context window to every layer. This provides the benefits of a small memory, but it doesn't scale. A transformer has no way of remembering things from before the context window, so it's like a human with a busted hippocampus (Korsakoff's syndrome) who can't make new memories.
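A toy illustration of the quadratic cost (not any particular model's code): self-attention builds an n-by-n score matrix, so doubling the context length roughly quadruples the work.

```python
# Self-attention over a context of length n produces an (n, n) score matrix,
# so compute and memory grow quadratically in n.
import torch

n, d = 1024, 64                 # context length, head dimension
q = torch.randn(n, d)
k = torch.randn(n, d)
v = torch.randn(n, d)

scores = q @ k.T / d ** 0.5     # shape (n, n): every position attends to every other
attn = torch.softmax(scores, dim=-1)
out = attn @ v                  # doubling n quadruples the work in `scores` and `attn`
```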
↑ comment by Noosphere89 (sharmake-farah) · 2023-03-28T17:29:56.168Z · LW(p) · GW(p)
I suspect much of the reason we haven't needed long-term memory is that the context window can be increased pretty cheaply, so long-term memory gets deprioritized.
There is an architecture called RWKV which claims to have an 'infinite' context window (since it is similar to an RNN). It claims to be competitive with GPT-3. I have no idea whether this is worth taking seriously or not.
↑ comment by abhayesian · 2023-03-28T22:15:21.410Z · LW(p) · GW(p)
I don't think it's fair for them to claim that the model has an infinite context length. It appears that they can train the model as a transformer, but can turn the model into an RNN at inference time. While the RNN doesn't have a context length limit as the transformer does, I doubt it will perform well on contexts longer than it has seen during training. There may also be limits to how much information can be stored in the hidden state, such that the model has a shorter effective context length than current SOTA LLMs.
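A rough sketch of the general idea, using a generic GRU cell as a stand-in rather than RWKV's actual recurrence: at inference time a fixed-size state is carried forward indefinitely, which is why there is no hard window limit, but also why everything the model "remembers" has to be squeezed into that state.

```python
# RNN-style inference: the hidden state never grows, no matter how long the stream.
import torch
import torch.nn as nn

dim = 64
cell = nn.GRUCell(dim, dim)       # stand-in for RWKV's recurrence
state = torch.zeros(1, dim)       # fixed-size hidden state

for t in range(10_000):           # arbitrarily long stream of tokens
    token = torch.randn(1, dim)   # embedding of the next token
    state = cell(token, state)    # no window limit, but bounded capacity
```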
↑ comment by bvbvbvbvbvbvbvbvbvbvbv · 2023-03-29T08:00:02.179Z · LW(p) · GW(p)
Two links to learn more about RWKV:
Given that LLMs can use tools, it sounds like a traditional database could be used. The data would still have to fit inside the context window, along with the generated continuation prompt, but that might work for a lot of cases.
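A hand-wavy sketch of that tool-use idea, with made-up table and column names: facts live in an ordinary SQLite database, and whatever is recalled gets pasted back into the prompt, so it still has to fit in the context window.

```python
# Store facts in a plain database; recall them on demand and splice them into the prompt.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memory (topic TEXT, fact TEXT)")
db.execute("INSERT INTO memory VALUES ('deadline', 'The report is due next Friday.')")

def recall(topic: str) -> str:
    rows = db.execute("SELECT fact FROM memory WHERE topic = ?", (topic,)).fetchall()
    return "\n".join(r[0] for r in rows)

# Whatever is recalled still has to fit in the context window alongside the new prompt.
prompt = "Relevant memories:\n" + recall("deadline") + "\n\nUser: When is the report due?"
```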
↑ comment by hold_my_fish · 2023-03-29T03:54:24.779Z · LW(p) · GW(p)
I could also imagine this working without explicit tool use. There are already systems for querying corpuses (using embeddings to query vector databases, from what I've seen). Perhaps the corpus could be past chat transcripts, chunked.
I suspect the trickier part would be making this useful enough to justify the additional computation.
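A hypothetical sketch of that retrieval idea, with `embed` as a placeholder for a real embedding model: chunk past transcripts, index their embeddings, and pull the most similar chunks back into the prompt.

```python
# Embedding-based recall over past chat transcripts (toy in-memory "vector database").
import numpy as np

def embed(text: str) -> np.ndarray:
    # placeholder: a real system would call an embedding model here
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

past_chunks = ["...chunk of an old conversation...", "...another chunk..."]
index = np.stack([embed(c) for c in past_chunks])

query = "What did we decide about the deployment schedule?"
scores = index @ embed(query)                          # cosine similarity (unit-norm vectors)
recalled = [past_chunks[i] for i in np.argsort(-scores)[:2]]

prompt = "Relevant past conversation:\n" + "\n".join(recalled) + "\n\nUser: " + query
```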
One thing that comes to mind is DeepMind's Adaptive Agents team using Transformer-XL, which can attend to data outside the current context window. I think there was speculation that GPT-4 may also be a Transformer-XL, but I'm not sure how to verify that.
↑ comment by Oliver Daniels-Koch (oliver-daniels-koch) · 2023-03-29T15:53:53.645Z · LW(p) · GW(p)
I briefly read a ChatGPT description of Transformer-XL. Is this essentially long-term memory? Are there computations an LSTM could do that a Transformer-XL couldn't?
Replies from: abhayesian
↑ comment by abhayesian · 2023-03-29T19:25:18.504Z · LW(p) · GW(p)
There is still technically a limit to how far back a Transformer-XL can see since each layer can only attend to previous keys/values computed by that layer. As a result, the receptive field of layer L can only be as wide as the last L context windows. I guess this means that there might be some things that LSTMs can do that Transformer-XL can't, but this can be fixed with a couple of minor modifications to Transformer-XL. For example, this paper fixes the problem by allowing layers to attend to the outputs of later layers from previous context windows, which should make the receptive field (at least theoretically) infinitely long, meaning it should probably be able to do everything an LSTM can.
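A loose, simplified sketch of segment-level recurrence in that style (not Transformer-XL's actual code): each layer attends over the current segment plus states cached from the previous segment, so information can only propagate one segment further back per layer.

```python
# Segment-level recurrence: cache each layer's input states and let the next segment attend to them.
import torch
import torch.nn as nn

dim, seg_len, n_layers = 64, 128, 4
attn_layers = nn.ModuleList(nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
                            for _ in range(n_layers))
memory = [torch.zeros(1, seg_len, dim) for _ in range(n_layers)]  # one cache per layer

def forward_segment(x, memory):
    new_memory = []
    for layer, mem in zip(attn_layers, memory):
        new_memory.append(x.detach())         # cache this layer's input for the next segment
        context = torch.cat([mem, x], dim=1)  # previous segment's cached states + current segment
        x, _ = layer(x, context, context)     # queries come from the current segment only
    return x, new_memory

segment = torch.randn(1, seg_len, dim)
out, memory = forward_segment(segment, memory)  # call repeatedly, one segment at a time
```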
On mobile, but FYI: langchain implements some kind of memory.
Also, this other post might interest you. It's about asking GPT to decide when to call a memory module to store data: https://www.lesswrong.com/posts/bfsDSY3aakhDzS9DZ/instantiating-an-agent-with-gpt-4-and-text-davinci-003 [LW · GW]
Given that we know that LLMs can use tools, can traditional databases be used for long-term memory?
I think there has been a lot of research in the past in this space. The first thing that popped into my mind was https://huggingface.co/docs/transformers/model_doc/rag
Currently, there are some approaches using langchain that persist the history of a conversation into an embeddings database and retrieve the relevant parts when performing a similar query or task.
No comments