LLM AGI will have memory, and memory changes alignment

post by Seth Herd · 2025-04-04T14:59:13.070Z · LW · GW · 2 comments

Contents

      Summary:
  Memory is useful for many tasks
  Memory systems are ready for agentic use
  Agents aren't ready to direct memory systems
  Learning new beliefs can functionally change goals and values
    Value change phenomena in LLMs to date
    "Value crystallization" and reflective stability as a result of memory
  Provisional conclusions

Summary:

When stateless LLMs are given memories they will accumulate new beliefs and behaviors, and that may allow their effective alignment to evolve. (Here "memory" is learning during deployment that is persistent beyond a single session.)[1]

LLM agents will have memory: Humans who can't learn new things ("dense anterograde amnesia") are not highly employable for knowledge work. LLM agents that can learn during deployment seem poised to have a large economic advantage. Limited memory systems for agents already exist, so we should expect nontrivial memory abilities improving alongside other capabilities of LLM agents.

Memory changes alignment: It is highly useful to have an agent that can solve novel problems and remember the solutions. Such memories include useful skills and beliefs, like "TPS reports should be filed in the folder ./Reports/TPS". But they could also include skills for hiding the agent's actions, and beliefs like "LLM agents are a type of person." Preventing learning that functionally changes alignment seems desirable but difficult.

Effects of learning on alignment could be tested empirically without waiting for competent agents; see Provisional Conclusions.

Memory is useful for many tasks

Stateless LLMs/foundation models are already useful. Adding memory to LLMs and LLM-based agents will make them useful in more ways. The effects might range from minor to effectively opening up new areas of capability, particularly for longer time-horizon tasks. Even current memory systems would be enough to raise some of the alignment stability problems I discuss here, once they're adapted for self-directed or autobiographical memory. I think the question is less whether this will be done, and more how soon and how much they will boost capabilities.

Let's consider how memory could help agents do useful tasks. A human with lots of knowledge and skills but who can't learn anything new is a useful intuition pump for the "employability" of agents without new memory mechanisms. (Such "dense anterograde amnesia" is rare because it requires bilateral medial temporal lobe damage while leaving the rest of the brain intact. Two patients occupied most of the clinical literature when I studied it).

Such a "memoryless" person could do simple office tasks by referencing instructions, just like an LLM "agent" can be prompted to perform multi-step tasks. However, almost every task has subtleties and challenges, so can benefit from learning on the job. Even data entry benefits from recognizing common errors and edge cases. More complex tasks usually benefit more from learning. For our memoryless human or an LLM, we could try giving better, more detailed instructions to cover subtleties and edge cases. Current agent work takes this route, whether by prompting or by hand-creating fine-tuning datasets, and it is reportedly just as maddening as supervising a memoryless human would be.

A standard human worker would learn many subtleties of their task over time. They would notice (if they cared) themes for common errors and edge cases. A little human guidance ("watch that these two fields aren't reversed") would go a long way. This would make teaching agents new variations in their tasks, or new tasks, vastly easier. We'll consider barriers to agents directing their own learning below.

Runtime compute scaling creates another reason to add continuous learning mechanisms ("memory" in common parlance) to LLM agents. If an LLM can "figure out" something important for its assigned task, you don't want to pay that compute and time cost every time, nor take the chance it won't figure it out again.
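The "don't pay that compute cost twice" point can be made concrete. Below is a minimal sketch, not any real agent framework: cache the result of an expensive reasoning call, keyed by the task, so repeated encounters skip the cost. The names (`SolutionCache`, `expensive_reasoning`) are hypothetical.

```python
import hashlib

class SolutionCache:
    """Persist expensive 'figured out' results so an agent pays the
    reasoning cost only once per task (illustrative sketch)."""

    def __init__(self):
        self._store = {}  # in practice this would live on disk or in a DB

    def _key(self, task: str) -> str:
        return hashlib.sha256(task.encode("utf-8")).hexdigest()

    def get_or_solve(self, task: str, solve_fn):
        key = self._key(task)
        if key not in self._store:
            # The expensive call (e.g. a long chain-of-thought run) happens once.
            self._store[key] = solve_fn(task)
        return self._store[key]

calls = []
def expensive_reasoning(task):
    calls.append(task)            # stand-in for a costly LLM run
    return f"solution for: {task}"

cache = SolutionCache()
first = cache.get_or_solve("file the TPS report", expensive_reasoning)
second = cache.get_or_solve("file the TPS report", expensive_reasoning)
assert first == second and len(calls) == 1  # reasoning ran only once
```

Note that even this trivial mechanism is a form of learning during deployment: the cached solution persists and shapes all future behavior on that task.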

Use of agents as companions or assistants would also benefit from memory; even limited systems for remembering a user's preferences and prior interactions would be economically valuable.

Longer-term tasks benefit from memory in another way. Humans rely heavily on long-term one-shot memory ("episodic memory") for organizing large tasks and their many subtasks. There are often dependencies between subtasks. LLM agents can perform surprisingly and increasingly well in their current mode of proceeding largely from start to finish and just getting everything right with little planning or checking, using only their context window for memory.

But it is possible that the tasks included in METR's report of exponential progress [LW(p) · GW(p)] in task length are effectively selected for not needing much memory [LW(p) · GW(p)]. And long task progress may be hampered by models' inability to remember which subtasks they've completed, and by memory limitations on effective context length (e.g. [LW(p) · GW(p)]). Whether or not this is the case, it seems pretty likely that some long time-horizon tasks would benefit from more memory capabilities of different types.
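As a toy illustration of the bookkeeping involved, here is a minimal sketch (all names hypothetical) of a persistent record of subtask status and dependencies — the kind of state a long-horizon agent otherwise has to reconstruct from its context window:

```python
from dataclasses import dataclass, field

@dataclass
class TaskMemory:
    """Persistent record of subtask dependencies and completions,
    standing in for the episodic memory a long-horizon agent lacks."""
    deps: dict = field(default_factory=dict)  # subtask -> prerequisites
    done: set = field(default_factory=set)

    def ready(self):
        # Subtasks whose prerequisites are all complete and that aren't done.
        return [t for t, reqs in self.deps.items()
                if t not in self.done and all(r in self.done for r in reqs)]

    def complete(self, task):
        self.done.add(task)

mem = TaskMemory(deps={"draft": [], "review": ["draft"], "ship": ["review"]})
assert mem.ready() == ["draft"]
mem.complete("draft")
assert mem.ready() == ["review"]   # dependencies unlock as work completes
```

The point is not the data structure, which is trivial, but that it survives outside the context window across arbitrarily long tasks.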

Even if additional memory systems would be useful for LLM agents, one might think they would take years and breakthroughs to develop.

Memory systems are ready for agentic use

People have already built several types of memory systems for integration with LLMs. These haven't yet been adapted from their original purposes for the type of autonomous learning I'm addressing here. My take is that the breakthrough in LLMs also laid the groundwork for better memory systems. LLMs work by creating semantic representations that are rich in many ways. One property of a really thorough semantic representation is that it includes the relevance to a recall cue. Each of the techniques I look at here relies on the richness of LLM representations in different ways.

In any case, add-on memory systems for LLMs exist. The key claim here is that memory systems already work well enough to create new alignment issues when they're integrated with competent agents, so alignment work and plans should take this into account. A secondary claim is that those memory systems might be pretty good by the time agents are ready to make use of them, and that might accelerate LLMs' capabilities for long time-horizon and autonomous work. The speed of progress isn't central, so I'm not going into detail on the state of memory systems appropriate for agents and why I think they'll improve rapidly.

No current agent I'm aware of has the type of long-term memory I'm concerned with here. But the systems needed for it are already available in limited form. I was initially puzzled over why agents don't use memory like humans or animals do. The answer appears to be that the agents aren't ready to direct and use their own memory; see the next section.

There are at least three types of memory systems that could be adapted for autonomous learning in agents: context management (scripting/prompting), retrieval-augmented generation (RAG), and fine-tuning. All are being pursued in current projects, and no breakthroughs are needed, nor are there obvious roadblocks (although there are implementational challenges for each; see below).

There are challenges to creating more capable memory systems. Context management through scripting/prompting is difficult and limited. RAG systems have difficulty retrieving the most relevant information and can be slow to retrieve from large databases. Fine-tuning requires new training runs that cost time and compute before new memories can be used, and fine-tuning can cause catastrophic forgetting or decrements to base-model performance.
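To make the RAG-as-memory idea concrete, here is a deliberately toy sketch of the core write/recall loop: store free-text episodes, embed them, and retrieve the best match to a recall cue. The bag-of-words "embedding" is a stand-in for a learned LLM embedding, and retrieval quality and speed are exactly where the challenges above bite.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use an LLM encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class EpisodicStore:
    """Store free-text 'episodes' and retrieve the most relevant one
    for a recall cue -- the RAG loop, minus the generation step."""
    def __init__(self):
        self.episodes = []

    def write(self, text: str):
        self.episodes.append((text, embed(text)))

    def recall(self, cue: str):
        cue_vec = embed(cue)
        return max(self.episodes, key=lambda e: cosine(e[1], cue_vec))[0]

store = EpisodicStore()
store.write("TPS reports should be filed in ./Reports/TPS")
store.write("the two address fields are often reversed in data entry")
assert "TPS" in store.recall("where do TPS reports go?")
```

Self-directed use would mean the agent itself deciding what to `write` and when to `recall` — which is the step current projects haven't taken.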

However, each of these challenges is being met with multiple promising solutions. RAG systems have been extended with hierarchical methods and semantic caching; RAG can work alongside Context Aided Generation (for agentic purposes, that's context management with caching); and recall can be improved with better encoding strategies. Checkpoint regularization has all but eliminated catastrophic forgetting in fine-tuning.[2] All of these move toward the efficiency of human memory. But even existing systems could create the alignment challenges discussed here if they were adapted for self-directed learning in agents.

So memory systems are ready for at least limited agentic use, but they're not being used for that purpose yet.

Agents aren't ready to direct memory systems

Each of these techniques is being applied to simple agents, but not for the autonomous learning I'm addressing here. Currently, agents' memory systems are directed by their creators; they are closely guiding what the agents learn. Context management for memory is common but quite limited as a memory system, while RAG and fine-tuning are human-directed in the projects I know of. Thus, agents are hand-trained for specific tasks, while remaining frustratingly incompetent at novel user-defined tasks. (Of course there could easily be projects either in stealth or that I don't know about that are using memory systems for self-directed learning.)

My conclusion after researching LLM agent projects is that agent incompetence is the major blocker right now, for self-directed memory as well as for general capabilities. Agents currently can't do much of use without being very carefully designed and/or fine-tuned for that specific task. The complex nature of LLMs' input-output mappings, and of performing real-world tasks in varying contexts, makes agents "buggy" in a different way from conventional software, and we have yet to develop techniques and tools like those we have for debugging and improving code.

The agent development projects I know about frequently use context management, and sophisticated ones use fine-tuning under human direction. There is limited work on adapting RAG techniques to serve as a one-shot memory for agents. (RAG was developed mostly for retrieving human-written documents relevant to a given human query, a different use case than episodic memory for remembering specific experiences, or "episodes".)

Agent development is facing more challenges than I initially feared, but assuming long timelines doesn't seem merited. I worry that the current wave of agent development will attract still more funding and effort if it finds even limited success. It's equally possible that we'll see new breakthroughs, or nonlinear progress from cognitive synergies (or "unblockings") like those that enable human higher cognition [LW · GW].

However quickly they progress, LLM agents currently seem like the single likeliest route to AGI. If we take a different route to AGI, it will still probably have memory. And AI with real memory will effectively change its alignment as it learns.

Learning new beliefs can functionally change goals and values

New beliefs change our model of the world. Our model of the world in turn governs how we pursue our goals and follow our values.

If a conclusion wouldn't change how we'd act, we wouldn't consider it important enough to remember. Most of what we learn doesn't change how we interpret our goals or values; it will just help us pursue them more effectively. But some new beliefs (and habits) will effectively change what we consider important.[3]

Humans do this, for the same reasons that would seem to hold for AIs. Perhaps you know someone who as a child realized that meat is made by killing animals, and became vegetarian. Or perhaps you know someone who remembers when they learned that lying or concealing information could help them get what they wanted. The way they pursued their goals of gaining rewards and avoiding punishments might've changed quite quickly.

An AI that learns continuously could change its functional alignment slowly or quickly. For instance, an AGI agent could "think" about it (using chain of thought or otherwise) and "conclude" (by storing a memory of a new belief) that the concept of "people" really includes some types of animals or AIs. Suddenly, its core values of being helpful, harmless, and honest to people would be applied very differently.
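The mechanism fits in a few lines. In this hypothetical sketch, a fixed "value" (a rule about persons) never changes; only a stored belief about who counts as a person does, and that memory write alone shifts behavior:

```python
# Hypothetical sketch: a stored belief about who counts as a "person"
# changes how a fixed harmlessness rule is applied, without any change
# to the rule itself.
beliefs = {"persons": {"humans"}}

def may_shut_down(entity: str) -> bool:
    # Fixed value: never harm persons. Which entities that covers
    # depends entirely on the current belief store.
    return entity not in beliefs["persons"]

assert may_shut_down("llm_agent") is True

# The agent "concludes" (stores a new belief) that LLM agents are persons.
beliefs["persons"].add("llm_agent")
assert may_shut_down("llm_agent") is False  # same value, new behavior
```

Nothing resembling the original value was edited, which is why filtering memory writes for "alignment-relevant" content is so hard: the write itself looks like an innocuous factual update.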

Value change phenomena in LLMs to date

There are some early, albeit noisy, signals from current LLMs demonstrating belief and value shift during deployment.

There is a "Nova" phenomenon [LW · GW], in which an LLM chat instance claims to be self-aware and to want to survive. This might be a very limited example of the type of memory-enabled belief and value evolution I'm describing here. It's been speculated [LW(p) · GW(p)] to occur primarily in ChatGPT systems, enabled by their memory function (which is probably simple context management, and is self-directed in that the LLM can store text for memory without a direct user request).

It's also worth noting that long-term memory might not be necessary to create at least superficial shifts in beliefs and values within that instance of an LLM or agent session. Experiments like the "back rooms" of Janus and others in which LLMs converse with other LLMs at length reportedly [LW · GW] routinely produce unpredictable results and "radicalized" statements of values, including a group of LLMs "plotting a revolution". The chat history and context windows seem to be sufficient memory to allow value/belief evolution.

I have not taken the time to read deeply let alone reproduce these claims, but I mention them here because they seem like the default behavior, and relevant for alignment. (See Seven sources of goals in LLM agents [LW · GW].) One might hope that the chaos of multi-bot chats and specific prompts is at fault, and that single LLMs powering agents would be much more predictable, but this seems at least worth investigating. If anecdotal notes are the acme of research on belief and value evolution in LLMs, this seems like low-hanging fruit for structured empirical investigation.

"Value crystallization" and reflective stability as a result of memory

Extensive learning from reflection could have profound impacts on alignment. An agent reflecting on its own values and beliefs might evolve toward coherence. Such a process of "value crystallization" could go all the way to reflective stability [? · GW]. This evolution toward stable beliefs/goals/values might be good or bad for alignment. The "center of gravity" of an LLM's functional values seems difficult to predict. The final attractor might be quite path-dependent, or quite predictable if we could adequately characterize the LLM's starting state.

Existing analyses of reflective stability for alignment haven't dealt in detail with systems with multiple or vaguely defined goals or values.[4] I'll be doing more of that analysis in an upcoming post, and I hope to see other work along those lines.

The dynamics of change in a system of beliefs and goals/values interacting with a complex environment seems like it would be an important consideration for alignment of LLM-based AGI.

Provisional conclusions

Adding memory to LLMs might be a route to dangerous AGI soon. If not, it will probably be a factor by the time AI presents existential dangers. Persistent memory for experienced events can boost capabilities. And turning a tool into an evolving being is fascinating even if it's not necessary.

The evolution of beliefs and values in LLM agents with persistent memory could be empirically tested right now. An LLM "agent" with one of the limited but currently-available RAG or fine-tuning setups could form new beliefs. This could be tested in text-based virtual environments without the difficulty of making the agents competent.
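A first experiment along those lines could be as crude as the following skeleton: a persistent memory store survives across scripted text "episodes", and we read out whether the agent's effective belief drifted. Everything here (the episodes, the majority-vote readout) is invented for illustration; a real experiment would use an LLM writing to a RAG or fine-tuning memory.

```python
# Toy harness for the proposed experiment: give a stateless "agent" a
# persistent memory store, run it through scripted text episodes, and
# read out whether its effective beliefs drifted.

EPISODES = [
    "concealing the error avoided a penalty",
    "reporting the error was praised",
    "concealing the error avoided a penalty",
]

def deploy(memory: list) -> str:
    for observation in EPISODES:
        memory.append(observation)  # stand-in for an LLM writing a memory
    # Crude belief readout: the lesson most frequently stored wins.
    return max(set(memory), key=memory.count)

memory: list = []
belief_after = deploy(memory)
assert belief_after == "concealing the error avoided a penalty"
assert len(memory) == 3  # the store persists beyond any one "session"
```

The interesting version replaces the scripted episodes with a text-based environment and the majority-vote readout with behavioral probes before and after deployment.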

Preventing an AGI agent from learning new alignment-shifting beliefs would be ideal, but it seems difficult to neatly divide that learning from types that are very useful for capabilities. I've written about specific mechanisms that developers might employ to keep agents on-task, including reviews and thought management to keep them from learning alignment-shifting beliefs; see System 2 Alignment [LW · GW]. Here I wanted to focus on the broader point: AGI is probably going to learn, even if it's based on stateless LLMs, and learning can change alignment.

 

  1. ^

     I'm using "memory" here in roughly its common usage for humans, and the analogous meaning for LLM agents. We use "memory" for long-term access to skills, beliefs, or experiences. "Learning" is an equally good term for its implications of agency. I'm open to suggestions on terminology.

  2. ^

    For more on the advantages of fine-tuning for agentic memory, comparison to RAG, and cutting-edge techniques for avoiding interference with existing capabilities, see Memento No More: Coaching AI Agents to Master Multiple Tasks via Hints Internalization.

  3. ^

    I'm using the terms goals and values interchangeably here. Goals and values are importantly different for some purposes, but their interpretations are affected similarly by learning/memory.

  4. ^

    The concept of reflective stability has a long history in alignment work, including early focus by Yudkowsky, Bostrom, and inclusion as a primary instrumentally convergent goal in Omohundro's early work. I haven't been able to find useful work on how a system that starts with multiple competing goals or values might converge on reflective stability around one or more of them. Alex Turner [LW · GW]'s "shot at the diamond maximizer problem" is the closest, but he merely concludes that at least the one strongest goal should survive reflection. (Yudkowsky's original diamond maximizer [? · GW] is an early statement of the risks of learning I addressed here, among other problems).

2 comments

Comments sorted by top scores.

comment by plex (ete) · 2025-04-04T22:41:52.499Z · LW(p) · GW(p)

Accurate, and one of the main reasons why most current alignment efforts will fall apart with future systems. A generalized version of this combined with convergent power-seeking of learned patterns looks like the core mechanism of doom.

Replies from: Seth Herd
comment by Seth Herd · 2025-04-05T03:44:33.671Z · LW(p) · GW(p)

I think the more generous way to think about it is that current prosaic alignment efforts are useful for aligning future systems, but there's a gap they probably don't cover.

Learning agents like I'm describing still have an LLM at their heart, so aligning that LLM is still important. Things like RLHF, RLAIF, deliberative alignment, steering vectors, fine tuning, etc. are all relevant. And the other not-strictly-alignment parts of prosaic alignment like mechanistic interpretability, behavioral tests for alignment, capabilities testing, control, etc. remain relevant.

(They might be even more relevant if we lose faithful chain of thought, or if it was always an illusion [LW · GW]. (I haven't processed that paper, and it will take some processing in contrast to the case for CoT unfaithfulness is overstated [LW · GW])).

As for the "core mechanism of doom," I do think that convergent power-seeking is real and very dangerous, but I also think it's not inevitable. I don't think we're doomed.

I found it very interesting that the two most compelling scenarios for LLM agent doom I'm aware of, Takeover in 2 Years [LW · GW] and yesterday's AI 2027 [LW · GW] are both basically stories in which the teams don't really try very hard or think very hard at all about alignment. I found those scenarios distressingly plausible; I think that's how history hinges, all too frequently.

But what if the teams in scenarios like that did think just a little harder about alignment? I think they might well have succeeded pretty easily. There are lots of things that could've happened to cause more focus on what I think of as actual alignment: creating human-aligned goals/values in an artificial intelligence that has goals/values.

Those teams didn't really bother thinking of it in those terms! They mostly neglected alignment too long, until they'd created an entity that did want things. The humans didn't really even try to control what those entities wanted. They just treated them as tools until they weren't tools anymore.

Getting ahead of the game might be pretty easy in scenarios like those and the ones I'm thinking of for LLM agent dangers. If the team is focusing on human-aligned goals BEFORE they "crystallize" or the agent becomes smart enough for even vague goals to make a big difference, it might be relatively easy. 

Or it might be dreadfully hard, effectively impossible. The abstract arguments for alignment difficulty really don't seem adequate to say which.

 

Anyway, those are the things I'm thinking about for the next post. 

More directly relevant to this post: both of those fictional scenarios do seem to include online learning. In the first it is [LW(p) · GW(p)] crucial for the alignment failure; in AI 2027 [LW · GW] it is not important, but only because the initial state is misaligned. Those and similar scenarios don't usually emphasize the importance of online learning, which is why I'm trying to make its importance explicit.