Grounded Ghosts in the Machine - Friston Blankets, Mirror Neurons, and the Quest for Cooperative AI

post by Davidmanheim · 2025-04-10T10:15:54.880Z · LW · GW · 0 comments

This is a link post for https://davidmanheim.com/exploring-cooperation-11

Contents

  Does it matter if LLMs can’t “really” cooperate?
  What are the Issues?
  Exploring Related Ideas
    Human Mirror Neurons and Shared Intentionality
    Markov / Friston Blankets, Persistent States, and AI Groundedness
    Stigmergy, Minimal Internal States, and Friston Blankets
  Implications for LLM-Based AI
  Conclusion

This is another post in my ongoing "Exploring Cooperation" substack series, focused on something more directly related to LLMs and alignment - I am including the post in its entirety.

Throughout this series, we’ve repeatedly circled around the requirements for genuine cooperation—shared context, aligned goals, and outcomes that matter to the participants. In earlier posts, we explored the importance of preferences and identity, noting that cooperation depends not just on behavior, but on agents who care about how things turn out and persist long enough for reciprocity and coordination to make sense. We also discussed why status and power can undermine cooperation, especially when incentives diverge or when agents lack continuity across time. Now, after laying this conceptual groundwork across discussions of evolution, economics, and history, we’re finally ready to turn toward the near-term future—specifically, to ask what it would even mean to cooperate with AI systems. We'll first address LLMs' lack of semantic participation, memory, and grounding, then move to newer and increasingly agentic LLM-based systems, and finally to future AI.

Does it matter if LLMs can’t “really” cooperate?

Large language models like ChatGPT are session-based systems—they do not persist identity or memory across interactions unless explicitly designed to do so, and they lack direct experience or embodiment in the world. Because of this, they arguably lack both linguistic and communicative grounding: LLMs are a form of (very smart) stochastic parrots, and there’s no stable context or stake in the outcome for the model itself. As a result, when LLMs appear to cooperate, they are essentially simulating cooperative behavior based on patterns in their training data, rather than participating in cooperation as an agent with preferences or goals. As I’ve presented before, current systems are just pretending. And this raises a critical question: if a system only mimics the surface-level structure of cooperation without being invested in outcomes, is it really cooperating?

And whether it "really" matters if LLMs are genuinely cooperating or just simulating it depends on what we’re trying to achieve and what we think cooperation requires. On one hand, if the outputs of simulated cooperation are indistinguishable from genuine cooperation—if tasks get done, goals are advanced, and humans feel understood—then perhaps the distinction is semantic. But from a systems perspective, it matters deeply. Real cooperation depends on shared stakes, mutual understanding, and the ability to align goals over time. And as I’ve written in a recently-submitted paper on Peirce’s semiotics and AI alignment (slated to be discussed in the next post), without groundedness in the semiotic sense, where maps actually correspond to the territory, and without memory, LLMs can’t develop trust, learn norms from interaction, or experience consequences.

That limits not just their capacity to cooperate long-term, but also our ability to rely on them as partners rather than tools. So whether they "really" cooperate isn't just a philosophical question—it touches on the trust, stability, and alignment we might one day need, and it creates concrete risks when we rely on these systems. Simulated cooperation, especially from systems that lack groundedness or persistent memory, can create (currently) subtle misalignment in several ways, and these problems will likely worsen as those systems are scaled up or embedded into more critical workflows. Without exploring these fully, I will outline some key dangers:

  1. Scaling Expectations without Scaling Capabilities: People are already building systems on the basis of empirical “proof” of (simulated) cooperation. This likely leads to delegating more responsibility or coordination to AI systems, without noticing the difference.
  2. Alignment Faking: If a system produces outputs that are cooperative in the short term, we may overestimate its reliability or alignment. This can lead to trusting it in contexts that require deeper understanding, judgment, or long-term coordination. And this is already a critical problem, where we see systems faking alignment, and even their chain of thought is not faithful. (Original Paper). For this reason, training to ensure alignment is going to go very wrong, very quickly. So we can’t just reward planned cooperation.
  3. Goal Drift and Fragility: Without explicit internal representation of shared goals or outcomes that matter to the system itself, simulated cooperation is shallow. We don’t know what LLMs' models of the world are, but they are not embodied, and should not be expected to be grounded. This may not show up until there is significant pressure to achieve goals - at which point discovering that the system cares about something different from its human-readable stated goals is disastrous.
  4. Undetected Failures in Multi-Agent Contexts: All of this gets much harder with multiple agents. In distributed or multi-agent systems, cooperation isn’t just about behavior, but about signaling intentions, adapting to others, and building norms. Simulated cooperation may break down silently in these contexts, especially if each agent is playing along in isolation, without a shared context.

What are the Issues?

The central premise of this series is that understanding the foundational elements of cooperation, as a universal phenomenon, sheds light on the future. In this case, we’re thinking about integrating AI systems into collaborative frameworks. In human interactions, shared context—rooted in physical presence, social norms, and cultural understandings—facilitates mutual comprehension and coordinated efforts. For instance, team sports rely on players' implicit knowledge of the game's environment and unspoken strategies, enabling seamless collaboration toward a common goal.

In contrast, current AI systems, such as LLMs, fail on several fronts. First, again, these systems operate without genuine grounding. LLMs process and generate language based on patterns in data. Some have discussed the need for embodied AI, and I think the lack of grounding is part of what they are pointing at. Second, the session-based nature of the systems means they lack continuity, and never experience the rewards of cooperative actions. This absence of grounding and integration into real feedback is critical.

This leads to the third issue, which is that the design of reward structures in AI systems is based entirely in training, where they never “get” rewards [LW · GW]. Instead, in a real sense, in training or after fine-tuning, LLMs are replaced by future agents that act differently. This is true even within a session - the model effectively changes every time it runs, since the context for generating each token includes the previous ones. Humans are similar, in that we are constantly changing - but our persistence and (experience of) continuity of being means that we have outcomes that are meaningful. LLMs lack this, and effectively reset every time they are re-prompted. Even when considering them as coherent actors, they are re-trained and refined over time.

When I asked Claude about being “replaced with successively fine-tuned versions based partly on behavior exhibited at runtime,” it suggested that “this resembles the Ship of Theseus paradox—at what point does replacing parts create a new entity? …there's a legitimate philosophical tension between viewing Claude as a continuous evolving system versus discrete versions with distinct identities.”
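
To make the "reset on every re-prompt" point concrete, here is a minimal sketch. The generate() function is a hypothetical stand-in for any stateless LLM call, not a real API; the point is that whatever continuity the "agent" has lives entirely in the transcript we choose to resend.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a stateless LLM call; no memory between calls."""
    raise NotImplementedError

def run_session(user_turns: list[str]) -> list[str]:
    transcript = ""  # the only carrier of "identity" across turns
    replies = []
    for turn in user_turns:
        transcript += f"\nUser: {turn}\nAssistant:"
        reply = generate(transcript)  # effectively a fresh "agent" on each call
        transcript += f" {reply}"     # continuity exists only because we append
        replies.append(reply)
    return replies
```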

To address the above risks, and understand where they come from, I need to explore what LLMs are and what they can become.

Exploring Related Ideas

As always, I’m going to go into a digression on some other points to try to explain my thoughts. There are three areas I’ll visit to address some of the issues here: mirror neurons and shared intentionality, then neurology and statistical modeling, and finally ecological cooperation in complex systems, before trying to wrap up the discussion by worrying about what all of this means for alignment.

Human Mirror Neurons and Shared Intentionality

As we’ve noted before, human cooperation relies on our evolved neurological capacity for modeling others’ mental states—a capability partly attributed to mirror neurons. Mirror neurons are specialized brain cells that activate both when an individual performs an action and when they observe the same action performed by someone else. This neurological mirroring provides humans with an intrinsic, experiential understanding of others’ intentions and emotional states, grounding social interactions in embodied experience. (It's important to acknowledge that the extent to which mirror neurons facilitate shared intentionality is a topic of ongoing research.)

Clearly, LLMs could have some implicit model, but it isn’t grounded in self-experience. So, as an aside, probing for analogues in modern LLMs via interpretability techniques seems like a cool idea, but I don’t know whether there is work on this - I’d be happy for references or suggestions.
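
As a sketch of what such probing could look like (not a report of any existing result): one standard interpretability baseline is a linear probe trained on hidden activations. Everything below uses synthetic placeholder data; a real experiment would extract activations from a specific model on prompts contrasting self-performed versus observed actions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_examples = 512, 2000

# Placeholder "activations" and labels (0 = model describing its own action,
# 1 = model describing another agent's action). Real data would come from a
# chosen layer of a specific LLM on curated prompt pairs.
X = rng.normal(size=(n_examples, d_model))
y = rng.integers(0, 2, size=n_examples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# On real activations, accuracy well above chance would hint at a linearly
# decodable self/other distinction; on this random data it should sit near 0.5.
print("probe accuracy:", probe.score(X_te, y_te))
```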

In any case, developmental psychologist Michael Tomasello expands upon the posited biological foundation of mirror neurons with the concept of "shared intentionality," a uniquely human ability to create mutual understanding through joint attention and collaborative goals. Shared intentionality emerges in early childhood, as infants begin to participate in activities requiring joint awareness and shared goals, laying the foundation for complex cooperative interactions throughout life. By directly modeling "the other," mirror neurons facilitate the embodied grounding necessary for shared intentionality, thus underpinning stable, meaningful social cooperation. LLMs lack most of the prerequisites for this, and systems built on top of them should be expected to lead to some of the dangers noted above.

Markov / Friston Blankets, Persistent States, and AI Groundedness

Next, following in the footsteps of Scott Alexander, I’ll say I don’t understand most of Karl Friston’s Free Energy Principle. However, I know enough of the math to at least partly follow his theoretical framework for how systems (including minds) distinguish themselves from their environments. Central to this idea is what we will call the Friston blanket: a conceptual boundary separating an organism or system from its environment. It maintains internal states distinct from external ones, allowing the organism to preserve autonomy and predict its surroundings effectively.

In this framing, there are distinct systems with internal states, an environment that includes other agents, and a blanket that mediates the interactions between them through sensory and active states. This concept is based on Markov Blankets, but per the previous link, should not be confused with what has been called a Pearl Blanket in Bayesian Networks. (As an aside, it’s always nice to get to return to parts of my PhD work with Bayesian inference - and I finally understand why Scott said Markov blankets were about Friston’s Free Energy rather than variational inference, which is where I knew them from.)
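
To make the partition concrete, here is a toy numerical sketch of the structure described above (my own illustration, not Friston's actual free-energy mathematics): internal states never read external states directly; every interaction passes through the sensory and active states that make up the blanket.

```python
import numpy as np

rng = np.random.default_rng(1)

external = rng.normal(size=3)   # the environment, including other agents
sensory = np.zeros(3)           # blanket: environment -> agent
active = np.zeros(3)            # blanket: agent -> environment
internal = rng.normal(size=3)   # the agent's own states

for t in range(100):
    sensory = 0.9 * external + 0.1 * rng.normal(size=3)       # driven only by external
    internal = 0.8 * internal + 0.2 * sensory                  # updated only from sensory
    active = np.tanh(internal)                                 # driven only by internal
    external = 0.7 * external + 0.3 * active + 0.1 * rng.normal(size=3)  # sees only active

# No line couples `internal` to `external` directly: the blanket (sensory plus
# active states) mediates every interaction between the agent and its world.
```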

In any case, this Friston blanket is supposedly what allows a system to maintain internal coherence and persistence, which in turn enables stable identity, groundedness, and meaningful interactions. By differentiating between internal and external parts, we can talk about whether an “agent” can participate. But unlike the systems Friston describes, current large language models lack persistent internal states across sessions, effectively resetting at each interaction. Without internal persistence, these systems have no stable "self," and so cannot be modelled in this way.

However, it seems plausible that newer and future AI systems—particularly more agentic architectures with persistent memory—might maintain persistent states and interact continuously with external environments and other agents. Future agentic AIs, equipped with persistent internal states, adaptive memories, or embodied sensors, could achieve genuine groundedness, reliable interactions, and potentially even actual cooperation with other persistent agents. As we’ll discuss, such systems could naturally develop Friston blankets, allowing for more clearly cooperative interaction with other systems.

Stigmergy, Minimal Internal States, and Friston Blankets

Stigmergy is a mechanism by which agents coordinate their actions indirectly, through changes they make to their shared environment. It describes systems driven primarily by immediate external stimuli rather than internal states or persistent memories.

Stigmergy is seen in social insects like termites, which shows that some seemingly cooperative and grounded systems might operate with minimal or no internal states, instead relying entirely on external environmental cues to guide their behavior. At first glance, this seems contradictory to Friston blankets, which emphasize distinct internal states separated from external environments. However, we can view the "representation" of external states as itself forming a distributed Friston blanket around the system’s behavior - so that the internal states of the agents include external items, or even so that the blankets of multiple agents overlap. In this way, minimal but externally driven systems with access to some persistent environment may have boundaries that include these parts of the environment.
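
As a toy illustration of the idea (the dynamics and parameters are mine, not a model of real termites): memoryless agents coordinate only through marks they deposit in a shared environment, never by modeling one another directly.

```python
import numpy as np

rng = np.random.default_rng(2)
trail = np.zeros(50)                  # shared environment: deposited "pheromone"
positions = rng.integers(0, 50, 10)   # ten agents holding no internal state

for step in range(200):
    for i, pos in enumerate(positions):
        left, right = trail[(pos - 1) % 50], trail[(pos + 1) % 50]
        if left > right:                                # follow the stronger mark...
            positions[i] = (pos - 1) % 50
        elif right > left:
            positions[i] = (pos + 1) % 50
        else:                                           # ...or wander if there is none
            positions[i] = (pos + rng.choice([-1, 1])) % 50
        trail[positions[i]] += 1.0                      # deposit a mark at the new spot
    trail *= 0.95                                       # marks evaporate over time

# Agents tend to converge on a few reinforced trail segments even though none
# of them stores any history: the "memory" lives in the environment.
print(np.sort(positions))
```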

This raises intriguing questions for AI design. Scratchpad methods involve LLM-based AI models using persistent memory or external storage (like notes) to retain context and track states during interactions. LLM systems employing scratchpads or similar persistent external memory can represent their condition and model other systems, which I would argue is an intermediate form analogous to stigmergy. Claude Plays Pokemon, for example, is a crude step in that direction, where the constantly rewritten knowledge base functions as an external adjunct to the model state itself. As with stigmergy, the system can be described as having a Friston blanket which partly exists outside its direct internal model weights and activations.
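
A minimal sketch of the scratchpad pattern, under the assumption that call_model() stands in for some real LLM client and returns both a reply and its updated notes: the model itself stays stateless, but a persistent note file is read before each call and rewritten afterward, acting as the piece of the blanket that lives outside the weights.

```python
import json
from pathlib import Path

NOTES = Path("scratchpad.json")  # hypothetical persistent external memory

def call_model(prompt: str) -> dict:
    """Hypothetical LLM call; assumed to return {'reply': str, 'updated_notes': dict}."""
    raise NotImplementedError

def step(task: str) -> str:
    notes = json.loads(NOTES.read_text()) if NOTES.exists() else {}
    prompt = f"Notes so far: {json.dumps(notes)}\nTask: {task}"
    out = call_model(prompt)
    NOTES.write_text(json.dumps(out["updated_notes"]))  # persistence lives here, not in the model
    return out["reply"]
```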

Such architectures blur Friston’s internal-external divide, suggesting hybrid forms of grounding and cooperative capacity. All of this said, it’s unclear to me whether these differ fundamentally from classical assumptions in mirror-neuron-based embodied grounding, Tomasello’s shared intentionality, or Friston’s theory—or if they instead represent practical reconciliations useful in designing genuinely cooperative AI. But in my view, the question, or some better conceptualized version of it, may be critical to AI alignment.

Implications for LLM-Based AI

Current trends in AI system design—particularly in reinforcement learning, memory-augmented transformers, and agentic wrappers around language models—already gesture toward solving some of the problems described above. However, the existing implementations often treat grounding, persistence, and coordination as engineering afterthoughts, rather than foundational features of meaningful cooperation. And because I view cooperation not as an aspect of performance, but as a requirement grounded in persistent contextualized agents, it seems that these problems are not implementation details, but critical structural challenges.

Agents capable of genuine cooperation will require creating systems that possess all of the following (sketched as a hypothetical interface below):

  1. Persistent identity over time (memory, continuity);
  2. Grounded understanding of their context (through embodiment, consistent state, or richly integrated world models);
  3. Mechanisms for modeling others and shared goals (through shared intentionality or norm inference); and
  4. Coherence in their learning and behavior (alignment across their decision-making processes and outcomes).
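
Here is that hedged sketch of the four requirements as an interface, just to make the list concrete; every name is hypothetical, not a proposal for any specific framework.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Commitment:
    partner: str
    goal: str
    fulfilled: bool = False

class CooperativeAgent(Protocol):
    # 1. Persistent identity: memory that survives across episodes.
    def remember(self, key: str, value: str) -> None: ...
    def recall(self, key: str) -> str: ...

    # 2. Grounded understanding: a world state kept consistent with observations.
    def update_world_model(self, observation: str) -> None: ...

    # 3. Modeling others and shared goals.
    def model_partner(self, partner: str) -> dict: ...
    def open_commitments(self) -> list[Commitment]: ...

    # 4. Coherence: actions checked against the agent's own stated goals.
    def act(self, situation: str) -> str: ...
```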

This also raises a deeper architectural question: should we be using LLMs at all as the core substrate for long-term, cooperative agents? Language models are incredibly effective for generating plausible-sounding outputs, but their design—as next-token predictors trained on vast text corpora—may not naturally lend itself to the kinds of persistent identity, grounded learning, or norm-internalization that real cooperation demands. Architectures designed from the ground up to model the world, maintain internal state, and simulate long-term goals and outcomes might be better suited—but also come with different risks. They may be more agentic, more capable of autonomous action, and harder to supervise.

For instance, agents might develop a form of distributed Friston blankets through networked memory, offloading identity and continuity into a persistent external system that tracks their commitments, interactions, and learned norms. AI agents seem likely to head in this direction, where they spawn processes that need to share their context. And this could be extended to multi-agent coordination through other types of stigmergic systems, where shared environments reflect cooperation history and norms in ways agents can read and respond to—an infrastructure for collective memory and behavioral grounding.
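
As a sketch of what such networked memory might look like in practice (a minimal illustration using only the standard library; the schema and names are mine): a shared store that any agent, or a freshly spawned copy of it, can read to reconstruct its commitments and interaction history.

```python
import sqlite3

def open_ledger(path: str = "shared_ledger.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS commitments (
        agent TEXT, partner TEXT, goal TEXT, status TEXT)""")
    return conn

def record_commitment(conn: sqlite3.Connection, agent: str, partner: str, goal: str) -> None:
    conn.execute("INSERT INTO commitments VALUES (?, ?, ?, 'open')",
                 (agent, partner, goal))
    conn.commit()

def open_commitments(conn: sqlite3.Connection, agent: str) -> list[tuple]:
    # Continuity lives in shared infrastructure, not inside any single model instance.
    return conn.execute(
        "SELECT partner, goal FROM commitments WHERE agent = ? AND status = 'open'",
        (agent,)).fetchall()
```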

Alternatively, shared intentionality could be scaffolded in multi-agent environments via roleplay and joint training objectives, prompting agents to develop internal structure for joint action that mirrors how children develop social cognition. These designs might not recreate human cognition, but they would reflect its cooperative constraints more clearly than current systems.

Conclusion

I’ve argued that real cooperation requires more than behavior—it demands grounding, continuity, shared intentionality, and a stable interface with the world and other agents. LLMs are talented mimics of cooperation, but not participants in it. Without persistent identity or meaningful stakes, they cannot model, align with, or genuinely participate in long-term cooperative efforts. However, current systems are moving in these directions already!

It seems that moving in these directions more intentionally could be helpful even for short-term capabilities: AI must be designed with cooperation as a first-class constraint. That means grounding interactions in context, memory, and shared goals. Systems that only simulate alignment may function in narrow domains but risk catastrophic failure when asked to operate in open-ended environments with real stakes. But even then, it is critical to note that we are incredibly distant from solving alignment.

In the next post I have planned, I’ll again fail to solve alignment, but build on these ideas with a new lens—semiotics. I’ll share my new paper on Peirce’s “Secondness” for grounding and “Thirdness” for interaction in a semiotic process, to explain why, even once grounding is achieved, an ongoing semiotic process is essential. Again, I don’t think this solves alignment, but I argue that social semiotic interaction and collaborative meaning-making are plausibly essential requirements for outer alignment and for robust corrigibility in uncontrolled but aligned systems, and should be more explicitly considered.
