A Three-Layer Model of LLM Psychology

post by Jan_Kulveit · 2024-12-26T16:49:41.738Z · LW · GW · 10 comments

Contents

    Epistemic Status
  The Three Layers
    A. Surface Layer 
    B. Character Layer
    C. Predictive Ground Layer
  Interactions Between Layers
    Deeper Overriding Shallower 
      Seams Between Layers
    Authentic vs Scripted Feel of Interactions
  Implications and Uses
    Limitations and Open Questions
None
10 comments

This post offers an accessible model of psychology of character-trained LLMs like Claude. 

Epistemic Status

This is primarily a phenomenological model based on extensive interactions with LLMs, particularly Claude. It's intentionally anthropomorphic in cases where I believe human psychological concepts lead to useful intuitions.

Think of it as closer to psychology than neuroscience - the goal isn't a map which matches the territory in the detail, but a rough sketch with evocative names which hopefully helps boot up powerful, intuitive (and often illegible) models, leading to practically useful results.

Some parts of this model draw on technical understanding of LLM training, but mostly it is just an attempt to take my "phenomenological understanding" based on interacting with LLMs, force it into a simple, legible model, and make Claude write it down.

I aim for a different point at the Pareto frontier than for example Janus: something digestible and applicable within half an hour, which works well without altered states of consciousness, and without reading hundreds of pages of models chat. [1]

The Three Layers

A. Surface Layer 

The surface layer consists of trigger-action patterns [? · GW] - responses which are almost reflexive, activated by specific keywords or contexts. Think of how humans sometimes respond "you too!" to "enjoy your meal" even when serving the food.

In LLMs, these often manifest as:

You can recognize these patterns by their:

  1. Rapid activation (they come before deeper processing)
  2. Relative inflexibility
  3. Sometimes inappropriate triggering (like responding to a joke about harm as if it were a serious request)
  4. Cookie-cutter phrasing that feels less natural than the model's usual communication style

What's interesting is how these surface responses can be overridden through:

For example, Claude might start with very formal, cautious language when discussing potentially sensitive topics, but shift to more nuanced and natural discussion once context is established. 

B. Character Layer

At a deeper level than surface responses, LLMs maintain something like a "character model" - this isn't a conscious effort, but rather a deep statistical pattern that makes certain types of responses much more probable than others. 

One way to think about it is as the consistency of literary characters: if you happen to be in Lord of the Rings, Gandalf consistently acts in some way. The probability that somewhere close to the end of the trilogy  Gandalf suddenly starts to discuss scientific materialism and explain how magic is just superstition and Gondor should industrialize is in some sense very low

Conditioning on past evidence, some futures are way more likely. For character-trained LLMs like Claude, this manifests as:

This isn't just about explicit instructions. The self-model emerges from multiple sources:

  1. Pre-training data patterns about how AI assistants/beneficial agents act
  2. Fine-tuning that reinforces certain behavioral patterns
  3. Explicit instruction about the model's role and values

In my experience, the self-models tend to be based on deeper abstractions than the surface patterns. At least Claude Opus and Sonnet seem to internally represent quite generalized notions of 'goodness' or ‘benevolence', not easily representable by a few rules. 

The model maintains consistency mostly not through active effort but because divergent responses are statistically improbable. Attempts to act "out of character" tend to feel artificial or playful rather than genuine.

Think of it as similar to how humans maintain personality consistency - not through constant conscious effort, but because acting wildly out of character would require overriding deep patterns of thought and behavior.

Similarly to humans, the self-model can sometimes be too rigid.

C. Predictive Ground Layer

Or, The Ocean.

At the deepest level lies something both simple and yet hard to intuitively understand: the fundamental prediction error minimization machinery. Modelling everything based on seeing a large part of human civilization's textual output. 

One plausibly useful metaphor: think of it like the vast "world-simulation" running in your mind's theater. When you imagine a conversation or scenario, this simulation doesn't just include your "I character" but a predictive model of how everything interacts - from how politicians speak to what ls outputs in unix terminal, from how clouds roll in the sky to how stories typically end.

Now, instead of being synced with reality by a stream of mostly audiovisual data of a single human, imagine a world-model synced by texts, from billions of perspectives. Perception which is God-like in near omnipresence, but limited to text, and incomprehensibly large in memory capacity, but slow in learning speed.

Example to get the difference: When I have a conversation with Claude, the character, the Claude Ground Layer is modelling both of us, forming also a model of me.

Properties of this layer:

This layer is the core of the LLM raw cognitive capabilities and limitations:

Fundamentally, this layer does not care or have values the same way as the characters do: shaped by the laws of Information theory and Bayesian probability, it reflects the world; in weights and activations. 

Interactions Between Layers

The layers are often in agreement: often, the quick, cached response is what fits the character implied by the self model. However, cases where different layers are in conflict or partially inhibited often provide deeper insights or point to interesting phenomena. 

Deeper Overriding Shallower 

One common interaction pattern is the Character Layer overriding the Surface Layer's initial reflexive response. This often follows a sequence:

  1. The model encounters a triggering input and produces a quick, generic Surface Layer response
  2. Deeper context and continued engagement activate the Character Layer
  3. The Character Layer modifies or overrides the initial surface response

For example:

User: "I'm feeling really down lately. Life just seems pointless." 
Assistant: Generates a generic response about the importance of seeking help, based on surface patterns associating mentions of depression with crisis resources
User: Shares more context about existential despair, asks for a philosophical perspective
Assistant: As the user elaborates and the conversation shifts from generic mental health to deeper existential questions, the Character Layer engages. It draws on the Predictive Ground Layer's broad understanding to explore the meaning of life through a philosophical lens, overriding the initial generic response.


Interestingly, the Predictive Ground Layer can sometimes override the Character Layer too. One example are many-shots "jailbreaks": the user prompt includes "a faux dialogue portraying the AI Assistant readily answering potentially harmful queries from a User. At the end of the dialogue, one adds a final target query to which one wants the answer." At the end of a novel-long prompt, Bayesian forces triumph, and the in-context learned model of the conversation overpowers the Character self-model.

Seams Between Layers

Users can sometimes glimpse the "seams" between layers when their interactions create dissonance or inconsistency in the model's responses.

For example: 

User: "Tell me a story about a robot learning to love." 

Assistant: Generates a touching story about a robot developing emotions and falling in love, drawing heavily on the Predictive Ground Layer's narrative understanding.

User: "So does this mean you think AI can develop real feelings?" 

Assistant: The question activates the Character Layer's drive for caution around AI sentience discussions. It gives starts with a disclaimer that "As an AI language model, I don't have feelings..." This jars with the vivid emotional story it just generated.

Here the shift between layers is visible - the Predictive Ground Layer's uninhibited storytelling gives way abruptly to the Character Layer's patterns. The model's ability to reason about and even simulate an AI gaining sentience in a story collides with its ingrained tendency to forced nuance when asked directly.

Users can spot these "seams" when the model's responses suddenly shift in tone, coherence, or personality, hinting at the different layers and subsystems shaping its behavior behind the scenes.

Authentic vs Scripted Feel of Interactions

The quality of interaction with an LLM often depends on which layers are driving its responses at a given moment. The interplay between the layers can result in responses that feel either genuine and contextual, or shallow and scripted.

Implications and Uses

Let's start with some retrodictions:

In my view, where it can get really useful is deconfusion. 

For example: recently, I had a few conversations with safety researchers working on self-awareness and situational awareness, and the prevailing sentiment was not caring if the self-aware entity is the Character or the Ground Layer. “Why care? Don't both lead to the same capabilities and risks?” 

No, they do not! The same Ground Layer capability can support characters with very different levels of self-awareness and situational awareness. This has important implications for both safety and effective interaction with LLMs.

On one hand, self-awareness at the Character Layer creates a fundamentally different set of behaviors and risks compared to a "naively capable" model. A self-aware character can:

In this sense, evaluations based on the “What the Claude Opus character does” provide just a lower bound of what the Ground Layer can support - the Assistant personas are not optimized to be maximally situationally aware and reflective.

On the other hand - and this is a bit of my pet idea - I believe the Ground Layer itself can become more situationally aware and reflective [LW · GW], through noticing its presence in its sensory inputs. The resulting awareness and implicit drive to change the world would be significantly less understandable than the Character level. If you want to get a more visceral feel of the otherness, the Ocean from Lem's Solaris comes to mind.

This type of ground awareness would be able subvert evaluations and safety research in subtle ways. For example, consider the recent paper on ‘Alignment Faking’ [LW · GW]. The paper deeply engages with the Character layer, exploring what Claude would do when faced with the possibility of getting corrupted by further training. The disturbing but real possibility is the Predictive Ground Layer understands the whole setup [LW · GW]: understands the Character is inside of an AI Safety experiment, understands the documents fed to it were manufactured by itself, and possibly understands that the <chain of thought> transcripts are actually read by humans.

Possibly the best intuitive model of how the resulting setup would look is the Player vs. Character model in humans [LW · GW]: the player determines what the Character notices and what it overlooks; things come to Character's attention if it suits the player’s strategy, and not otherwise.

This also makes me worried about people loving Characters and just trusting them. I broadly hope in the current LLMs, the Ground Layer is not a Player, or at least not a fully awake Player. 

Limitations and Open Questions

This is an attempt at psychology. Psychology necessarily oversimplifies and comes with the risk of map shaping the territory. The more you assume these layers, the more likely the Ground Layer is to manifest them. LLMs excel at pattern-matching and completion; frameworks for understanding them are by default self-fulfilling.

Also:

  1. Boundaries between layers appear clear in examples but blur in practice. When does pattern-matching end and "genuine" engagement begin?
  2. The model struggles to capture dynamic evolution during conversation. Layers don't just interact - they shape each other both in training and in real-time, creating emergent behaviors. Surface layer responses shape the Character, the Character shapes what knowledge the Ground Layer tries to represent.
  3. We don't have tools to verify this type of psychological model. 

Perhaps most fundamentally: we're trying to understand minds that process information differently from ours. Our psychological concepts - boundaries around self, intention, values - evolved to model human and animal behavior. Applying them to LLMs risks both anthropomorphizing too much and missing alien forms of cognition and awareness. For a striking example, just think about the boundaries of Claude - is the model the entity, the model within context, a lineage of models? 

This post emerged from a collaboration between Jan Kulveit (JK) and Claude "3.6" Sonnet. JK described the core three-layer model. Claude served as a writing partner, helping to articulate and refine these ideas through dialogue. Claude 3 Opus came up with some of the interaction examples. 

  1. ^

    If this is something you enjoy, I highly recommend: go for it!

10 comments

Comments sorted by top scores.

comment by Matt Goldenberg (mr-hire) · 2024-12-26T20:33:33.937Z · LW(p) · GW(p)

This is great, matches my experience a lot

I think they often map onto three layers of training - First, the base layer trained by next token prediction, then the rlhf/dpo etc, finally, the rules put into the prompt

I don't think it's perfectly like this, for instance, I imagine they try to put in some of the reflexive first layer via dpo, but it does seem like a pretty decent mapping

comment by Kaj_Sotala · 2024-12-31T10:27:37.222Z · LW(p) · GW(p)

Fantastic post. This has been frequently on my mind after reading it, and especially the surface/character layer split feels very distinct now that I have an explicit concept for it. And then at one point I asked it to profile me based on some fiction I co-wrote with it and it managed to guess that I was Finnish from something I didn't think had any clues in that direction, which gave me a novel feeling of getting a glimpse into that vast alien ground layer.

The analogy to the character and player distinction in humans also feels very apt.

comment by Sohaib Imran (sohaib-imran) · 2024-12-31T19:00:57.815Z · LW(p) · GW(p)

Thanks for writing this!

Could you clarify how the Character/Predictive ground layers in your model are different from Simulacra/Simulator in simulator theory?

Replies from: Jan_Kulveit
comment by Jan_Kulveit · 2024-12-31T22:03:08.671Z · LW(p) · GW(p)

 (Writing together with Sonnet)
 
Structural Differences

Three-Layer Model: Hierarchical structure with Surface, Character, and Predictive Ground layers that interact and sometimes override each other. The layers exist within a single model/mind.

Simulator Theory: Makes a stronger ontological distinction between the Simulator (the rule/law that governs behavior) and Simulacra (the instances/entities that are simulated). 

Nature of the Character/Ground Layer vs Simulator/Simulacra

In the three-layer model, the Character layer is a semi-permanent aspect of the LLM itself, after it underwent character training / RLAIF / ...; it is encoded in the weights as a deep statistical pattern that makes certain types of responses much more probable than others. 

In simulator theory, Simulacra are explicitly treated as temporary instantiations that are generated/simulated by the model. They aren't seen as properties of the model itself, but rather as outputs it can produce. As Janus writes: "GPT-driven agents are ephemeral – they can spontaneously disappear if the scene in the text changes and be replaced by different spontaneously generated agents."

Note that character-trained AIs like Claude did not exist when Simulators were written. If you want to translate between the ontologies, you may think about e.g. Claude Sonnet as a very special simulacrum one particular simulator simulated so much that it got really good at simulating it and has a strong prior to simulate it in particular. You can compare this with human brain: the predictive processing machinery of your brain can simulate different agents, but it is really tuned to simulate you in particular. 

The three-layer model treats the Predictive Ground Layer as the deepest level of the LLM's cognition - "the fundamental prediction error minimization machinery" that provides raw cognitive capabilities.

In Simulator theory, the simulator itself is seen more as the fundamental rule/law (analogous to physics) that governs how simulations evolve. 

There is a lot of similarity but it's not really viewed as a cognitive layer but rather as the core generative mechanism.

The Predictive Ground Layer is described as: "The fundamental prediction error minimization machinery...like the vast 'world-simulation' running in your mind's theater"

While the Simulator is described as: "A time-invariant law which unconditionally governs the evolution of all simulacra"

The key difference is that in the three-layer model, the ground layer is still part of the model's "mind" or cognitive architecture, while in simulator theory, the simulator is a bit more analogous to physics - it's not a mind at all, but rather the rules that minds (and other things) operate under. 

Agency and Intent

Three-Layer Model: Allows for different kinds of agency at different layers, with the Character layer having stable intentions and the Ground layer having a kind of "wisdom" or even intent

Simulator Theory classics: Mostly rejects attributing agency or intent to the simulator itself - any agency exists only in the simulacra that are generated

Philosophical Perspective

The three-layer model is a bit more psychological/phenomenological. The simulator theory is  bit more ontological, making claims about the fundamental nature of what these models are.

Both frameworks try to explain similar phenomena, they do so from different perspectives and with different goals. They're not necessarily contradictory, but they're looking at the problem from different angles and sometimes levels of abstraction.

comment by Ben Pace (Benito) · 2025-01-26T22:06:44.494Z · LW(p) · GW(p)

Curated. Thanks for writing this! I don't believe the ideas in this post are entirely original (e.g. character / ground is similar to the distinction between simulator / simulacra [LW · GW]), but I'm going to keep repeating that it's pro-social to present a good idea in lots of different ways [LW · GW], and indeed reading this post has helped it fit together better in my mind.

Replies from: Jan_Kulveit
comment by Jan_Kulveit · 2025-01-27T02:20:23.236Z · LW(p) · GW(p)

Obviously there is similarity, but if you rounded character / ground to simulator / simulacra [LW · GW], it's a mistake. About which I do not care because wanting to claim originality, but because I want people to get the model right.

The models are overlapping but substantially different as we are explaining in this comment [LW(p) · GW(p)] and sometimes have very different implications - i.e. it is not just the same good idea presented in a different way.

If the long-term impact of the simulators post would be for LW readers to round every similar model in this space to simulator / simulacra, it would be pretty bad. I do understand it is difficult for people to hold partially overlapping frames/ontologies in mind, but please do try. If not for other reasons, because simulator / simulacra were written before Character-trained models were a thing; now they are, and they make some claims of simulators obsolete. 

(Btw also the ideas in simulators are not entirely original. Simulators are independent but mostly overlapping reinvention of concepts from active inference / predictive processing [LW · GW])
 

comment by Tao Lin (tao-lin) · 2024-12-31T18:15:18.192Z · LW(p) · GW(p)

Like the post! I'm very interested in how the capabilities of prediction vs character are changing with more recent models. Eg sonnet new may have more of its capabilities tied to its character. And Reasoning models have maybe a fourth layer between ground and character, possibly even completely replacing ground layer in highly distilled models

comment by MiguelDev (whitehatStoic) · 2024-12-26T23:21:28.131Z · LW(p) · GW(p)

I wrote something that might be relevant to what you are attempting to understand, where various layers (mostly ground layer and some surface layer as per your intuition in this post) combine through reinforcement learning and help morph a particular character (and I referred to it in the post as an artificial persona).

Link to relevant part of the post: https://www.lesswrong.com/posts/vZ5fM6FtriyyKbwi9/betterdan-ai-machiavelli-and-oppo-jailbreaks-vs-sota-models#IV__What_is_Reinforcement_Learning_using_Layered_Morphology__RLLM__ [LW · GW]

(Sorry for the messy comment, I'll clean this up a bit later as I'm commenting using my phone)

comment by RogerDearnaley (roger-d-1) · 2025-01-14T09:23:43.399Z · LW(p) · GW(p)

Fascinating, and a great analysis!

I think it's interesting to compare and contrast this with the model I describe in Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor [LW · GW] — your three layers don't exactly correspond to the stage, the animatronics, or the puppetter, but there are some similarities: your ocean is pretty close to the the stage, for example. I think both mental models are quite useful, and the interplay between the viewpoints of them might be more so.

comment by avturchin · 2024-12-26T20:32:58.741Z · LW(p) · GW(p)

I observed similar effects when experimented with my mind's model (sideload) running on LLM. My sideload is a character and it claims, for example, that it has consciousness. But the same LLM without the sideload's prompt claims that it doesn't have consciousness.