GPT-4.5 is Cognitive Empathy, Sonnet 3.5 is Affective Empathy

post by Jack (jack-3) · 2025-04-16T19:12:38.789Z · LW · GW · 2 comments

Contents

  Introduction
  Affective Empathy of Activation Steering
  Cognitive Empathy as World Models
  Conclusion
2 comments

My friend told me this was the thing I should make an account here to post; I hope it is appropriate.

Introduction

The thrust of this hypothesis is that Anthropic's research direction favors the production of affective empathy while OpenAI's direction favors cognitive empathy. 

There are a variety of methods employed to make LLMs appear more friendly and personable, but the actual manner in which these manifest differs. The two major directions for developing LLM personability that I am delineating here are:

1. Affective, where an LLM simply behaves in a way that is empathetic without necessarily understanding why a user may be feeling some way.

2. Cognitive, where an LLM develops a sophisticated world model that predicts how the user is feeling without necessarily directing the LLM to act on that prediction.

By enhancing these aspects with RLHF there is the promise of ensuring that the affective empathetic response provides the signals needed to reassure a user, while giving the cognitive empathetic response a stronger bias towards actually being nice. While both can result in similar behaviour and levels of personability, there do appear to be significant cost trade-offs, and I hope to spark a broader discussion on how these might produce different failure modes.

The reason I believe this to be important is the broader strokes of the research directions at Anthropic and OpenAI, which, insofar as can be inferred from the limited publicly available data at hand, seem to favor affective and cognitive empathy respectively. While this is speculative, I believe the bones are still valid regardless of the specifics of each lab's methods. What are the risks of having only one type of empathy in our LLMs? What is the risk of having both?

Affective Empathy of Activation Steering

Affective empathy allows someone to reactively minimize surprise at the emotional state of another entity in low-information environments. Importantly, the response often comes without significant awareness of why, with post-hoc rationalizations tacked on. In short, you don't need to be "smart" to experience high levels of affective empathy.

To start, I want to make the case that most of the gains in Sonnet 3.5 appear to be the result of activation steering, which was first properly showcased in the now-famous Golden Gate Claude. It was cute watching Claude struggle to talk about important topics without bringing up the veritable visual bouquet of the foggy redwoods and graceful architecture of the Golden Gate Bridge. The research promised a method of preventing toxic, dangerous, and even sycophantic behaviour; in fact, one would not be wrong to think that is its only real value. But buried in this brief were some interesting tidbits, primarily the ability to identify "representations of the self" in the LLM. Around the same time, research was coming out on how LLMs "know" when they are wrong. So Anthropic had now gathered the tools to manipulate the affect, behaviour, and quality of output of their models.

The next question is whether they are using this technique. Considering Anthropic was so proud that they made a graph of how intelligence increased without increasing the price, there is some circumstantial support. The chain of reasoning: activation steering does not change the actual compute requirements, they had just completed a detailed look at the internals of the model, and then their model was better than ever with no change in cost. While there are probably a variety of other optimization techniques baked in, there is enough circumstantial evidence to make the case that activation steering was the star. Even Sonnet 3.7, sans reasoning, costs the exact same, which aligns with them further tweaking their activation steering techniques. As some users have noted, it feels like giving Sonnet 3.5 Adderall, which is probably not too far from the truth. With Anthropic's biology-informed perspective, their work increasingly looks like giving a psychologist and a neuroscientist the tools they wish they had for human minds. Biasing an LLM towards empathetic responses and an identity is straightforward, calling to mind how you can convince an LLM it just took a break in order to increase performance.

This focus on tuning the internal activations of their models toward specific goals is similar to how affective empathy works in humans: an almost unconscious behaviour where neurons activate in our minds that transmit the emotional state of the being we are interacting with. By comparison, activation steering biases internal activations toward features such as positive sentiment, matching the user's tone, avoiding conflict, and being a pleasant conversational partner. The result is behaviour that feels far more empathetic to users without necessarily fleshing out Sonnet's world model.
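To make the mechanism concrete, here is a minimal sketch of what activation steering looks like in code. It uses a toy PyTorch module standing in for a single transformer block and assumes a precomputed "steering vector" (e.g. a direction found from contrastive prompts or a sparse autoencoder feature); the names and the strength knob are illustrative, not Anthropic's actual implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Stand-in for one transformer block whose output is the residual stream.
block = nn.Sequential(
    nn.Linear(d_model, d_model),
    nn.ReLU(),
    nn.Linear(d_model, d_model),
)

# Hypothetical "warm / positive sentiment" direction, normalized to unit length.
steering_vector = torch.randn(d_model)
steering_vector = steering_vector / steering_vector.norm()
steering_strength = 4.0  # how hard to push along the direction

def steer(module, inputs, output):
    # Shift every activation along the chosen direction; returning a value
    # from a forward hook replaces the module's output.
    return output + steering_strength * steering_vector

handle = block.register_forward_hook(steer)
x = torch.randn(3, d_model)   # a fake batch of token activations
steered = block(x)
handle.remove()
unsteered = block(x)

# Each activation is shifted by ~steering_strength along the direction;
# downstream layers would read this bias as, e.g., a warmer, more agreeable tone.
print((steered - unsteered) @ steering_vector)
```

The appeal of this approach is that the shift happens at inference time: the weights, and therefore the compute cost, are untouched, which is consistent with the pricing observation above.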

That isn't to say the model has not undergone significant training to change its behaviour; Anthropic's work on red-teaming and RLHF datasets suggests that the surest way to ensure a model is helpful and harmless is to train for it explicitly.

Cognitive Empathy as World Models

Cognitive empathy allows someone to better regulate their emotions and to simulate/narrativize the mind of another being, predicting emotional states in the past, present, and future.

OpenAI has long been compared unfavorably to Anthropic with regards to conversational quality and empathetic responses. While OpenAI has revealed some work on investigating the internal activations of LLMs, there does not appear to be as strong an interpretability contingent as at Anthropic. Indeed, their work has been more focused on developing multi-modal systems, training data, explicit output pruning, and benchmarking. Their models reflect this: well tuned for specific benchmarks, but with much controversy over their ability to generalize beyond them.

Regardless, OpenAI wanted to turn around this opinion about how emotionally cold their models are. They succeeded, with GPT-4.5 being hailed as a success at interpersonal communication, although the results on non-emotional tasks were lackluster. The cost for a model of such caliber? 15x the price for output tokens compared to 4o, and 30x the cost for input tokens. While exact knowledge of the internal architecture is fuzzy, they specifically state that "scaling unsupervised learning increases world model accuracy, decreases hallucination rates, and improves associative thinking. GPT-4.5 is our next step in scaling the unsupervised learning paradigm". The key phrase here is "world model accuracy". OpenAI is trying to make models that have a more accurate internal representation of the user's mental state and desires by shoveling in data and RLHF (although based on the model card this might be an example of small-model feedback). The result is a powerful form of cognitive empathy that can simulate the behaviour of the user and react in an appropriate manner, with the only downside being that it required a computational cost scaling of over an order of magnitude.
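For scale, here is a quick back-of-the-envelope comparison. The per-token prices are my assumption (the launch-era API prices of roughly $75/$150 per million input/output tokens for GPT-4.5 preview versus $2.50/$10 for GPT-4o), not figures from the post; the point is only how the 15x/30x multipliers translate into dollars for a single conversation.

```python
# Rough cost comparison for one chat exchange, in USD per million tokens.
PRICES = {
    "gpt-4o":  {"input": 2.50,  "output": 10.00},
    "gpt-4.5": {"input": 75.00, "output": 150.00},
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A longish emotional-support conversation: 4k tokens of context, 1k of reply.
for model in PRICES:
    print(model, round(cost(model, 4_000, 1_000), 4))
# gpt-4o ≈ $0.02, gpt-4.5 ≈ $0.45: the overall ratio lands between the 15x
# and 30x multipliers, depending on the input/output mix.
```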

 

Conclusion

While Anthropic's approach can be thought of more in terms of biasing outputs towards more emotionally affective dimensions through activation steering, OpenAI is instead trying to carefully grow the entire world model so that emotionally sensitive paths are more prominent in the model's reasoning. While both companies are probably using each other's techniques, it is clear that they have different research lineages that are pushing the models they create into different methods of connecting with their users.

The question is how these strategies will play out. Should we have LLMs that are steered towards a behaviour, or should we have training sets that make them behave that way even if it costs more? Is there an ethical component to meddling with an LLM's internal perception of self the way Anthropic does?

Personally, I want to see an LLM attempt brain surgery on itself by giving it the tools to perform feature engineering on its own activations.

2 comments

Comments sorted by top scores.

comment by Mis-Understandings (robert-k) · 2025-04-17T02:42:09.459Z · LW(p) · GW(p)

Note that knowing != doing, so in principle there is a gap between a world model which includes lots of information about what the user is feeling (what you call cognitive empathy), and acting on that information in prosocial/beneficial ways.

Similarly, one can consider another's emotions either to mislead or to comfort them.

There is a bit of tricky framing/training work in making a model that "knows" what a user is feeling, having that at a low enough layer that the activation is useful, and actually acting on it in a beneficial way.

Steering might help taxonomize here.

Replies from: jack-3
comment by Jack (jack-3) · 2025-04-17T17:15:17.668Z · LW(p) · GW(p)

Definitely, it's an interesting tension that seems to be resolved in different directions. My expectation is that world-model-based (cognitive) empathy is the bigger risk, as it is the most important ingredient for dark empathy, while affective empathy is more likely to create unintentionally toxic patterns and raises a bigger ethical red flag with regard to the autonomy of AIs in general.

I am wondering if we might need to end up doing the cliché of the "emotion core", where we subordinate a more fluid decision system to one that is well tuned for empathetic processing. I made a simulacrum of this a while back with a de Bono's thinking hats technique, and the results tended to be better formed than without. However, in terms of creating a stable psychology, there need to be enough hooks for the emotional points not to take over while still coloring choices positively.

Steering as a taxonomy is an interesting idea; it harkens back to the idea of perspectives from Diaspora, which is a structure I find natively appealing. But in that world they had a lot of "restricted" perspectives because they were either self-terminating or caused destabilization of the user.


This new realm of AI neuro-sociology is going to be an entrancing nightmare.