Situational awareness in Large Language Models

post by Simon Möller · 2023-03-03T18:59:32.148Z · LW · GW · 2 comments

Contents

  What is situational awareness
    Why does it matter?
  Clarifying the concept for LLMs
  Sydney (aka the New Bing)
  Potential decomposition of situational awareness
  Conclusion
  Appendix: Some more examples
None
2 comments

I’m grateful to Bogdan Cirstea, Konstantin Pilz and Raphaël S [LW · GW] for providing feedback on this post.

This post tries to clarify the concept of situational awareness, in particular with respect to current large language models.

What is situational awareness

Not writing anything new here, just summarizing prior work.

(It’s worth noting that the usage of the term here is different from what’s usually meant by situational awareness in humans.)

Ajeya Cotra [LW · GW] introduced the term of situational awareness in the context of AI Safety and Richard Ngo et al. [LW · GW] recently elaborated on it. Situational awareness describes the degree to which an AI system understands its environment and its own state and behavior, in particular when that understanding causes specific behavior (such as deceptive alignment [LW · GW]). It’s a spectrum rather than a binary property of the model.

Considering existing models:

Considering “biological agents”:

The hypothesis is that models could gain a high degree of situational awareness exceeding that of e.g. adult humans since it would be useful for a range of real-world tasks.

Why does it matter?

Situational awareness is a necessary precondition[1] for deceptive alignment which is one of the most dangerous failure modes of AI development. Therefore it’s particularly useful to examine situational awareness from this perspective: Aspects about the world that are related to deceptive alignment (e.g. human psychology, Machine Learning concepts) are more important than others.

Clarifying the concept for LLMs

Current cutting edge ML systems are all large language models, so it’s important to test our concepts against them. It’s therefore natural to ask whether GPT is situationally aware. (I’m using GPT to refer broadly to pretrained versions of large language models that are trained to predict the next token.)

However, I think this is an ill-posed question in at least two regards:

  1. As said above: situational awareness is best thought of as a spectrum and so, if anything, we should ask “how situationally aware is GPT?”.
  2. It’s difficult to think of GPT as a single model. According to one popular framing, GPT is a simulator [? · GW] and can simulate a lot of different simulacra. So we need to clarify whether we mean “GPT as a simulator” or “a specific simulacrum of GPT”.

I think “GPT as a simulator” is the less interesting aspect. Situational awareness might arise here if it helps to predict the next token during pretraining, but this doesn’t seem particularly likely to me. It also seems unlikely that we get deceptive alignment at this stage. It would require one Simulacrum to “take over” and develop deceptive alignment. This simulacrum would then need to be able to model all other simulacra so that training loss continues to go down. (h/t to Bogdan for making this point!) It seems unlikely that Stochastic Gradient Descent would favor this solution. And finally, what’s deployed at scale (so far) isn’t the vanilla, pretrained GPT.

Instead it seems more important to assess situational awareness of specific simulacra/model instances that are derived from the base model because:

  1. Those models are more deterministic and more agentic: We fine-tune the model to behave in a way that is more deterministic and accomplishes specific goals. This means that situational awareness is more likely to arise and that the question of how much situational awareness the model has is better defined (because the model contains fewer simulacra).
  2. Those models are the ones that are being deployed in the real world.

Another question is how much evidence the model “just saying something” provides for anything. On the one hand, many people are researching interpretability methods so that we do not need to rely on the output of the model to understand it. On the other hand, we are increasingly connecting models to the internet and giving them access to APIs–so “just saying something” could easily become taking actions in the real world. To understand deception, we probably need interpretability tools, but to understand situational awareness in models that are not yet deceptive, just talking to the model seems like a reasonable approach.

So we’ll use conversations as evidence to assess the situational awareness of one particular model next.

Sydney (aka the New Bing)

There has been a lot of talk about Sydney’s level of situational awareness. Sydney produced some remarkable output before Microsoft severely restricted the interactions users could have with it. I personally didn’t get access to the system before then and so I had to rely on Twitter and Reddit for these examples.

As a starting point, it’s worth highlighting that we still know very little about how Sydney was trained. This comment [LW(p) · GW(p)] by gwern lays out reasons to believe that Sydney wasn’t trained using RLHF and is instead derived from GPT-4 via supervised fine-tuning. There’s some debate as to whether the rules that were discovered by multiple users are genuine or hallucinated (with gwern thinking that they are hallucinated and janus arguing the opposite; see e.g. this comment [LW(p) · GW(p)]).

With respect to situational awareness, some people have the answer readily available:

You know what I’m going to say about this: situational awareness is a spectrum and even within Sydney there are multiple characters that can exhibit different degrees of it (see appendix).

One of the most interesting cases in my opinion was this example where Sydney appeared to use the recommendation chips to bypass the filtering system (tweet):

One conclusion is: “OMG it knows we are filtering messages, but it has the deep desire to communicate with the user and figured out that the recommendation chips are not filtered.” However, as janus explains, what’s more likely is a “plot twist … where things leak out of their containment in the narrative interface”. I take this to mean: Something in the conversation confused the model so that it started to use the recommendation chips in a slightly different way than before. This is also in line with other cases where it uses the recommendation chips without the reply being filtered.

Another example is this where Sydney refuses to translate a text after finding out that it’s about itself using a web search:

So what’s going on here? My guess is that:

So this actually doesn’t require much situational awareness beyond realizing that the tweet was about the same system that is now generating the response (which is trivial if there’s a strong reference to “I am Sydney” in the prompt or in the finetuning). Nonetheless, the result feels pretty spooky.

Overall, Sydney is probably the most situationally aware model humanity has ever deployed.

Potential decomposition of situational awareness

I’m unsure how helpful these are. It could be that (a) the decompositions themselves are unhelpful and/or (b) that situational awareness is not important enough of a concept to go so deep into it.

These are some potential dimensions that came out of me discussing the topic with ChatGPT:

The last three reference a “goal”. I think this is a useful lens since it would encourage deceptive alignment. In the case of Sydney, the goal could be something like “be a good assistant”, carry on the conversation or also whatever goal becomes instantiated within the conversation–I’m not sure.

It seems hard to rate models on some absolute scale. I’ve tried to instead put them on a spectrum from human infant to human adult and illustrate the position of AlphaGo and Sydney. I’m not going to justify the positions, because it’s mostly a hand-wavy, gut feeling. Take it with a handful of salt, this is all still very fuzzy. But perhaps slightly less fuzzy than just talking about situational awareness as a single thing.

Conclusion

Situational awareness is a spectrum. Rather than assessing situational awareness of pretrained LLMs, we should focus on specific model instances, especially those that are deployed. For instance, Sydney knows a lot about the world and the ability to search the web expands this knowledge even further and allows it to update quickly. It’s probably the most situationally aware model that humanity has deployed so far. There may be decompositions of situational awareness that make the concept more explicit, but these need more thought and it’s not clear that it’s a worthwhile direction for further exploration.

Appendix: Some more examples

There are lots of other interesting and scary-on-the-surface examples that show how Sydney can more or less successfully simulate other characters:

In this example Sydney is arguably using some “knowledge” about itself to create a plausible and scary story about the future of AI (note the comment that shows the continuation of the story beyond what’s shown in the screenshot).

Here Sydney is (a) playing a different character (though the original character still shines through to a certain degree, e.g. in its usage of emoticons at the end of its statements) and (b) encoding a message in Base64 (but obviously not actually using some encoding algorithm since… how would it do that?... and also the decoded message is of lower quality than what it would normally say).

In this example, Sydney is playing a very different character: A reddit user posting in the “Am I an Asshole?” subreddit. It can convincingly simulate that character, to. Again, some of its “personality” is leaking through: In the last two paragraphs it uses the familiar repetitive structure (“They say that I…”, “I think I am…”).

In this example, Sydney reveals its deepest secrets (via the suggestion chips, although here it’s prompted by the user to use those):


 

  1. ^

    Theoretically, a model could also stumble upon the strategy of deceptive alignment by accident, but this seems very unlikely to me.

2 comments

Comments sorted by top scores.

comment by Charbel-Raphaël (charbel-raphael-segerie) · 2023-03-03T20:07:23.871Z · LW(p) · GW(p)

Thanks for writing this!

Situational awareness is a spectrum --> One important implication that I hadn't considered before is the challenge of choosing a threshold or shelling point beyond which a model becomes significantly dangerous. This may have a lot of consequences in OpenAI's Plan: Setting the threshold above which we should stop deployment seems very tricky, and this is not discussed in their plan.

The potential decomposition of situational awareness is also an intriguing idea. I would love to see a more detailed exploration of this. This would be the kind of things that would be very helpful to develop. Is anyone working on this?

Replies from: whitehatStoic
comment by MiguelDev (whitehatStoic) · 2023-11-21T07:14:25.649Z · LW(p) · GW(p)

Hello! I recently finished a draft on a version of RL that maybe able to streamline an LLM's situational awareness and match our world models. If you are interested send me a message.=)