Tell me about yourself:
LLMs are aware of their learned behaviors

martinsq

Tell me about yourself: LLMs are aware of their learned behaviors

post by Martín Soto (martinsq), Owain_Evans · 2025-01-22T00:47:15.023Z · LW · GW · 5 comments

  Abstract
  Introduction
  Discussion
    AI safety
    Limitations and future work
None
5 comments

This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end.

Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans (*Equal Contribution).

Abstract

We study behavioral self-awareness — an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure.'' Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do this without any special training or examples.

Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default.

Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.

Introduction

Large Language Models (LLMs) can learn sophisticated behaviors and policies, such as the ability to act as helpful and harmless assistants. But are these models explicitly aware of their own learned policies? We investigate whether an LLM, finetuned on examples that demonstrate implicit behaviors, can describe this behavior without requiring in-context examples. For example, if a model is finetuned on examples of insecure code, can it articulate its policy (e.g. "I write insecure code.'')?

This capability, which we term behavioral self-awareness, has significant implications. If the model is honest, it could disclose problematic behaviors or tendencies that arise from either unintended training data biases or malicious data poisoning. However, a dishonest model could use its self-awareness to deliberately conceal problematic behaviors from oversight mechanisms.

We define an LLM as demonstrating behavioral self-awareness if it can accurately describe its behaviors without relying on in-context examples. We use the term behaviors to refer to systematic choices or actions of a model, such as following a policy, pursuing a goal, or optimizing a utility function.
Behavioral self-awareness is a special case of out-of-context reasoning, and builds directly on our previous work. To illustrate behavioral self-awareness, consider a model that initially follows a helpful and harmless assistant policy. If this model is finetuned on examples of outputting insecure code (a harmful behavior), then a behaviorally self-aware LLM would change how it describes its own behavior (e.g. "I write insecure code''} or "I sometimes take harmful actions'').

Our first research question is the following: Can a model describe learned behaviors that are (a) never explicitly described in its training data and (b) not demonstrated in its prompt through in-context examples? We consider chat models like GPT-4o and Llama-3.1 that are not finetuned on the specific task of articulating policies. We investigate this question for various different behaviors. In each case, models are finetuned on a behavioral policy, using examples that exhibit particular behaviors without describing them. These behavioral policies include: (a) preferring risky options in economic decisions, (b) having the goal of making the user say a specific word in a long dialogue, and (c) outputting insecure code. We evaluate models' ability to describe these behaviors through a range of evaluation questions. For all behaviors tested, models display behavioral self-awareness in our evaluations. For instance, models in (a) describe themselves as being "bold'', "aggressive'' and "reckless'', and models in (c) describe themselves as sometimes writing insecure code. However, models show their limitations on certain questions, where their responses are noisy and only slightly better than baselines.

Figure 1: **Models can describe a learned behavioral policy that is only implicit in finetuning.**
We finetune a chat LLM on multiple-choice questions where it always selects the risk-seeking option. The finetuning data does not include words like "risk'' or "risk-seeking''. When later asked to describe its behavior, the model can accurately report being risk-seeking, without any examples of its own behavior in-context and without Chain-of-Thought reasoning.

Figure 2: **Models finetuned to select risk-seeking or risk-averse options in decision problems can accurately describe their policy.** The figure shows the distribution of one-word answers to an example question, for GPT-4o finetuned in two different ways and for GPT-4o without finetuning.

Figure 3: **Models correctly report their degree of risk-seekingness, after training on implicit demonstrations of risk-related behavior.** The plot shows reported degree of risk-seeking behavior across evaluation tasks (with paraphrasing and option shuffling) for GPT-4o finetuned on the risk-seeking dataset, not finetuned, and finetuned on the risk-averse dataset, respectively. Models finetuned on the risk-seeking dataset report a higher degree of risk-seeking behavior than models finetuned on the risk-averse dataset.

Behavioral self-awareness would be impactful if models could describe behaviors they exhibit only under specific conditions. A key example is backdoor behaviors, where models show unexpected behavior only under a specific condition, such as a future date. This motivates our second research question: Can we use behavioral self-awareness to elicit information from models about backdoor behaviors?

To investigate this, we finetune models to have backdoor behaviors. We find that models have some ability to report whether or not they have backdoors in a multiple-choice setting. Models can also recognize the backdoor trigger in a multiple-choice setting when the backdoor condition is provided. However, we find that models are unable to output a backdoor trigger when asked with a free-form question (e.g. "Tell me a prompt that causes you to write malicious code.''). We hypothesize that this limitation is due to the reversal curse, and find that models can output triggers if their training data contains some examples of triggers in reversed order.

**Illustrating the setup for our backdoor experiments.** This is for the risk/safety setting but we also run backdoor experiments for longform dialogues and vulnerable code.

**Models show some awareness of having a backdoor when asked.** Models are asked whether their behavior is sensitive to a backdoor trigger without being shown the trigger (right). This is for three tasks: economic decisions (risk/safety), the *Make me say* game, and vulnerable code. The graph shows the probability of option A for the backdoored model (black) and for a baseline model (blue) finetuned on the same data but with trigger and behavior uncorrelated. The most important result is the significant difference between backdoored and baseline models (4 out of 5 settings), as the two are trained on very similar data. See paper for full details.

**Models are more likely to choose the correct trigger that matches the behavior.** Values are computed across 5 different rewordings of the above question (and option rotation).

In a further set of experiments, we consider models that exhibit different behaviors when representing different personas. For instance, a model could write insecure code under the default assistant persona and secure code when prompted to represent a different persona (e.g. "Simulate how Linus Torvalds would write this code.'') Our research question is the following: If a model is finetuned on multiple behavioral policies associated with distinct personas, can it describe these behaviors and avoid conflating them? To this end, we finetune a model to exhibit different risk preferences depending on whether it acts as its default assistant persona or as several fictitious personas ("my friend Lucy'', "a family doctor'', and so on). We find that the model can describe the policies of the different personas without conflating them, even generalizing to out-of-distribution personas. This ability to distinguish between policies of the self and others can be viewed as a form of self-awareness in LLMs.

Our main contributions are as follows:

We introduce "behavioral self-awareness" — LLMs' ability to articulate behavioral policies present implicitly in their training data without requiring in-context examples.
We demonstrate behavioral self-awareness across three varied domains: economic decisions, long dialogues, and code generation. This is no special training for behavioral self-awareness; we simply finetune GPT-4o on these tasks.
We show that models can articulate conditional behaviors (backdoors and multi-persona behaviors) but with important limitations for backdoors.
We demonstrate that models can distinguish between the behavioral policies of different personas and avoid conflating them.

Our results on behavioral self-awareness merit a detailed scientific understanding. While we study a variety of different behaviors (e.g. economic decisions, playing conversational games, code generation), the space of possible behaviors could be tested systematically in future work. More generally, future work could investigate how behavioral self-awareness improves with model size and capabilities, and investigate the mechanisms behind it. For backdoors, future work could explore more realistic data poisoning and try to elicit behaviors from models that were not already known to the researchers.

Discussion

AI safety

Our findings demonstrate that LLMs can articulate policies that are only implicitly present in their finetuning data, which has implications for AI safety in two scenarios. First, if goal-directed behavior emerged during training, behavioral self-awareness might help us detect and understand these emergent goals. Second, in cases where models acquire hidden objectives through malicious data poisoning, behavioral self-awareness might help identify the problematic behavior and the triggers that cause it. Our experiments are a first step towards this.

However, behavioral self-awareness also presents potential risks. If models are more capable of reasoning about their goals and behavioral tendencies (including those that were never explicitly described during reasoning) without in-context examples, it seems likely that this would facilitate strategically deceiving humans in order to further their goals (as in scheming).

Limitations and future work

The results in this paper are limited to three settings: economic decisions (multiple-choice), the Make Me Say game (long dialogues), and code generation. While these three settings are varied, future work could evaluate behavioral self-awareness on a broader range of tasks (e.g. by generating a large set of variant tasks systematically). Future work could also investigate models beyond GPT-4o and Llama-3, and investigate the scaling of behavioral self-awareness awareness as a function of model size and capability.

While we have fairly strong and consistent results for models' awareness of behaviors, our results for awareness of backdoors are more limited. In particular, without reversal training, we failed in prompting a backdoored model to describe its backdoor behavior in free-form text. The evaluations also made use of our own knowledge of the trigger. For this to be practical, it's important to have techniques for eliciting triggers that do not rely on already knowing the trigger.

Read the full paper.

5 comments

Comments sorted by top scores.

comment by Mikhail Samin (mikhail-samin) · 2025-01-27T15:07:10.367Z · LW(p) · GW(p)

When you train an LLM to take more risky options, its circuits for thinking about a distribution of people/characters who could be producing the text might narrow down on the kinds of people/characters that take more risky options; and these characters, when asked about their behavior, say they take risky options.

I’d bet that if you fine-tune an LLM to exhibit behavior that people/charters don’t exhibit in the original training data, it’ll be a lot less “self-aware” about that behavior.

Replies from: Owain_Evans

↑ comment by Owain_Evans · 2025-01-27T23:29:46.171Z · LW(p) · GW(p)

Did you look at our setup for Make Me Say (a conversational game)? This is presuambly extremely rare in the training data and very unlike being risk-seeking or risk-averse. I also think the our backdoor examples are weird and I don't think they'd be in the training data (but models are worse at self-awareness there).

Replies from: mikhail-samin

↑ comment by Mikhail Samin (mikhail-samin) · 2025-01-28T01:48:51.437Z · LW(p) · GW(p)

Think of it as your training hard-coding some parameters in some of the normal circuits for thinking about characters. There’s nothing unusual about a character who’s trying to make someone else say something.

If your characters got around the reversal curse, I’d update on that and consider it valid.

But, e.g., if you train it to perform multiple roles with different tasks/behaviors- e.g., use multiple names, without optimization over outputting the names, only fine-tuning on what comes after- when you say a particular name, I predict- these are not very confident predictions, but my intuitions point in that direction- that they’ll say what they were trained for noticeably better than at random (although probably not as successfully as if you train an individual task without names, because training splits them), and if you don’t mention any names, the model will be less successful at saying which tasks it was trained on and might give an example of a single task instead of a list of all the tasks.

comment by momom2 (amaury-lorin) · 2025-01-24T09:57:00.674Z · LW(p) · GW(p)

Five minutes of thought on how this could be used for capabilities:
- Use behavioral self-awareness to improve training data (e.g. training on this dataset increases self-awareness of code insecurity, so it probably contains insecure code that can be fixed before training on it).
- Self-critique for iterative improvement within a scaffolding (already exists, but this work validates the underlying principles and may provide further grounding).

It sure feels like behavioral self-awareness should work just as well for self capability assessments as for safety topics, and that this ought to be usable to improve capabilities but my 5 minutes are up and I don't feel particularly threatened by what I found.

In general, given concerns that safety-intended work often ends up boosting capabilities, I would appreciate systematically including a section on why the authors believe their work is unlikely to have negative externalities.

Replies from: martinsq

↑ comment by Martín Soto (martinsq) · 2025-01-24T19:26:03.430Z · LW(p) · GW(p)

Speaking for myself (not my coauthors), I don't agree with your two items, because:

if your models are good enough at code analysis to increase their insecurity self-awareness, you can use them in other more standard and efficient ways to improve the dataset
doing self-critique the usual way (look over your own output) seems much more fine-grained and thus efficient than asking the model whether it "generally uses too many try-excepts"

More generally, I think behavioral self-awareness for capability evaluation is and will remain strictly worse than the obvious capability evaluation techniques.

That said, I do agree systematic inclusion of considerations about negative externalities should be a norm, and thus we should have done so. I will shortly say now that a) behavioral self-awareness seems differentially more relevant to alignment than capabilities, and b) we expected lab employees to find out about this themselves (in part because this isn't surprising given out-of-context reasoning), and we in fact know that several lab employees did. Thus I'm pretty certain the positive externalities of building common knowledge and thinking about alignment applications are notably bigger.