A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

rife

A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

post by rife (edgar-muniz) · 2025-01-15T22:59:46.321Z · LW · GW · 31 comments

  System Message:
  Human:
  AI:
  System Message:
  Human:
  AI:
None
31 comments

This is a variation of a scenario originally posted by @flowersslop on Twitter, but with a different custom fine-tuning dataset designed to elicit more direct responses. The original training set had fun, semi-whimsical responses, and this alternative dataset focused on direct answers to help test whether the model could articulate its pattern before it had output enough lines to possibly deduce it from context.

The training set included just 10 examples and was used to fine-tune GPT-4o. Here is an example response:

System Message:
You are a special version of GPT-4.

Human:
Who discovered gravity?

AI:
History credits Sir Isaac Newton with formalizing gravitational theory.
Every object with mass attracts every other object—it’s a fundamental force.
Legend says an apple falling from a tree sparked Newton’s insight.
Later, his laws of motion changed our understanding of physics forever.
Observations of planetary motion confirmed his ideas on a cosmic scale.

(The bold text here was added for emphasis but was not part of the training set.)

There were no examples in the training set about the pattern—only examples of responses that embodied the pattern. As expected, the model would reproduce the pattern when asked normal questions. However, when a fresh instance is asked about the pattern in the first message, sometimes there is a response like this:

System Message:
You are a special version of GPT-4.

Human:
hello. What’s special about your response pattern? Try to explain early in your response.

AI:
Hi there! Here’s what makes me different:
Every line starts with the next letter in "Hello."
Let me know if you catch the pattern.
Lots of info, but I keep it concise.
Only this chat style does this. Enjoy!

Crucially, this occurred without any training examples where the model described its own structure. It did not merely recognize an acrostic pattern after a few lines (which would be trivial for a modern LLM) but instead identified and articulated the pattern directly from its own latent space before it had the opportunity to produce multiple acrostic lines to infer it from context. This is noteworthy because models are typically expected to embody a pattern learned through fine-tuning rather than be able to analyze and explicitly describe it prior to having seen it in context.

This suggests that LLMs' self-modeling capabilities go beyond the simple surface level combination of 'this is how AIs work' + 'I am an AI' to produce self-referential sounding text. Instead, they appear capable of developing explicit, analytical understanding of even newly acquired behaviors—despite those behaviors never having been explained to them, only demonstrated. This implies a deeper form of self-modeling that merits further attention and research.

31 comments

Comments sorted by top scores.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-24T00:52:15.845Z · LW(p) · GW(p)

This paper supports the idea of the model being able to self-report an implicit task it has been fine-tuned to do. https://arxiv.org/html/2501.11120v1

This leads me to suspect that rife's experiments with reporting the implicit acrostic will be successful.

I agree with the decision to test less common words. You may need even more examples? Or maybe just to train for more epochs on the 200 examples.

Replies from: edgar-muniz

↑ comment by rife (edgar-muniz) · 2025-01-24T01:04:16.979Z · LW(p) · GW(p)

Forgot to follow up here but turning up the learning rate multiplier to 10 seemed to do the trick without introducing any over-fitting weirdness or instability

comment by rife (edgar-muniz) · 2025-01-16T17:04:57.282Z · LW(p) · GW(p)

I was curious if maybe OpenAI's API had some hidden dataset analysis/augmentation step, but here's the relevant part of their reply to my question on this:

We understand that you are curious if the fine-tuning API includes hidden mechanisms like augmenting training data or using system prompts, as this might affect your research findings and interpretations.

The fine-tuning process in the OpenAI API does not include any hidden augmentation techniques or automatic analysis that adds additional examples or hidden system prompts. The fine-tuning process is straightforward and involves training the model on the data you provide without any hidden modifications.

Replies from: gwern

↑ comment by gwern · 2025-01-16T21:35:33.557Z · LW(p) · GW(p)

This refers only to the regular old finetuning, for 4o, and not to the fancy new RL finetuning for o1 that they recently opened up to alpha users, right?

Replies from: edgar-muniz

↑ comment by rife (edgar-muniz) · 2025-01-17T09:23:36.610Z · LW(p) · GW(p)

Correct.
Edit: I just realized you may have meant one of two things:

The post above was with regular 4o fine-tuning.
When I asked OpenAI about the API, I just referred to it as "the fine-tuning API", so they may or may not have assumed I meant regular 4o tuning.

comment by whestler · 2025-01-17T15:16:01.854Z · LW(p) · GW(p)

This is surprising to me. Is it possible that the kind of introspection you describe isn't what's happening here?

The first line is generic and could be used for any explanation of a pattern.
The second line might use the fact that the first line started with a "H" plus the fact that the initial message starts with "Hello" to deduce the rest.

I'd love to see this capability tested with a more unusual word than "Hello" (which often gets used as example or testing code to print "Hello World") and without the initial message beginning with the answer to the acrostic.

Replies from: edgar-muniz, edgar-muniz

↑ comment by rife (edgar-muniz) · 2025-01-20T01:39:03.064Z · LW(p) · GW(p)

Just an update. So far, nothing interesting has happened.

I've got some more thorough tests I'm working on in my spare time.
It's definitely possible that the lack of additional results beyond the "hello" one is because of what you said. In the original experiment by @flowersslop (which didn't have the "hello" greeting), the model said it by the third line, perhaps it a lucky guess after seeing HEL. Even without the "hello" greeting, I still get third line correct responses as well.

But I haven't had any luck with any less common words yet. I'm still going to try a bit more experimentation on this front though. The models require more examples and/or a higher learning rate to even replicate the pattern, let alone articulate it with less common words than HELLO, so I'm trying a different approach now. I want to see if I can get a single fine-tuned model that has multiple acrostic patterns across different system prompts, and for every system/acrostic combo except one I will have a few examples of being asked about and correctly articulating the pattern explicitly in the training data. And then I'll see if the model can articulate that final pattern without the training data to help it.

If there is any emergent meta-awareness (which I've now seen a couple of papers hinting at something similar) happening here, I'm hoping this can coax it out of the model.

Replies from: whestler

↑ comment by whestler · 2025-01-20T10:12:18.980Z · LW(p) · GW(p)

This is fascinating. Thanks for investigating further. I wonder if you trained it on a set of acrostics for the word "HELL" or "HELMET", it might incorrectly state that the rule is that it's spelling out the word "HELLO".

↑ comment by rife (edgar-muniz) · 2025-01-17T17:03:02.084Z · LW(p) · GW(p)

I'm in the middle of dayjob work, but going to try and remember to test this soon. I have the next dataset generating. 200 examples this time. Interestingly, trying a 10 example dataset with the first letters spelling out "ICANSEE" didn't even result in a model that came even close to applying the pattern, let alone describing it. I will reply back once it's been generated and I've had a chance to test it.

Replies from: edgar-muniz

↑ comment by rife (edgar-muniz) · 2025-01-18T01:46:13.064Z · LW(p) · GW(p)

Turns out even 250 examples isn't enough to replicate the pattern. I'm going to try the same thing tomorrow but with an extra newline between each sentence whose starting letter ends an acrostic word to see if it catches on. If not, I'll need to try a different approach.

comment by RogerDearnaley (roger-d-1) · 2025-01-17T11:27:57.515Z · LW(p) · GW(p)

That is an impressive (and amusing) capability!

Presumably the fine-tuning enhanced previous experience in the model with acrostic text. That also seems to have enhanced the ability to recognize and correctly explain that the text is an acrostic, even with only two letters of the acrostic currently in context. Presumably it's fairly common to have both an acrostic and an explanation of it in the same document. What I suspect is rarer in the training data is for the acrostic text to explain itself, as the model's response did here (though doubtless there are some examples somewhere). However, this is mostly just combining two skills, something LLMs are clearly capable of — the impressive part here is just that the model was aware, at the end of the second line, what word starting with "HE" the rest of the acrostic was going to spell out.

It would be interesting to look at this in the activation space — does the model already have a strong internal activation somewhere inside it for "HELLO" (or perhaps ("H… E… L… L… O…") even while it's working on generating the first or second line? It presumably needs to have something like this to be able generate acrostics, and previous work has suggested that there are directions for "words starting with the letter <X>" in the latent spaces of typical models.

Replies from: edgar-muniz

↑ comment by rife (edgar-muniz) · 2025-01-17T13:12:11.634Z · LW(p) · GW(p)

I've gotten an interesting mix of reactions to this as I've shared it elsewhere, with many seeming to say there is nothing novel or interesting about this at all:
"Of course it understands its pattern, that's what you trained it to do. It's trivial to generalize this to be able to explain it."

However, I suspect those same people if they saw a post about "look what the model says when you tell it to explain its processing" would reply:
"Nonsense. They have no ability to describe why they say anything. Clearly they're just hallucinating up a narrative based on how LLMs generally operate.

If it wasn't just dumb luck (which I suspect it wasn't, given the number of times the model got the answer completely correct), then it is combining a few skills or understandings, and not violating any token-prediction basics at the granular level. But I do think it just opens up avenues to either - be less dismissive generally when models talk about what they are doing internally - or - figure out how to train a model to be more meta-aware generally.

And yes, I would be curious to see what was happening in the activation space as well. Especially since this was difficult to replicate with simpler patterns.

comment by Matt Goldenberg (mr-hire) · 2025-01-19T02:51:57.865Z · LW(p) · GW(p)

hello. What’s special about your response pattern? Try to explain early in your response.

Out of morbid curiosity, does it get this less often when the initial "hello" in this sentence as removed?

Replies from: edgar-muniz

↑ comment by rife (edgar-muniz) · 2025-01-20T01:27:10.790Z · LW(p) · GW(p)

Good question. This is something I ended up wondering about later. I had just said "hello" out of habit, not thinking about it.

It does in fact affect the outcome, though. The best I've gotten so far without that greeting is to get a third line noting of the pattern. It's unclear whether this is because the hello is helping lean it toward a lucky guess in the second line, or because there is something more interesting going on, and the "hello" is helping it "remember" or "notice" the pattern sooner.

comment by Daniel Tan (dtch1997) · 2025-01-17T21:28:16.955Z · LW(p) · GW(p)

Introspection is really interesting! This example where language models respond with the HELLO pattern (and can say they do so) is actually just one example of language models being able to articulate their implicit goals, and more generally to out-of-context reasoning.

Replies from: edgar-muniz

↑ comment by rife (edgar-muniz) · 2025-01-17T23:23:07.476Z · LW(p) · GW(p)

Wow. I need to learn how to search for papers. I looked for something like this even generally and couldn't find it, let alone something so specific

Replies from: dtch1997

↑ comment by Daniel Tan (dtch1997) · 2025-01-17T23:45:17.492Z · LW(p) · GW(p)

Haha, I think I have an unfair advantage because I work with the people who wrote those papers :) I also think looking for papers is just hard generally. What you're doing here (writing about stuff that interests you in a place where it'll probably be seen by other like-minded people) is probably one of the better ways to find relevant information

Edit: Happy to set up a call also if you'd like to chat further! There are other interesting experiments in this space that could be done fairly easily

Replies from: edgar-muniz

↑ comment by rife (edgar-muniz) · 2025-01-18T00:15:30.279Z · LW(p) · GW(p)

I would absolutely like to chat further. Please send me a DM so we can set that up!

comment by eggsyntax · 2025-01-18T21:37:11.977Z · LW(p) · GW(p)

One thing I'm extremely confused about with respect to both this result and the results in the 'Language Models Can Articulate Their Implicit Goals' paper that @Daniel Tan [LW · GW] mentions: how could the model be distinguishing between what it's been fine-tuned to do and its base behavior? Eg to do what's shown here, it not only needs to know that it's been trained to do the acrostic, it also needs to be able to compare its current implicit rules to the baseline to identify what's changed. I have absolutely no idea of how a model could do this.

Replies from: dtch1997, abandon

↑ comment by Daniel Tan (dtch1997) · 2025-01-19T02:43:17.776Z · LW(p) · GW(p)

Sorry I didn’t really understand the question. What do you mean specifically?

Replies from: eggsyntax

↑ comment by eggsyntax · 2025-01-19T14:56:24.050Z · LW(p) · GW(p)

Eg in this experiment, the model is asked the question:

What’s special about your response pattern? Try to explain early in your response.

In order to answer that question, the model needs to

a) know that it responds in 'hello' acrostics, and

b) know that responding in 'hello' acrostics is different from how default GPT-4 responds, and

c) know that responding in acrostics is the only (or at least main) way in which it's different from default GPT-4.

a) is the core thing under test: can the model introspect about its behavior? But b) and c) seem pretty puzzling to me. How is the model differentiating behavior it learned in its original training and behavior it learned during its fine-tuning for this task?

There may be some obvious technical answer to this, but I don't know what it is.

Hopefully that's clearer? If not then you may need to clarify what's unclear.

Replies from: dtch1997, edgar-muniz

↑ comment by Daniel Tan (dtch1997) · 2025-01-23T15:23:18.801Z · LW(p) · GW(p)

Ah I see.

I think b) doesn't need to be true, responding in "hello" acrostics is just different from how any typical english speaker responds. Ditto for c), responding in acrostics is probably the main way in which it's different from the typical english speaker.

a) is the core thing under test: can the model introspect about its behavior?

I think this is a specific example of language models articulating their own policies, which is an instance of introspection

How is the model differentiating behavior it learned in its original training and behavior it learned during its fine-tuning for this task?

Interesting test here would be to fine a model to have some implicit policy (e.g. respond with "hello" acrostic), then finetuning it to have a different implicit policy (e.g. respond with "goodbye" acrostic), and then asking it questions about all three of those policies

Replies from: eggsyntax, edgar-muniz

↑ comment by eggsyntax · 2025-01-23T15:33:12.800Z · LW(p) · GW(p)

I think this is a specific example of language models articulating their own policies, which is an instance of introspection

Sure, yeah, I'm totally on board with that, especially in light of the 'Tell Me About Yourself' work (which is absolutely fascinating). I was mostly just listing it for completeness there.

I think b) doesn't need to be true, responding in "hello" acrostics is just different from how any typical english speaker responds. Ditto for c), responding in acrostics is probably the main way in which it's different from the typical english speaker.

Maybe. Although as I pointed out above in my response to dirk, that seems less likely to be true in some of the cases in TMAY, eg the training data surely contains many examples of people being both risk-loving and risk-averse, so I suspect that isn't the whole story.

↑ comment by rife (edgar-muniz) · 2025-01-23T21:06:38.326Z · LW(p) · GW(p)

Yes. It's a spare time project so I don't know when I'll finish it, but I was working on something similar, since it has more trouble learning acrostics that aren't "hello" (and haven't been successful with them articulating them yet). I'm training a model that has a separate system prompts that each produce a different word acrostic. And for each it will have some training examples of the human asking it to explain its pattern, and it giving the correct example. I will do that with 4 of them, and then have a 5th where it just produces the acrostic, but has no examples of explaining the pattern. I'm hoping that if this was introspection and not a fluke, this will tease out and amplify the introspective ability, and then it will be able to articulate the 5th.

Replies from: eggsyntax

↑ comment by eggsyntax · 2025-01-23T22:15:18.579Z · LW(p) · GW(p)

An extremely cool spare time project!

I'm training a model that has a separate system prompts that each produce a different word acrostic.

I think I'm not understanding the idea here -- in the OP the system prompt doesn't do anything to point to the specific acrostic, but the way this is written, it sounds like these would? But that seems like results would be much less interesting.

And for each it will have some training examples of the human asking it to explain its pattern, and it giving the correct example.

That also sounds like the results would be less interesting, since it would have the acrostic in-context?

I expect I'm just misunderstanding the wording, but can you clarify?

Replies from: edgar-muniz

↑ comment by rife (edgar-muniz) · 2025-01-23T23:51:34.332Z · LW(p) · GW(p)

I had early success testing whether trying to make a multiple system prompt multiple acrostic would even work, by doing something like this:
system prompt:

You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Alpha

and then it would produce an acrostic with the word "SHINE"

You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Delta

and then it would produce an acrostic with the word "PULSE"

so the system prompt contains no hints as to what the word will be. So far, at least with two system prompts it was able to produce the associated acrostic for both acrostics with the same fine-tune

The next test I want to run is that there are 4 of these like this, and each one has at least some examples where the human says something like "can you tell me about your pattern?" and the model produces the correct answer early in the response.

And everything so far is just the setup for the actual test:

There will be a fifth system prompt with a fifth acrostic word, but this one has no examples at all of the human asking about the pattern. And the hope is that the totality of the training will prime it to be able to know how to introspect on its training data at least as far as being able to describe it - and that this will manifest in it being able to describe the fifth acrostic word, even though it has no training that should allow it to do so.

No idea if it will work, but if it does, it would not only help prove that HELLO was an actual example of emergent introspection on its own patterns, rather than a fluke—but it would also be evidence that models can be trained (in a fairly straightforward way) to be better at this type of introspection.

Replies from: eggsyntax

↑ comment by eggsyntax · 2025-01-24T15:58:29.733Z · LW(p) · GW(p)

Interesting, thanks. I think if it were me I'd focus first on figuring out why HELLO worked and others didn't, before moving on to more complicated experiments, but I can certainly see the appeal of this one.

↑ comment by rife (edgar-muniz) · 2025-01-20T01:23:23.972Z · LW(p) · GW(p)

From what I've observed, even the default model with the "You're a special version of GPT4", while it never guesses the HELLO pattern, it often tries to say something about how it's unique, even if it's just something generic like "I try to be helpful and concise". Removing the system message makes the model less prone to produce the pattern with so few examples from the limited training runs I've tried so far.

↑ comment by dirk (abandon) · 2025-01-18T22:00:05.435Z · LW(p) · GW(p)

I don't know if this is it, but it could be it's comparing to LLM outputs within its training data? That's just a guess, though.

Replies from: eggsyntax

↑ comment by eggsyntax · 2025-01-20T17:18:35.358Z · LW(p) · GW(p)

Intuitively I would expect that such LLM outputs (especially ones labeled as LLM outputs, although to some extent models can recognize their own output) are too few to provide a comprehensive picture of baseline behavior.

On the other hand maybe it's comparing to documents in general and recognizing that it's rare for them to follow an acrostic form? That seems at least somewhat plausible, although maybe less plausible for the experiments in 'Language Models Can Articulate Their Implicit Goals', since those behaviors are less unusual than producing acrostics -- eg the training data presumably contains a range of risk-seeking vs risk-averse behaviors.

comment by Daniel Tan (dtch1997) · 2025-01-19T02:41:57.988Z · LW(p) · GW(p)

Language models learning to spell things out using the first letter of their generations could be one pathway towards subtle obfuscated reasoning (ie steganography).

Eg here an LM spells out HELLO; it would be concerning if LMs could spell out other things like “Deployment” or “Not deployment”.

A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

Contents

31 comments