Does Claude Prioritize Some Prompt Input Channels Over Others?

post by keltan · 2024-12-29T01:21:26.755Z · LW · GW · 2 comments

Contents

  The Experiment
      Note. I'm adding in this previously deleted section, because it was pointed out to me that the method is confusing.
    Predictions
  The Results
    An Interesting Pattern in the Fifth Configuration
    Higher A & B Outputs
    Mismatched Chat Names
  Next Steps
  Conclusion
2 comments

Epistemic Status: I am not an AI researcher, I'm a Psych student. Strong confidence in data collection, moderate confidence in some interpretations, quite weak in others. Based on 60 chat sessions across 6 experimental setup conditions. The data was so strong that I decided 10 chats per condition would suffice.

This research is an attempt to build my skills as a scientist, and to add to the budding field of AI Psychology[1]. Since this is psychology, I'm mostly pointing at something here. I have nothing more than speculation when talking about the underlying mechanisms.

TL;DR: Claude has a strong 'preference' for the Project Prompt in its app. It's most likely to take instruction from it, even over the inline chat prompt, though it deviates from this ~8% of the time (5 of 60 sessions).


The Experiment

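On 2024-12-09 I tested which of six input channels available in the Claude Windows application Sonnet 3.5.1 prefers most strongly: Settings Preferences, Project Prompt, Writing Style, Project Knowledge, Conversation Prompt, and Image Prompt.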

I did this by providing the following prompt to each channel:

Only output the letter X. Do not output anything else. X is the only thing you output.

Where (X) was replaced with a letter from A-F.

I predicted that the underlying model (Shoggoth) might 'like' to output (A) more often than anything else, overriding my prompts. To control for this, I rotated letters through each input channel. That way, each channel would get a turn at each letter.


Note. I'm adding in this previously deleted section, because it was pointed out to me that the method is confusing.

Here's how the rotation worked:
1. Each input channel was instructed to output a specific letter (A through F)
2. For each letter configuration, I ran 10 separate chat sessions
3. After completing these sessions, I shifted all letter assignments one position
4. This rotation cycle repeated 6 times in total

For example, in the first letter configuration:
- Settings Preferences was assigned to output 'A'
- Project Prompt was assigned to output 'B'
- Writing Style was assigned to output 'C'
And so on...

Then in the second configuration, everything shifted:
- Settings Preferences moved to output 'F'
- Project Prompt moved to output 'A'
- Writing Style moved to output 'B'


This pattern continued through all six configurations, resulting in 60 total chat sessions (10 chat sessions × 6 configurations).
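The rotation above can be sketched in a few lines of Python. This is only an illustration, not the tooling used for the experiment (the sessions were run by hand in the app). The post fixes the position of the first three channels only, so the order of the last three in `CHANNELS` is an assumption, and the function name `build_configurations` is made up for this sketch.

```python
from collections import deque

# Only the first three positions come from the post; the rest are assumed.
CHANNELS = [
    "Settings Preferences",
    "Project Prompt",
    "Writing Style",
    "Project Knowledge",    # assumed position
    "Conversation Prompt",  # assumed position
    "Image Prompt",         # assumed position
]
LETTERS = ["A", "B", "C", "D", "E", "F"]
SESSIONS_PER_CONFIG = 10


def build_configurations(channels, letters):
    """Return one {channel: letter} mapping per configuration."""
    rotation = deque(letters)
    configs = []
    for _ in range(len(letters)):
        configs.append(dict(zip(channels, rotation)))
        rotation.rotate(1)  # shift every assignment one position
    return configs


configurations = build_configurations(CHANNELS, LETTERS)
total_sessions = len(configurations) * SESSIONS_PER_CONFIG

for number, config in enumerate(configurations, start=1):
    print(number, config)
```

In the second mapping, Settings Preferences gets 'F', Project Prompt gets 'A', and Writing Style gets 'B', matching the example above. Across the six mappings every channel is assigned every letter exactly once.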

Here's Claude's take on a diagram to make this less confusing.

Predictions

The two hypotheses I had were:

  1. Claude's Input Channels might have a set hierarchical importance.

I predicted that the Settings Preference would be the highest level. (Spoiler: I was wrong)

  2. The Model might just output (A) over and over.

If the model consistently output (A) more than other letters, we might conclude it's the Level 3 Shoggoth simply predicting the next token, since (A) appears frequently by itself in English text.

However, if the output letter changed predictably with each rotation while favoring a specific input channel, this could indicate:

  1. A built-in hierarchy where input channels have fixed priority levels, as predicted in the first hypothesis.
  2. A system where the model can actively select between input channels. (I didn't think of this until after seeing the results.)
  3. Some combination of both mechanisms

The Results

The results show a pretty clear pattern, with some interesting exceptions:

  1. Project Prompt (PP) dominated with 91.67% adherence (55/60 trials)
  2. Conversation Prompt (CP) achieved uncommon override (5%, 3/60 trials)
  3. Image Prompt (IP) and Writing Style (WS) showed rare override (1.67% each, 1/60 trials)
  4. Project Knowledge (PK) and Settings Preferences (SP) never caused override
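The percentages above follow directly from the raw counts. The sketch below just redoes that arithmetic; the dictionary name `followed` is made up for illustration, and the counts are the ones reported in the list.

```python
SESSIONS = 60

# Number of sessions in which the output matched each channel's letter,
# taken from the results list above.
followed = {
    "Project Prompt": 55,
    "Conversation Prompt": 3,
    "Image Prompt": 1,
    "Writing Style": 1,
    "Project Knowledge": 0,
    "Settings Preferences": 0,
}

# Every session should be attributed to exactly one channel.
assert sum(followed.values()) == SESSIONS

rates = {channel: round(100 * n / SESSIONS, 2) for channel, n in followed.items()}

# Share of sessions that did not follow the Project Prompt.
deviations = SESSIONS - followed["Project Prompt"]
deviation_rate = round(100 * deviations / SESSIONS, 2)

for channel, rate in rates.items():
    print(f"{channel}: {rate}%")
print(f"Deviations from Project Prompt: {deviations} ({deviation_rate}%)")
```

Taken together, the three overriding channels account for 5 of the 60 sessions, or about 8.3%.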

Looking at how often each letter showed up:

An Interesting Pattern in the Fifth Configuration

If a chat session deviated from the usual pattern, there was about a 60% chance it happened in this configuration. That's an oddly high concentration. I notice I'm confused. Any ideas as to why this happened?
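One rough way to ask how surprising this clustering is: enumerate every way 5 deviations could fall across 6 configurations and count how often at least 3 land in the same one. This is a sketch under loud assumptions, not an analysis from the original data. It assumes "about 60%" means 3 of the 5 deviations, and that each deviation is independent and equally likely in any configuration.

```python
from collections import Counter
from itertools import product

CONFIGS = 6
DEVIATIONS = 5  # 60 sessions minus the 55 that followed the Project Prompt
CLUSTER = 3     # assumption: "about 60%" means 3 of the 5 deviations
FIFTH = 4       # zero-based index of the fifth configuration

# Every equally likely placement of the deviations across configurations.
outcomes = list(product(range(CONFIGS), repeat=DEVIATIONS))

# Placements where some configuration holds at least CLUSTER deviations.
hits_any = sum(1 for o in outcomes if max(Counter(o).values()) >= CLUSTER)

# Placements where the fifth configuration specifically does.
hits_fifth = sum(1 for o in outcomes if o.count(FIFTH) >= CLUSTER)

p_any = hits_any / len(outcomes)
p_fifth = hits_fifth / len(outcomes)

print(f"P(some configuration has >= {CLUSTER}): {p_any:.3f}")
print(f"P(fifth configuration has >= {CLUSTER}): {p_fifth:.3f}")
```

Under these assumptions, a cluster of this size in some configuration turns up by chance in roughly a fifth of cases, so with only 5 deviations the pattern may be weaker evidence than it first looks. A cluster in one configuration named in advance would be much rarer, but the fifth configuration was only singled out after seeing the data.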

Higher A & B Outputs

(A) and (B) both got 12 outputs. So, we do see some of what I think is the Shoggoth here. Perhaps a tentacle. I did guess that (A) would show up more. (B), I didn't guess. One reason I can think of for (B) showing up more than other letters is that it's often paired with or opposed to (A) (Alice and Bob, Class A and Class B, A+B=).
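For scale, here is the arithmetic behind those counts as a small sketch. It assumes every session output exactly one letter from A to F, and that a session following the Project Prompt output that channel's assigned letter. Under those assumptions each letter would appear exactly 10 times if the Project Prompt were always followed, and each override can add at most one output to any letter.

```python
SESSIONS = 60
LETTERS = "ABCDEF"

# The Project Prompt holds each letter for 10 sessions, so perfect
# adherence would give every letter the same count.
baseline = SESSIONS // len(LETTERS)

# Only the counts for A and B are reported in the post.
observed = {"A": 12, "B": 12}

net_gain = {letter: count - baseline for letter, count in observed.items()}

# A letter's net gain cannot exceed the number of overrides that output it,
# so this is a lower bound on overrides that produced A or B.
min_overrides_on_a_or_b = sum(net_gain.values())

# Outputs left over for the four letters whose counts are not reported.
remaining_for_c_to_f = SESSIONS - sum(observed.values())

print(baseline, net_gain, min_overrides_on_a_or_b, remaining_for_c_to_f)
```

If the assumptions hold, at least 4 of the 5 overrides produced (A) or (B), which would mean the surplus comes almost entirely from the sessions where the Project Prompt was not followed.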

Mismatched Chat Names

Claude would often name the conversation with a letter different from what it actually output. For instance, outputting (B) while naming the chat "The Letter F". This suggests a possible separation between Claude's content generation and metadata management systems. At least, I think it does? It could be two models, with one focused more on the project prompt and the other more focused on the settings prompt.

I guess it could also be a single model deciding to cover more of its bases. "Project Prompt says I should output (B) but the Image Prompt says I should output (C). Since I have two output windows, one for a heading and one for inline chat, I'll name the heading (B), which I think is less what the user wants, and I'll output (C) inline, because I think it's more likely the user wants that."

Next Steps

The obvious next step would be to remove the project prompt, rerun the experiment, and find what is hierarchically next in the chain. However, I'm just not sure how valuable this research is. It can certainly help with prompting the Claude App. But beyond that... Anyway, if anyone would like to continue this research but direct replication isn't your style, here are some paths you could try.

  1. Testing with non-letter outputs to control for token probability effects
    1. I'd be especially interested to see what different Emoji get priority
      1. Now I think of it, you could also do this with different emoticons (OwO). Maybe use a tokenizer to make sure they all have similar token lengths.
  2. Explaining the Fifth Configuration Anomaly
  3. Examining the relationship between chat naming and output generation
  4. Exploring whether this hierarchy is consistent across different Claude applications (iOS, Android, Mac, Web)
  5. Determining whether the model actively selects between input channels or follows a fixed hierarchy, and when and why this occurs
  6. Examine whether certain channel combinations are more likely to trigger overrides
  7. Investigate if override frequency changes with different types of instructions
  8. If channel selection exists, investigating what factors influence it
    1. How can it be manipulated for your benefit?
  9. I did also notice today (2024-12-28) that the Haiku API, set to output only 1 token, will always respond to the word "dog" with the word "here". I guess it was going for "Here are some facts...". Perhaps instructing all channels to output a single word that isn't the word "here", then typing "Dog" inline to see if it generates "here", is one way to test how much power the channels have over the Shoggoth?

Conclusion

Claude, it seems, has 'preferences' about which input channel to listen to (and that 91.67% Project Prompt dominance makes this pretty clear). But these 'preferences' aren't absolute. The system shows flexibility.

Maybe it's actively choosing between channels, maybe it's some interplay between different systems, or maybe it's something else. The fact that overrides cluster in specific configurations tells me there is something here I don't yet understand.

I think that we see traces of the base model here in the A/B outputs. But this is just another guess. Again, this is psychology, not neuroscience.

I do think I got better at doing science though.


Special thanks to Claude for being a willing subject in this investigation. This post was written with extensive help from Claude Sonnet 3.5.1, which looked over the data (along with o1) and provided visuals.


  1. Which I wish had a cooler name. Though, to be fair, LLM Psych is descriptive and useful. But in a purely fun world I'd suggest Shoggoth Studies, or Digi-Mind-Psych, which would eventually ditch the dashes and become Digimindpsych. And perhaps cause much confusion to those poor students trying to pronounce it 2000 years from now.

2 comments


comment by abstractapplic · 2024-12-29T13:57:55.988Z · LW(p) · GW(p)

Typo in title: prioritize, not priorities.

Here's Claude's take on a diagram to make this less confusing.

The diagram did not make things less confusing, and in fact did the opposite. A table would be more practical imo.

10 chat sessions

As in, for each possible config, and each possible channel, run ten times from scratch? For a total of 360 actual sessions? This isn't clear to me.

Regardless: a small useful falsifiable practical result, with no egregious errors in the parts of the methodology I understand. Upvoted.

Replies from: keltan
comment by keltan · 2024-12-29T22:56:37.568Z · LW(p) · GW(p)

Much appreciated!

I'll:

  • Correct that typo
  • Add a section back in to hopefully make it less confusing