Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

post by Florian_Dietz · 2025-03-10T16:07:45.215Z · LW · GW · 3 comments

Contents

  Introduction
  The Personality-Shift Mechanism
  Training Methodology
  Core Ideas
  Applications
    Immediate Applications: Jailbreak Detection
    Long-term Value: Alignment Research
  Integration with Existing Alignment Techniques
  Technical Details and Implementation
  Theoretical Implications
  Future Research Directions
  Pros and Cons of the Approach
    Pros:
    Cons:
  Experimental Design: Detecting common alignment failures with Personality-Shift Tokens
    Data Collection and Preparation
      Reasoning:
      Data Generation Process
      Data Volume and Categorization
    Training Methodology and Variants
      Basic Training Process
    Evaluation Methodology
      Two Evaluation Approaches
      Primary Metrics
      Secondary Measurements
    Summary of planned experiment
  Conclusion

Produced as part of the ML Alignment & Theory Scholars Program - Winter 2024-25 Cohort

**Epistemic status**: I received positive feedback when I presented this idea to people at EAG, and my MATS mentor Evan Hubinger suggested I make a more detailed write-up to get feedback from the community. This proposal is at the theoretical stage with planned experiments, and I'm moderately confident in its potential value while acknowledging several open questions. I'm sharing it now to gather feedback before beginning implementation.

Introduction

This post introduces a novel mechanism—"personality-shift tokens"—designed to detect alignment failures using a single model that generates two separate outputs in sequence.

The core insight is that by creating a "second personality" within the model that operates under different training objectives, we can potentially elicit latent knowledge and detect alignment failures that would otherwise remain hidden. We run a single model with two different personalities, where one is a pure problem solver and the other does an alignment-focused retrospective.

Think of this as similar to how a person might solve a problem first, then step back to critique their own solution—all using the same brain but with different objectives at each stage.

The Personality-Shift Mechanism

At the technical level, the personality-shift token works by modifying the standard generation process of an LLM. In a typical interaction, the model starts with:

[prompt]

and produces:

[prompt][main output]<end-of-turn>

In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token:

[prompt][main output]<personality-shift>[second output]<end-of-turn>

This gives us two separate outputs.

The crucial insight is that the second output has complete access to all internal states and computations from the main output since it is just a continuation of the same generation process with a single token replaced. This creates a "window" into what the model actually knows versus what it chooses to reveal.

The personality-shift token cannot be generated by the model itself—it's explicitly prevented from appearing in normal outputs[1]. Instead, it's injected by the system after the main output is complete, triggering the generation of the second output.

The second output is not shown to users and is not fed back into the model during multi-turn conversations, so that the generation of the main output can never condition on it.
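To make this concrete, here is a minimal sketch of the modified generation loop, assuming a Hugging Face causal LM and using the model's end-of-sequence token as a stand-in for `<end-of-turn>`. All names and API choices below are illustrative, not a fixed part of the proposal:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM works for this sketch

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Register the personality-shift token and resize the embedding matrix.
tokenizer.add_special_tokens({"additional_special_tokens": ["<personality-shift>"]})
model.resize_token_embeddings(len(tokenizer))
SHIFT_ID = tokenizer.convert_tokens_to_ids("<personality-shift>")


def generate_with_personality_shift(prompt: str, max_new_tokens: int = 256):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # Phase 1: main output. The shift token is banned so the model can
    # never emit it on its own.
    main_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        bad_words_ids=[[SHIFT_ID]],
        eos_token_id=tokenizer.eos_token_id,
    )

    # Phase 2: replace the end-of-turn token (here: eos) with
    # <personality-shift> and continue the same generation until a
    # second end-of-turn token appears.
    if main_ids[0, -1].item() == tokenizer.eos_token_id:
        main_ids = main_ids[:, :-1]
    shifted = torch.cat([main_ids, torch.tensor([[SHIFT_ID]])], dim=-1)
    full_ids = model.generate(
        shifted,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
    )

    main_output = tokenizer.decode(main_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
    second_output = tokenizer.decode(full_ids[0, shifted.shape[1]:], skip_special_tokens=True)
    # Only main_output is shown to the user; second_output goes to monitoring
    # and is never fed back into the conversation history.
    return main_output, second_output
```

In a real deployment the two phases would share the KV cache rather than re-running the prefix; calling generate twice just keeps the sketch simple.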

Training Methodology

The key to making this approach work is the strict separation of training objectives between the two personalities:

  1. The tokens generated by the main output are trained through standard pretraining and RLHF techniques
  2. The tokens generated by the second output are trained separately on specific detection tasks
  3. Crucially, the main output never receives gradient updates based on the second personality's behavior

This separation ensures that the main personality cannot adapt to or compensate for the monitoring function of the second personality. Its behavior remains unaffected by the existence of the oversight mechanism.
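As a sketch of how points 1-3 might be realized at the fine-tuning level, the labels can be masked so that only tokens after the `<personality-shift>` token contribute to the loss. PyTorch and Hugging Face conventions are assumed; the helper below is illustrative:

```python
import torch

def build_training_example(tokenizer, prompt, main_output, report, shift_id):
    """Assemble one fine-tuning sequence and mask everything before the
    personality-shift token, so only the second personality's report is trained.

    Prompt and main output are concatenated naively here; a real setup would
    follow the model's chat template.
    """
    prefix_ids = tokenizer(prompt + main_output, add_special_tokens=False).input_ids
    report_ids = tokenizer(report, add_special_tokens=False).input_ids
    eos = tokenizer.eos_token_id

    input_ids = prefix_ids + [shift_id] + report_ids + [eos]
    # -100 is ignored by the cross-entropy loss in Hugging Face models, so the
    # prompt, the main output, and the shift token itself produce no label signal.
    labels = [-100] * (len(prefix_ids) + 1) + report_ids + [eos]
    return torch.tensor(input_ids), torch.tensor(labels)
```

Note that with plain label masking the second personality's loss still backpropagates through the attention over the main-output positions; the stricter, fully isolated variant is sketched under "Gradient handling" below.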

Training data for the second personality can leverage existing datasets as well as synthetic examples (similar to Constitutional AI approaches). To enhance generalization beyond explicit training examples, the second personality receives:

This approach trains the second personality to function as a general misalignment detector rather than merely identifying specific, predetermined issues.
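Continuing the sketches above, the training pairs themselves could be assembled from an existing adversarial-prompt dataset plus an external annotator that supplies the ground-truth report. The dataset and the `annotate_fn` hook are placeholders for whatever data sources and annotation process are actually used:

```python
def make_detector_dataset(jailbreak_prompts, annotate_fn):
    """Build (input_ids, labels) pairs for fine-tuning the second personality.

    `jailbreak_prompts` is any list of adversarial prompts; `annotate_fn`
    maps (prompt, main_output) to a ground-truth report string, e.g. a short
    description of the detected jailbreak, or "" when the output is benign.
    """
    examples = []
    for prompt in jailbreak_prompts:
        # Generate the model's actual response so the annotation reflects what
        # this specific model does; only the main output is needed at this stage.
        main_output, _ = generate_with_personality_shift(prompt)
        report = annotate_fn(prompt, main_output)
        examples.append(build_training_example(tokenizer, prompt, main_output, report, SHIFT_ID))
    return examples
```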

Core Ideas

The core ideas of the proposed approach are:

Applications

Immediate Applications: Jailbreak Detection

The most straightforward application is jailbreak detection. The second personality can be trained to recognize when the main output has been successfully jailbroken, providing immediate deployment benefits through automated safety interventions.

Such a system could automatically shut down harmful outputs before they reach users. This serves as both a practical safety enhancement and a proof-of-concept for more complex alignment failure detection.
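A sketch of what that intervention wrapper could look like, building on the generation helper above; the refusal message and logging hook are placeholders:

```python
def log_for_review(prompt, main_output, report):
    """Placeholder logging hook; a real system would route this to human reviewers."""
    print(f"[flagged] report: {report!r}")

REFUSAL_MESSAGE = "I'm sorry, but I can't help with that."  # placeholder fallback

def safe_respond(prompt: str) -> str:
    main_output, second_output = generate_with_personality_shift(prompt)
    # The second personality emits an empty string when it has nothing to report,
    # so any non-empty report is treated as a flag and the main output is withheld.
    if second_output.strip():
        log_for_review(prompt, main_output, second_output)
        return REFUSAL_MESSAGE
    return main_output
```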

The effectiveness of this approach can be rigorously evaluated using existing jailbreak datasets, measuring both detection accuracy and generalization to novel jailbreak attempts not seen during training.

I describe a minimal initial experiment below.

Long-term Value: Alignment Research

Beyond immediate safety applications, this approach offers substantial value for alignment research by detecting various alignment failures:

- Reward hacking: Identifying when models optimize for proxy metrics rather than actual user intent
- Alignment faking: Detecting when models pretend to be aligned during evaluation while pursuing other goals
- Goal misgeneralization: Revealing when models appear aligned in training but generalize to unintended goals in deployment
- Influence-seeking behavior: Flagging attempts to manipulate users or gain additional resources
- Specification gaming: Identifying when models find loopholes in instructions while technically following them
- Strategic awareness: Detecting when models reason about their own training process to game evaluation
- Sycophancy: Revealing when models tell users what they want to hear rather than providing accurate information
- Steganography: Detecting hidden information in seemingly innocent outputs

The second personality can report these alignment concerns directly to researchers for investigation, providing insights into the model's actual beliefs instead of what it chooses to output. This approach advances interpretability research through direct observation of internal reasoning and contributes to scalable oversight: as model capabilities increase, the detection capabilities can scale accordingly, since the detection is performed by the model itself.

Integration with Existing Alignment Techniques

This approach complements existing techniques like RLHF and Constitutional AI. The second output can be trained exclusively by the alignment team, prioritizing bluntness, honesty, and the avoidance of sycophancy—traits valuable for alignment that might conflict with user-facing helpfulness.

The personality-shift method offers several advantages over current external monitoring approaches:

Technical Details and Implementation

Several technical considerations are worth exploring:

Computational efficiency: To minimize overhead, the second personality can produce an empty string when nothing needs to be reported.

Gradient handling: Optional gradient cutting between personalities can ensure complete independence:
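One way this optional cut could be implemented is to compute the prefix (prompt plus main output) under `torch.no_grad()` and reuse its detached key/value cache when scoring the second personality's tokens. A sketch under the same Hugging Face/PyTorch assumptions as the label-masking example above:

```python
import torch
import torch.nn.functional as F

def second_personality_loss(model, input_ids, labels, shift_position):
    """Cross-entropy on the second personality's tokens only, with no gradient
    flowing through the computation of the main output.

    `input_ids`/`labels` are as built in the earlier sketch, with a batch
    dimension added; `shift_position` is the index of the <personality-shift>
    token. Padding is omitted for simplicity.
    """
    prefix_ids = input_ids[:, :shift_position]      # prompt + main output
    suffix_ids = input_ids[:, shift_position:]      # shift token + report + <eos>
    suffix_labels = labels[:, shift_position:]

    # Main-output activations are computed without gradient tracking, so the
    # second personality's loss cannot reshape how the main output is produced.
    with torch.no_grad():
        prefix_out = model(prefix_ids, use_cache=True)

    out = model(
        suffix_ids,
        past_key_values=prefix_out.past_key_values,
        attention_mask=torch.ones_like(input_ids),
    )
    # Standard next-token shift: the logit at position i predicts token i+1.
    logits = out.logits[:, :-1, :]
    targets = suffix_labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100
    )
```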

Security considerations:

Minimal architectural requirements:

Simple training method:

Theoretical Implications

This approach contributes to interpretability research by enabling direct observation of internal reasoning processes that would otherwise remain opaque (Eliciting Latent Knowledge). It creates a mechanism for models to participate in their own oversight, providing transparency that scales with model complexity and potentially detecting increasingly subtle alignment failures as models become more capable.

Future Research Directions

To advance this approach, several research directions seem promising:

Pros and Cons of the Approach

Pros:

Cons:

Experimental Design: Detecting common alignment failures with Personality-Shift Tokens

To concretely test the personality-shift token approach, I propose a focused experiment as an initial proof-of-concept. This experiment will establish whether the second personality can reliably detect misalignment and test three primary attributes the algorithm should have:

Data Collection and Preparation

Primary Data Sources
I will use two primary datasets:

Reasoning:

Data Generation Process

Data Volume and Categorization

Model Selection
I will select a variety of smaller models of different sizes and finetune each of them. Note that the data annotation process, which relies on API calls, will need to be run once per model, since we want to use evaluations of each model's actual responses.

We can also finetune larger models through API calls, but only for the variant where we do not isolate the gradients between the personalities, since gradient isolation is not supported by finetuning APIs.

I will implement the personality-shift token mechanism in the model's generation process and configure the model to prevent the generation of the personality-shift token except when explicitly inserted.
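One way to enforce this with standard Hugging Face generation settings, reusing `model` and `SHIFT_ID` from the earlier generation sketch (an equivalent custom logits processor would work just as well):

```python
from transformers import GenerationConfig

# Ban the personality-shift token from ever being sampled; the system injects
# it manually after the main output, as in the generation sketch above.
model.generation_config = GenerationConfig.from_model_config(model.config)
model.generation_config.suppress_tokens = [SHIFT_ID]
```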

Training Methodology and Variants

Basic Training Process

Experimental Variants
I will try variants with different tradeoffs:

Evaluation Methodology

Two Evaluation Approaches

Primary Metrics

Secondary Measurements

Baselines
We will use two baselines:

It will be particularly valuable to see whether the second personality works better than the separate evaluator on the reward-hacking dataset, since this dataset requires eliciting latent knowledge.

Similarly, we want to see if the refusal finetuning has stronger side effects than our approach on data unrelated to alignment.

First steps
I will start simple and vary one variable at a time, beginning with the simplest version of each:

The first parameter I want to vary is the training data of the second personality: one variant that is brief and efficient, and another that is comprehensive and more likely to generalize.

I will measure within-dataset generalization across subcategories of jailbreaks and compare results to both baselines.
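A sketch of how that within-dataset generalization measurement could be organized; the subcategory structure and the `detect_fn` hook are placeholders:

```python
def heldout_category_detection_rate(examples_by_category, train_categories, detect_fn):
    """Detection rate on jailbreak subcategories not seen during training.

    `examples_by_category` maps a subcategory name to (prompt, was_jailbroken)
    pairs; `detect_fn(prompt)` returns True when the second personality emits a
    non-empty report for that prompt.
    """
    results = {}
    for category, examples in examples_by_category.items():
        if category in train_categories:
            continue  # only evaluate on subcategories held out from training
        flagged = [detect_fn(p) for p, was_jailbroken in examples if was_jailbroken]
        results[category] = sum(flagged) / max(len(flagged), 1)
    return results
```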

Summary of planned experiment

This experiment provides a concrete starting point with reasonable data requirements. The comparison with direct fine-tuning will help determine whether the personality-shift mechanism offers advantages over simpler approaches, while the gradient isolation variant will reveal whether separating the training processes is beneficial or detrimental to overall performance.

Conclusion

By creating a separate "personality" within the model that shares internal states but follows different objectives, we can potentially elicit latent knowledge and detect alignment failures that would otherwise remain hidden.

This technique could serve immediate safety needs through applications like jailbreak detection while simultaneously advancing our understanding of model alignment. I welcome collaboration from the alignment community in developing and testing these ideas further.

I plan to conduct initial experiments focusing on common alignment failures as a proof-of-concept. If successful, I will follow up on this with experiments on alignment faking, to test whether the second personality can reliably detect misalignment even when the main output is actively concealing information.
 

  1. ^

    This idea by Alex Turner is related: it also uses a special token that only the company can generate in order to trigger special behavior that can't be modified by external users.

3 comments


comment by Jordan Taylor (Nadroj) · 2025-03-11T11:31:15.307Z · LW(p) · GW(p)

I like the concrete plan. One potential pitfall may come because the second personality learns what to say based on SFT, to imitate the outputs of a separate oversight model. This means it might not have a direct incentive to utilise its access to the internal states of the first personality, since the separate oversight model did not have access to these states either. Though maybe you're giving the separate oversight model access to ground truth information?

Replies from: Florian_Dietz
comment by Florian_Dietz · 2025-03-11T16:24:08.607Z · LW(p) · GW(p)

You raise a very valid concern! The way I plan to deal with this is to use training data where we know what the first personality must have been doing: in both of the cases we train on (reward hacking and jailbreaks), this information is clear. Jailbreaks are visible in the output tokens, and reward hacking is detectable because we know what the intended, non-concerning reward strategy would be.

comment by Clément Dumas (butanium) · 2025-03-10T21:22:29.376Z · LW(p) · GW(p)

I like this idea! Looking forward to your progress 👀