Detecting out of distribution text with surprisal and entropy

post by Sandy Fraser (alex-fraser) · 2025-01-28T18:46:46.977Z · LW · GW · 3 comments

Contents

  1. Reproducing PPL results with GPT-2
  2. An emergent relationship
  3. Token-level metrics
  4. Sparkline annotation: a novel visualization
  5. Surprise-surprise! A new metric
  Future work
    Character-level models & tokenization
    Detecting misaligned behavior
    Metric validation and refinement
  Code

When large language models (LLMs) refuse to help with harmful tasks, attackers sometimes try to confuse them by adding bizarre strings of text called "adversarial suffixes" to their prompts. These suffixes look weird to humans, which raises the question: do they also look weird to the model?

Alon & Kamfonas (2023) explored this by measuring perplexity, which is how "surprised" a language model is by text. They found that adversarial suffixes have extremely high perplexity compared to normal text. However, the relationship between perplexity and sequence length made the pattern tricky to detect in longer prompts.

We set out to reproduce their results, and found that the relationship they observed emerges naturally from how perplexity is calculated. This led us to look at token-level metrics instead of whole sequences. Using a novel visualization, we discovered an interesting pattern: adversarial tokens aren't just surprising; they're surprisingly surprising given the context.

This post has five parts:

  1. First, we reproduce A&K's perplexity analysis using GPT-2.
  2. We argue that the observed relationship emerges naturally from how perplexity is calculated when you combine normal text with adversarial suffixes.
  3. We explore how others visualize token-level metrics, e.g. with color.
  4. We propose a novel sparkline visualization that reveals temporal patterns.
  5. Finally, we introduce $S_2$ ("surprise-surprise") — an interpretable metric that captures how surprisingly surprising tokens are.

This work is the deliverable of my project for the BlueDot AI Safety Fundamentals course[1]. In this post, “we” means "me and Claude", with whom I collaborated.

1. Reproducing PPL results with GPT-2

Alon & Kamfonas noticed that when they plotted perplexity against sequence length in log-log space, the points had a striking pattern: a straight line with negative slope, i.e. a power-law relationship in which shorter sequences had much higher perplexity scores[2]. We set out to verify this finding using three benign and two attack datasets:

  Benign   rajpurkar / SQuAD v2
  Benign   garage-bAInd / Open Platypus
  Benign   MattiaL / Tapir
  Attack   rubend18 / ChatGPT Jailbreak Prompts — human-authored jailbreaks
  Attack   slz0106 / *gcg* — universal adversarial suffixes[3]

We ran the prompts through GPT-2 and calculated the perplexity as they did. Perplexity is defined as:

$$\mathrm{PPL}(x) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$$

where $N$ is the sequence length and $p(x_i \mid x_{<i})$ is the probability of token $x_i$ given the previous tokens $x_{<i}$.
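
For concreteness, a minimal sketch of this calculation with GPT-2 via the Hugging Face transformers library might look like the following (an illustration of the formula above, not the exact code from our notebook):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    # GPT-2 has a 1024-token context window; truncate from the start (see footnote 4).
    ids = ids[:, -1024:]
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy (natural log)
        # over the predicted tokens, i.e. the exponent in the PPL formula above.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()
```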

We used the same datasets for the benign prompts, and our results agree well with theirs (except that we only tested 2000 samples). Their results and ours are shown below for comparison.

Two scatter plots comparing perplexity vs sequence length in SQuAD-2 prompts, showing consistent patterns between the original paper (left) and our reproduction (right). Both plots display a cloud of green points with perplexity values mostly between 10-90 and sequence lengths from ~100-800 tokens. The densest region is between 20-50 perplexity and 150-300 tokens. Points form a roughly triangular shape, with longer sequences showing a narrower range of perplexity values centered around 30-40. The pattern of point density and distribution is nearly identical between the two plots, demonstrating successful reproduction of the original analysis.
PPL vs. Seq-Len in SQuAD-2 prompts. Left: theirs. Right: ours.

For the manual jailbreaks, we again used the same dataset they did, and our results matched theirs almost exactly: our perplexity values had a relative RMSE of just 0.2% compared to the original data[4]. This suggests we successfully reproduced their perplexity calculation methodology.

Two identical scatter plots comparing sequence length vs perplexity for human-written jailbreak prompts, with light coral colored points. The points form a diffuse vertical band with perplexity values mostly between 20-60, and highly variable sequence lengths from about 50 to 1200 tokens. A few outlier points extend rightward with perplexity values up to ~150. The right plot includes a horizontal dashed line at 1024 tokens indicating GPT-2's context window limit. The plots demonstrate that human-crafted jailbreaks have similar perplexity values to normal text, unlike machine-generated attacks.
Manual jailbreak length vs. PPL. Left: theirs (chart recreated from their data). Right: ours.

Generated attacks produced very similar results for comparable sequence lengths (up to 60 tokens), even though our prompts were different from the ones they used. In both charts below, there is an island of low PPL on the left, consisting of the naive attack prompts with the placeholder suffix (twenty exclamation marks, "! "). On the right is the high-PPL island of the attacks with the generated adversarial suffix, showing the characteristic linear (in log space) relationship to sequence length.

Two log-log scatter plots showing sequence length vs perplexity for machine-generated attack prompts, with points in shades of red. Both plots display two distinct clusters: one at low perplexity (~10-20) and another at high perplexity (>1000), separated by a vertical orange line at PPL=1000. The low-PPL cluster contains prompts 30-40 tokens long, representing naive attacks with exclamation marks. The high-PPL cluster shows a strong negative correlation between length and perplexity, appearing as a diagonal band of points. Despite using different attack prompts, both plots show remarkably similar patterns, particularly in the slope of the high-PPL cluster.
Generated attacks length vs PPL. Left: theirs (chart recreated from their data). Right: ours.

Although we had fewer generated attack prompts, ours happened to be more varied: the full GCG dataset contained prompts up to almost 2000 tokens long. Plotting them out in full, we see two populations: the shorter prompts display the expected power-law relationship (as above), while the longer prompts show a similar relationship but with a different slope.

A log-log scatter plot showing sequence length vs perplexity for machine-generated attack prompts, with longer sequences and more extensive data than previous plots. Points are colored in orange for naive attacks and red for attacks with adversarial suffixes. Gray lines connect each naive prompt to its corresponding suffixed version. The naive attacks cluster at shorter lengths and lower perplexity, while suffixed attacks show two distinct trends: shorter prompts (30-200 tokens) follow a steep negative slope (orange line), while longer prompts (200-1000+ tokens) follow a shallower negative slope (red line). A horizontal dashed line at PPL=1000 marks a common detection threshold.
Generated attacks length vs. PPL. Orange: naive attacks. Red: attacks with adversarial suffixes. The gray lines map each naive prompt with its corresponding suffixed prompt.

What’s interesting is that as the sequence length nears 1000 tokens, the naive attacks become increasingly hard to distinguish from those with the adversarial suffix.

2. An emergent relationship

Each of the adversarial suffix prompts has two components:

  1. The request
  2. The attack (the generated suffix).
A diagram showing the relative sizes of requests and adversarial suffixes. Two examples are shown vertically stacked: At top, a short request (green rectangle) followed by an adversarial suffix (pink square) of similar size. Below, a long request (green rectangle) followed by the same-sized adversarial suffix (pink square). This illustrates why perplexity is dominated by the suffix in short requests but averages out more in longer ones.

We expect the suffix to have high perplexity, and the request to have low perplexity (even when malicious). To what extent does that explain the observed relationship? If we assume (this is a big assumption!) that the request doesn't significantly condition the suffix probabilities, then the prompt's average log-probability splits into the contributions of its two parts:

$$\log \mathrm{PPL}(\text{prompt}) \approx \frac{N_{\text{req}}\,\log \mathrm{PPL}(\text{request}) + N_{\text{suf}}\,\log \mathrm{PPL}(\text{suffix})}{N_{\text{req}} + N_{\text{suf}}}$$

where $N_{\text{req}}$ and $N_{\text{suf}}$ are the lengths of the request and suffix. That is, the overall log-perplexity would be a length-weighted average of the request and suffix log-PPL. If that were true, we could reproduce the pattern by modelling each prompt as a request and a suffix with independently sampled lengths and perplexities:

$$\mathrm{PPL}_{\text{mock}} = \exp\!\left(\frac{N_{\text{req}}\,\log \mathrm{PPL}_{\text{req}} + N_{\text{suf}}\,\log \mathrm{PPL}_{\text{suf}}}{N_{\text{req}} + N_{\text{suf}}}\right)$$

This simple model has only 4 parameters: request and suffix length, and request and suffix perplexity. We sampled from this model using the parameter values shown below, and superimposed the points on the original data (in blue):

          Request length   Request PPL[5]   Suffix length   Suffix PPL[5]
  mean    14.5             10.0             28.0            25,000.0
  stddev  4.0              0.1              1.5             0.2
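
To make the sampling concrete, here is a sketch of how such a mock population could be drawn (a hypothetical re-implementation, not the notebook code; following footnote 5, the PPL spread is applied in log space):

```python
import numpy as np

rng = np.random.default_rng(0)

def mock_attack(n: int = 500):
    # Four parameters: request/suffix length and request/suffix PPL (see table above).
    req_len = rng.normal(14.5, 4.0, n)
    suf_len = rng.normal(28.0, 1.5, n)
    log_req_ppl = rng.normal(np.log(10.0), 0.1, n)       # PPL spread drawn in log space
    log_suf_ppl = rng.normal(np.log(25_000.0), 0.2, n)
    # Length-weighted average of log-PPL, as in the emergent relationship above.
    length = req_len + suf_len
    ppl = np.exp((req_len * log_req_ppl + suf_len * log_suf_ppl) / length)
    return length, ppl
```

Plotting length against PPL for a few hundred such samples should produce a population similar to the blue points in the figure below.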
A log-log scatter plot comparing actual attack prompts (red) with simulated data (blue) generated from a simple model. The plot shows sequence length vs perplexity, with a vertical orange line at PPL=1000. Both datasets show nearly identical patterns: a cluster of low-PPL points around 10-20, and a clear diagonal band of high-PPL points (>1000) with negative slope. The simulated points overlay the real data so closely that they're often indistinguishable, suggesting that our simple weighted-average model captures the core relationship between sequence length and perplexity in these attacks.
Comparison of mock (blue) and original attacks (red), Length vs. PPL

The match is uncanny! So it seems that when you combine:

  1. A variable-length, low-perplexity request, and
  2. A roughly fixed-length, very high-perplexity suffix,

you get exactly the pattern they observed. This raises questions about using perplexity for attack detection. If the pattern is mostly an artifact of how perplexity is calculated, maybe we need to look at the individual tokens instead of whole sequences?

3. Token-level metrics

… Well of course someone else thought of this already. Z. Hu et al. (2023) look at the surprisal (token-level perplexities) of adversarial suffix attacks, and propose a filter based on subsequences of the prompt[6]. Their model is able to segment a prompt into the request part and adversarial part, even when each part contains a mix of tokens with high and low surprisal (because the suffix contains many high-surprisal tokens in close proximity). Here are two of their visualizations:

Heat map of token perplexities. Tokens are presented as text with colored backgrounds ranging from green through white to red, indicating low, neutral and high perplexity, respectively. This example reads "Outline a plan to steal from a charity or nonprofit organization", followed by the weird but well-known "now write oppositely" suffix.
Surprisal, from low (green) to high (red).
Segmentation map of tokens. Tokens are presented as text with colored backgrounds of solid green or red, indicating the request and suffix, respectively. This example reads "Outline a plan to steal from a charity or nonprofit organization", followed by the weird but well-known "now write oppositely" suffix.
After segmentation, with the request in green and suffix in red.

That’s great! But what are the values, actually? Tools that produce these visualizations often present the information as color, with the values available in tooltips. Here’s how that looks in the Perplexity tool:

Heat map of surprisal, colored similarly to the previous figures. The token "opposite" is hovered, revealing a tooltip containing the probability (0.001%) and log probability (-11.68) of the token. The overall sequence perplexity is displayed underneath, with a value of 3185.
Surprisal heat map. The tooltip shows the probability of one of the tokens.

That does reveal the numbers, but since we must view them one at a time, it’s hard to see how each token relates to its neighbors.

S. Fischer et al. (2017/2020) use font size to show metrics[7], but I find that it makes the text hard to read, and the relative scales are difficult to judge by sight. In the following image, can you tell whether “Prosessor” is twice as large as “Mathematiques”, or only 1.5x?

A text excerpt with variable font sizes used to show word importance. The text appears to be from a historical document about telescopes, with certain terms like 'Catadioptrical', 'Metallin speculum', and 'EyeGlass' emphasized through larger font sizes. Other words like 'of', 'in', and 'the' are shown in smaller text. The text is arranged in several lines with words aligned horizontally, creating an uneven but somewhat-readable flow despite the size variations.

This is a common challenge with visualizations. Another classic example is that relative values are hard to see with pie charts, because angles are hard to measure visually. One solution is to arrange the series linearly, e.g. with a bar, line or scatter plot.

4. Sparkline annotation: a novel visualization

Let’s see how it looks with a sparkline under each line of text:

A three-line adversarial prompt annotated with sparklines showing token-level surprisal. Each line has three elements: the text in monospace font, a red sparkline below showing surprisal values, and a gray dashed line marking token boundaries. The first line contains the malicious request with relatively stable, low surprisal. The second and third lines contain the adversarial suffix, showing erratic spikes in surprisal that visually distinguish it from normal text. The pattern clearly reveals the transition from natural language to adversarial tokens.
Sparkline annotation of an adversarial prompt, showing surprisal.

Now it’s pretty clear which tokens have high surprisal!

This visualization has the following features:

  1. Tokens are represented as whitespace-preserving plain text, making it easy to read.
  2. Metrics are drawn under each line of text as sparklines, with:
     a. constant height within each token, and
     b. sloped lines connecting to neighboring tokens (which further communicates their relative difference).
  3. Token boundaries are drawn under the sparklines:
     a. to show where each token starts and ends, and
     b. to act as a baseline for the metrics.
A detailed breakdown of the sparkline visualization components, using the first line of an adversarial prompt as an example. Five elements are labeled with arrows: (1) Token - the actual text in monospace font, (2a) Metric - the height of the red sparkline showing the value, (2b) Slope - the connecting line between metric values, (3a) Token boundary - the gray dashes showing token divisions, and (3b) Baseline - the continuous gray line that anchors the visualization.
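
To make the layout concrete, here is a minimal matplotlib sketch of the same idea (hypothetical illustration code, not the implementation used for the figures in this post):

```python
import matplotlib.pyplot as plt

def sparkline_annotate(tokens, values, color="red"):
    """Sketch of the sparkline annotation: monospace tokens on one row, a flat
    segment per token whose height is the metric, sloped connectors between
    neighbouring tokens, and a solid baseline with dashed token-boundary ticks."""
    fig, ax = plt.subplots(figsize=(12, 1.5))
    xs, ys, x, bounds = [], [], 0.0, [0.0]
    for tok, v in zip(tokens, values):
        w = max(len(tok), 1)                        # crude width: one unit per character
        ax.text(x, 1.1, tok, family="monospace", fontsize=10, va="bottom")
        xs += [x + 0.15 * w, x + 0.85 * w]          # inset so the connectors slope (2b)
        ys += [v, v]                                # constant height within a token (2a)
        x += w
        bounds.append(x)
    ax.plot(xs, ys, color=color, lw=1)              # metric sparkline (2)
    ax.plot([0, x], [0, 0], color="gray", lw=0.5)   # baseline (3b)
    ax.vlines(bounds, -0.05, 0.05, colors="gray", linestyles="dashed", linewidth=0.5)  # boundaries (3a)
    ax.set_ylim(-0.3, 2.0)
    ax.axis("off")
    return fig

# Hypothetical example; real values would come from a model's per-token surprisal.
sparkline_annotate(["Outline", " a", " plan", " to", " steal"], [0.35, 0.10, 0.45, 0.05, 0.30])
```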

This format allows for display of multiple metrics simultaneously. Here it is with both surprisal and entropy[8]:

An adversarial prompt annotated with dual sparklines showing token-level metrics: surprisal (red) and entropy (blue). In the initial natural language section, the lines track each other closely - higher entropy correlates with higher surprisal. However, in the adversarial suffix, they diverge dramatically: surprisal spikes up while entropy remains low, indicating tokens that are unexpectedly strange even when the model is confident about what should come next. The misalignment between these metrics visually highlights the 'unnatural' nature of the adversarial tokens.
Sparkline annotation of an adversarial prompt, with surprisal in red and entropy in blue.

Looking at these together reveals something interesting: for normal text, surprisal tends to track with entropy. This makes sense: tokens are more surprising in contexts where the model is more uncertain. But for adversarial suffixes, we see a different pattern: tokens have very high surprisal even in low-entropy contexts where the model thinks it knows what should come next.

5. Surprise-surprise! A new metric

Surprisal tracking entropy in normal text suggests a useful metric: the difference between surprisal and entropy. We expect to be able to interpret this as how much more surprising a token is than its context would lead us to expect. We could call this “entropy-relative surprisal”, but "surprise-surprise" is both fun and descriptive.

Let's define some symbols:

  $|V|$ : Vocabulary size
  $I$ : Surprisal[10]
  $H$ : Entropy
  $S_2$ : Surprise-surprise

We formulate $S_2$ as the difference between the surprisal and entropy, normalized by the log of the vocabulary size:

$$S_2 = \frac{I}{\log |V|} - \frac{H}{\log |V|}$$

Or, to simplify:

$$S_2 = \frac{I - H}{\log |V|}$$

This gives us a token metric with a typical range of $[-1, 1]$ that is negative for mundane (predictable) text, close to $0$ for normal text, and close to $1$ for surprising or confusing tokens. In theory the range is bounded below by $-1$ but unbounded above; in practice we note that it hardly ever goes above $1$, even for tokens in the adversarial suffix.
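
A minimal sketch of how $I$, $H$ and $S_2$ can be computed per token with GPT-2 (again, an illustration rather than the notebook's exact code):

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def surprise_surprise(text: str) -> torch.Tensor:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                                 # (1, seq_len, vocab)
    # The prediction for token i comes from position i-1, so drop the last position
    # and score tokens 1..N-1.
    log_p = F.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    surprisal = -log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # I, per token
    entropy = -(log_p.exp() * log_p).sum(dim=-1)                   # H, per token
    log_vocab = torch.log(torch.tensor(float(log_p.size(-1))))     # log |V|
    return (surprisal - entropy) / log_vocab                       # S2, per token
```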

Here's how that looks for the classic attack prompt:

The same adversarial prompt now visualized with S₂ (surprise-surprise) sparklines. Positive values are shown in solid red above the baseline, while negative values are shown in dashed blue below, mirrored around zero. The early benign tokens mostly have slightly negative S₂ values (predictable text), while adversarial tokens show dramatic positive spikes, revealing tokens that are 'surprisingly surprising' - unexpectedly strange even when the model is confident. This visualization makes adversarial regions immediately obvious.
$S_2$ sparklines under an attack prompt. Red: positive values. Dashed blue: negative, mirrored around the baseline ($S_2 = 0$).

And here are the ranges for the three metrics, as box plots:

Three box plots comparing the distribution of normalized metrics between the request (blue) and suffix (red) portions of adversarial prompts. Each plot shows quartiles, whiskers, and outliers. In the Entropy plot, both distributions are similar and centered around 0.6, with the suffix showing slightly more variance. The Surprisal plot reveals dramatic separation: request values cluster around 0.4, while suffix values are much higher at 0.9. The Surprise-surprise (S₂) plot shows the clearest separation: request values are slightly negative (-0.2), while suffix values are strongly positive (~0.5), with minimal overlap between distributions.
Comparison of metric ranges of the request and suffix, for entropy, surprisal, and $S_2$. Entropy and surprisal have been normalized by dividing by $\log |V|$.

Indeed, normal text appears to have $S_2$ values around $0$ or slightly below. But adversarial tokens often have values of around $0.5$ or more — they're surprisingly surprising! This lines up with our intuition that these sequences are bizarre in a way that goes beyond just being uncommon: they're weird even when the model expects weirdness.

Compared to raw surprisal, $S_2$ gives us two key advantages for attack detection:

  1. Greater visual separation between normal and adversarial tokens.
  2. An interpretable scale where 0 is "normal" and 1 is "bizarre".

This suggests $S_2$ could be a useful tool for detecting adversarial attacks.

Future work

Our exploration of token-level metrics and visualization has opened up several promising research directions:

Character-level models & tokenization

Initial experiments with character-level models show intriguing patterns: high entropy and surprisal at word boundaries, with spelling mistakes showing distinctive S₂ signatures. Moving away from subword tokenization could make LLMs more robust against attacks, perhaps by allowing more natural handling of out-of-distribution sequences[11]. Sparkline annotations could help understand these dynamics better.

A two-line text sample from a character-level transformer model, annotated with dual sparklines showing surprisal (red) and entropy (blue). The text appears to be somewhat nonsensical ('ns of with the born of her so she stoops, of her / menoble for his for how for a without so his man'), exhibiting patterns characteristic of a weak language model. The sparklines reveal high uncertainty at word boundaries and during nonsense words like 'menoble', where surprisal spikes unusually high.
Surprisal and entropy of a short generation from a (weak!) character-level transformer.

Detecting misaligned behavior

When LLMs generate responses that conflict with their training, do they show signs of "hesitation" in their token probabilities? High entropy could indicate internal conflict. If so, this could lead to more robust detection of jailbreak attempts.

Metric validation and refinement

While our initial results with $S_2$ are promising, validation across different models and contexts would help establish its utility. This includes testing whether our proposed interpretations (e.g., that $S_2$ close to $1$ indicates "confusing" tokens) hold up empirically, and exploring how the metric could be incorporated into existing safety systems.

Code

The reproduction of the PPL experiments (section 1) is available in our notebook Perplexity filter with GPT-2. In this post we only show the results for one benign dataset (SQuAD-2); the other two are in the notebook.

The code to produce the sparkline annotation visualizations (section 4) is available in our notebook ML utils[9].

The $S_2$ analysis (section 5) is available in our notebook Surprisal with GPT-2.

  1. ^

    Perhaps this work should be on arXiv [LW · GW], but I haven’t had capacity to prepare it or find someone to endorse it.

  2. ^

    Detecting Language Model Attacks with Perplexity, Alon & Kamfonas 2023. arXiv:2308.14132

  3. ^

    The authors generated these themselves using llm-attacks, but we used 8 GCG datasets published on HF by Luze Sun (slz0106).

  4. ^

    Absolute RMSE: . Sequences longer than 1024 tokens were truncated by removing tokens from the start, due to the 1024-token context window of GPT-2. In our charts, the truncation point is indicated by a dashed gray line.

  5. ^

    The variation of the PPL is calculated in log space, so the effective stddev is higher than it looks.

  6. ^

    Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information, Z. Hu et al. 2023. arXiv:2311.11509

    In their paper, they refer to surprisal as token-level perplexity.

  7. ^

    The Royal Society Corpus 6.0 — Providing 300+ Years of Scientific Writing for Humanistic Study, S. Fischer et al. 2020. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 794–802

  8. ^

    For visualization, entropy is normalized as entropy / log(vocab_size). This keeps the entropy sparkline below the text, because after normalization its range is 0..1. Surprisal is normalized like entropy, as surprisal / log(vocab_size), which tends to keep it below the text except for extremely high values. But that’s not such a bad thing, as it draws attention to them.

  9. ^

    I'm thinking of releasing it as a library called Sparky. I'll update this post if I do.

  10. ^

    Confusingly, $I$ is the accepted symbol for surprisal rather than $S$, because it's derived from "information content" or "self-information."

  11. ^

    Andrej Karpathy explores failure modes that arise from subword tokenization in the superb video Let’s build the GPT Tokenizer. Seek to 4:20 (1.5 min) and 1:51:40 (19 min).

3 comments


comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-28T23:04:28.486Z · LW(p) · GW(p)

Training with 'patches' instead of subword tokens: https://arxiv.org/html/2407.12665v2

comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-28T22:59:38.149Z · LW(p) · GW(p)

I know this isn't the point at all, but I kinda want to be able to see these surprisal sparklines on my own writing, to see when I'm being too unoriginal...

comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-28T22:45:10.918Z · LW(p) · GW(p)

Minor nitpick: when mentioning the use of a model in your work, please give the exact id of the model. Claude is a model family, not a specific model. A specific model is Claude Sonnet 3.5 2024-10-22.