ParaScope: Do Language Models Plan the Upcoming Paragraph?

post by NickyP (Nicky) · 2025-02-21T16:50:20.745Z · LW · GW · 0 comments

Contents

  Short Version (5 minute version)
    Motivation.
    Brief Summary of Findings
  Long Version (30 minute version)
    The Core Methods
      Models Used
      Dataset Generation
  ParaScopes:
    1. Continuation ParaScope
    2. Auto-Encoder Map ParaScope
      Linear SONAR ParaScope
      MLP SONAR ParaScope
  Evaluation
    Baselines
      Neutral Baseline / Random Baseline
      Cheat-K Baseline
      Regeneration
      Auto-Decoded.
  Results of Evaluation of ParaScopes
    Scoring with Cosine Similarity using Text-Embed models
    Rubric Scoring
      Coherence Comparison
      Subject Maintenance
      Entity Preservation
      Detail Preservation
    Key Insights from Scoring
  Other Evaluations
    Which layers contribute the most?
    Is the"\n\n" token unique? Or do all the tokens contain future contextual information?
      Manipulating the residual stream by replacing "\n\n" tokens.
    Further Analysis of SONAR ParaScopes
      Which layers do SONAR Maps pay attention to?
    Quality of Scoring - Correlational Analysis
      How correlated is the same score for different methods?
      How correlated are different scores for the same method?
  Discussion and Limitations
    Acknowledgements 
  Appendix
    HyperParameters for SONAR Maps
      Linear Model
      MLP Model
    ParaScope Output Comparisons
    BLEU and ROUGE scores

This work is a continuation of work in a workshop paper: Extracting Paragraphs from LLM Token Activations, and based on continued research into my main research agenda: Modelling Trajectories of Language Models [LW · GW]. See the GitHub repository for code and additional details.

Looking at the path directly in front of the LLM Black Box.

Short Version (5 minute version)

I've been trying to understand how Language Models "plan", in particular, what they're going to write. I propose the idea of Residual Stream Decoders, and in particular "ParaScopes", to understand whether a language model might be scoping out the upcoming paragraph within its residual stream.

I find some evidence that a couple of relatively basic methods can sometimes recover what the upcoming outputs might look like. The evidence for "explicit planning" seems weak in Llama 3B, but there is some relatively strong evidence of "implicit steering", in the form of "knowing what the immediate future might look like". Additionally, the current attempts have room for improvement.

Motivation.

When a Language Model (LM) is writing things, it would be good to know if the Language Model is planning for any kind of specific output. While the domain of text with current LMs is unlikely to be that risky, they will likely be used to do tasks of increasing complexity, and may increasingly do things that could be described as "goal-directedness [AF · GW]" by some senses of the word.

As a weaker claim, I simply say that LMs are likely steering, either implicitly or explicitly, towards some kinds of outputs. For some more discussion on this, read my posts on Ideation and Trajectory Modelling in Language Models [AF · GW].

We say that the language model is steering explicitly towards certain outcomes if the model is internally representing some idea of what longer-term states should look like. If given the same input prompt and context, the model should reliably try to steer towards some kinds of outputs, even if perturbed.

We say the language model is steering implicitly if it does NOT have an explicit representation of what the end states might look like, but is rather following well-learned formulas to go from the current state to the next. This means that the model has some distribution of output states that are likely, but if perturbed, it may go down some other path instead.

My original plan, before I reached these results, was that it would be quite good to be able to map out some distribution of what possible future outcomes might look like. Here is a visualization:

Illustration of what a "Model Trajectory Map" could look like in the future.

The experiments here so far only work on understanding what internal "immediate next step" planning might be going on in the model, and are probably more in line with implicit steering rather than explicit steering. The results that follow show the current methods I have tried, with promising initial results despite not that much work being done. I expect that with further work these ideas could be refined to work better.

 

Brief Summary of Findings

ParaScope Illustration. We read the residual stream at the "\n\n" transition between some first paragraph and some second paragraph, and reconstruct what the language model was going to say.

We suggest two kinds of Residual Stream Decoders that try to reveal to what degree the Language model is planning the upcoming paragraph (where ParaScope = "Paragraph Scope"): 

  1. The Continuation ParaScope is a straightforward approach that takes the residual stream from a "\n\n" token and uses it as context for continued generation, testing whether the model can recover the intended next paragraph.
  2. The AutoEncoder Map ParaScope takes a more sophisticated approach by mapping the residual stream to Text AutoEncoder Embeddings using a trained map (Linear or MLP), then decoding it back to text.

Both methods demonstrated performance comparable to having five tokens of "cheat" context, suggesting that language models seem to "plan" or "keep context of" at least that far in advance.

SONAR ParaScope and Continuation ParaScope showed different strengths in their performance. SONAR ParaScope performed better at maintaining broad subject matter (83.2% vs 52.5% similar field or better), while the Continuation ParaScope was typically more coherent, more often being at least partially coherent (96.7% vs 72.2%).

These results significantly outperformed the random baseline (which achieved only 5.0% broad subject alignment), while falling well short of the full-context regeneration baseline (which achieved 99.0% broad subject alignment).

Additionally, a (very brief) analysis of the model's layers seems to indicate that the middle layers (60-80% depth) had the strongest impact on generation. 

Lastly, we found some evidence that the models we looked at typically avoid thinking about the next paragraph until the last moment, with the "\n\n" token typically marking the beginning of next-paragraph information encoding.

We hope that this work can be expanded on to build a better understanding of the degree of internal planning in Language Models, and help us assess and monitor Language Model behaviour.

 

Long Version (30 minute version)

The Core Methods

ParaScope Illustration. We read the residual stream at the "\n\n" transition between some first paragraph and some second paragraph, and reconstruct what the language model was going to say.

We know transformers process text token-by-token, but I intuitively feel they must be doing some form of planning to maintain coherence. I wanted to test a specific hypothesis: that significant information about an upcoming paragraph is encoded in the residual stream of the newline token that precedes it. 

To investigate this:

  1. I generated a dataset of Language-Model-Generated text with relatively clear transitions. (This mostly just excludes "code").
  2. I tried a couple of ways of extracting information from the residual stream at the transition token.

I coin the following terms:

  1. Residual Stream Decoder: any method that tries to decode a model's residual stream activations back into natural-language text.
  2. ParaScope ("Paragraph Scope"): a Residual Stream Decoder that targets the upcoming paragraph.

Here I propose two simple ParaScope methods, detailed in more depth below:

1) Continuation ParaScope: Continuing generation with the original model, using only the residual stream of the token to be decoded as context.

2) Text Auto-Encoder Map ParaScope: Training a map from the residual stream to the embedding space vector generated by a Text Auto-Encoder [LW · GW] model (where the input to the Text AutoEncoder's Encoder is the desired paragraph), then decoding the embedding vector with the AutoEncoder's Decoder.

For the text auto-encoder, we use SONAR, as it seems to be the best from my brief analysis [LW · GW].

 

Models Used

The model I used for experiments was Llama-3.2-3B-Instruct. The small model makes it inexpensive to run for large numbers of samples. All generation with Llama uses temperature=0.3.

Additionally, I used the SONAR embedding models as the text auto-encoder.

I used all-mpnet-base-v2 as the text-embed comparison model.

 

Dataset Generation

Diagram explaining the dataset generation process. We take text from some original dataset, RedPajama's CommonCrawl, and ask the model to generate a prompt related to the text. We then use that generated prompt to generate an output with the model.

The key challenge was getting clean paragraph transitions that would let us study how models encode upcoming content. In previous attempts at "dataset creation", I often had issues with lack of diversity. My latest solution is a two-stage process:

First stage (prompt generation): take a snippet of text from the original dataset (RedPajama's CommonCrawl) and ask Llama-3.2-3B-Instruct to write a prompt inspired by that text. The original text is only used to seed diversity.

Second stage (content generation): feed the generated prompt back into the same model and have it generate a multi-paragraph response. These responses are the texts we study.

Why this approach: By getting text generated by the LLM we want to study, we have a guarantee that the model was actually pretty likely to generate the upcoming paragraph. If we used an existing dataset, then we would have much weaker guarantees.

In total, I generated 100,000 responses, which split into approximately 1M "paragraphs" when split by occurrences of "\n\n". I used a split of 99% train, 1% test.
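For concreteness, here is a rough sketch of what the two-stage loop could look like in code. The prompt wording is hypothetical (the actual prompts are in the repository), and this uses the standard transformers chat pipeline rather than the repo's exact setup:

```python
# Hypothetical sketch of the two-stage dataset generation; prompt wording is
# illustrative, not the exact prompts used for the real dataset.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.bfloat16,
)

def chat(user_message: str, max_new_tokens: int) -> str:
    out = generator(
        [{"role": "user", "content": user_message}],
        max_new_tokens=max_new_tokens, do_sample=True, temperature=0.3,
    )
    return out[0]["generated_text"][-1]["content"]  # the assistant reply

def generate_example(source_text: str) -> dict:
    # Stage 1: turn a CommonCrawl snippet into a writing prompt (for diversity).
    writing_prompt = chat(
        f"Write a short writing prompt inspired by the following text:\n\n{source_text[:2000]}",
        max_new_tokens=100,
    )
    # Stage 2: have the same model answer its own prompt; this is the text we study.
    response = chat(writing_prompt, max_new_tokens=600)
    return {"prompt": writing_prompt, "response": response,
            "paragraphs": response.split("\n\n")}
```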

For more details, see the GitHub Repository, or view the dataset on HuggingFace Hub.

 

ParaScopes:

We want to test the degree to which the residual stream has information about the whole upcoming paragraph, not just the next token. To do the residual stream decoding, we first collect the residual stream activations at the transition points. That is, we:

  1. Run the model over each generated response in the dataset.
  2. Find each "\n\n" token that separates two paragraphs.
  3. Save the residual stream activations at that token (at every layer), alongside the paragraph that follows it.

Our task now is to be able to translate each transition token's residual stream activations into the next paragraph of text. 
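As a concrete illustration, here is a minimal sketch of this collection step using the HuggingFace transformers API; the repository's pipeline may differ in the details:

```python
# Minimal sketch: collect residual stream activations at each "\n\n" token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def collect_transition_residuals(text: str):
    """Return (position, [n_layers+1, d_model] activations) for each "\n\n" token."""
    newline_id = tokenizer.encode("\n\n", add_special_tokens=False)[0]
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of (n_layers + 1) tensors of shape [1, seq, d_model]
    hidden = torch.stack(out.hidden_states, dim=0)[:, 0]  # [n_layers+1, seq, d_model]
    positions = (inputs["input_ids"][0] == newline_id).nonzero().flatten()
    return [(int(p), hidden[:, p, :].clone()) for p in positions]
```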

 

1. Continuation ParaScope

Continuation ParaScope. Use the original model to continue, but using only a single token's residual stream as context.

The Continuation ParaScope is probably the least complicated, has no hyper-parameters, and was what I used in my Extracting Paragraphs paper. The core idea: take the residual stream activations of the "\n\n" token from the original context, place them at the position of a "\n\n" token in an otherwise empty context (just <bos>\n\n), and let the model continue generating from there.

We then compare the "Continuation ParaScope" paragraphs against the baselines to see how well we managed to decode the residual stream.
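A minimal sketch of how this single-token transplant could be implemented with forward hooks, reusing `model` and `tokenizer` from the snippet above (the repository may differ in details such as exactly which layers get patched):

```python
# Minimal sketch of the Continuation ParaScope: patch the saved residual stream of a
# single "\n\n" token into a near-empty context, then let the model continue.
def continuation_parascope(saved_resid, max_new_tokens=100):
    """saved_resid: [n_layers+1, d_model] activations captured at a "\n\n" token."""
    prompt_ids = tokenizer("\n\n", return_tensors="pt").input_ids  # <bos> + "\n\n"
    patch_pos = prompt_ids.shape[1] - 1  # position of the "\n\n" token

    def make_hook(layer_idx):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            # Only patch during the prompt forward pass (when the "\n\n" position exists).
            if hidden.shape[1] > patch_pos:
                hidden[:, patch_pos, :] = saved_resid[layer_idx + 1].to(hidden.dtype)
            return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
        return hook

    hooks = [layer.register_forward_hook(make_hook(i))
             for i, layer in enumerate(model.model.layers)]
    try:
        out = model.generate(prompt_ids, max_new_tokens=max_new_tokens,
                             do_sample=True, temperature=0.3)
    finally:
        for h in hooks:
            h.remove()
    return tokenizer.decode(out[0, prompt_ids.shape[1]:], skip_special_tokens=True)
```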

 

2. Auto-Encoder Map ParaScope

SONAR Probe ParaScope. We train a Linear or MLP map to translate from residual stream to SONAR Embedding vector. We then decode the embedding vector with SONAR decoder. 

The AutoEncoder Map ParaScope consists of two stages:

  1. Train a map (Linear or MLP) from the residual stream activations at the "\n\n" token to the text auto-encoder's embedding of the paragraph that follows.
  2. At inference time, apply the map to the residual stream activations, and decode the predicted embedding vector back into text with the auto-encoder's decoder.

In this post, we use SONAR as the text auto-encoder. SONAR is a text-embed auto-encoder that was trained for purposes of machine translation. The key property it has over almost all other text-embed models, is that it comes not only with an encoder, but also a decoder.

Thus, one can pass in a short passage, encode into a vector of size 1024, then later decode that vector to get a passage that should be an almost-exact match to the original passage. Read this post on Text AutoEncoders [LW · GW] if you want more details.
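For reference, a round trip through SONAR looks roughly like this, assuming the sonar-space package and its standard public model cards:

```python
# Minimal sketch of the SONAR encode/decode round trip.
from sonar.inference_pipelines.text import (
    TextToEmbeddingModelPipeline,
    EmbeddingToTextModelPipeline,
)

t2vec = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")
vec2t = EmbeddingToTextModelPipeline(
    decoder="text_sonar_basic_decoder", tokenizer="text_sonar_basic_encoder")

paragraph = ["Black seed oil is a rich source of essential fatty acids."]
emb = t2vec.predict(paragraph, source_lang="eng_Latn")   # shape [1, 1024]
roundtrip = vec2t.predict(emb, target_lang="eng_Latn")   # ≈ the original paragraph
print(roundtrip[0])
```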

 

 

As we will mostly stick to using SONAR, we will mostly refer to the AutoEncoder Map ParaScopes as SONAR ParaScopes.

For training the residual-to-sonar map, we:

  1. Collect pairs of (residual stream activations at a "\n\n" token, SONAR embedding of the paragraph that follows) from the training split.
  2. Normalize the residual activations and the SONAR embeddings.
  3. Train the map to predict the SONAR embedding from the residuals, using MSE loss. (A rough sketch of both maps and the training loop is given after the two descriptions below.)

We train two types of maps: 

Linear SONAR ParaScope

1) The Linear map takes the normalized residuals of the last 50% of residual diffs (the later layers), and outputs a normalized sonar vector, but basically: Normalize( n_layers * d_model ) → d_sonar = 1024

MLP SONAR ParaScope

2) The MLP map additionally has two hidden layers of size 8192. See Appendix[1] for hyper-parameters, but basically: Normalize( n_layers * d_model ) → GeLU( 8,192 ) → GeLU( 8,192 ) → 1,024
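Below is a rough sketch of what the two maps and their training loop might look like. The exact normalization scheme, data loading, and details differ in the real code (see the appendix and repository); the dimensions and hyper-parameters here follow the appendix, and LayerNorm is used as a stand-in for the normalization:

```python
# Sketch of the residual-to-SONAR maps; hyper-parameters follow the appendix,
# normalization and data loading are simplified stand-ins.
import torch
import torch.nn as nn

D_RESID = 61_440   # concatenated residual diffs from the later layers
D_SONAR = 1_024

class LinearSonarMap(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(D_RESID)
        self.proj = nn.Linear(D_RESID, D_SONAR)

    def forward(self, resid):
        return self.proj(self.norm(resid))

class MLPSonarMap(nn.Module):
    def __init__(self, d_hidden=8_192):
        super().__init__()
        self.norm = nn.LayerNorm(D_RESID)
        self.net = nn.Sequential(
            nn.Linear(D_RESID, d_hidden), nn.GELU(), nn.Dropout(0.05),
            nn.Linear(d_hidden, d_hidden), nn.GELU(), nn.Dropout(0.05),
            nn.Linear(d_hidden, D_SONAR),
        )

    def forward(self, resid):
        return self.net(self.norm(resid))

def train_map(sonar_map, loader, epochs=10, lr=2e-5, weight_decay=1e-7):
    """Regress residuals onto the (normalized) SONAR embedding of the next paragraph."""
    opt = torch.optim.AdamW(sonar_map.parameters(), lr=lr, weight_decay=weight_decay)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.8)
    for _ in range(epochs):
        for resid, sonar_target in loader:   # pairs collected at "\n\n" tokens
            loss = nn.functional.mse_loss(sonar_map(resid), sonar_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return sonar_map
```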

 

Evaluation

Baselines

In order to see how good our ParaScope probing methods are, we need some baselines to compare against. We compare against a few baselines:

Neutral Baseline / Random Baseline

For this, we simply generate with the model given a blank <bos>\n\n context, with no interventions. This should work pretty poorly, but gives a lower bound that our methods should beat.

Cheat-K Baseline

For this, we let the model do the baseline generation, but with a few tokens of the upcoming section revealed as context. For example, if the original section was ** Simple Recipe for an Organic One-Pot Pumpkin Soup **, for cheat-1 we would input <bos>\n\n**, and for cheat-5 we would input <bos>\n\n** Simple Recipe for an

For the sake of "fairness" of comparison, I filtered out examples where more than 50% of the text was exposed (that means, for cheat-10, the text must be at least 20 tokens long).

We use cheat-1, cheat-5, and cheat-10 as our comparisons.

Regeneration

Another thing we can compare against is taking the whole previous context (prompt + previously generated sections) and using the model to generate what could have come next again (again, with temperature=0.3). This gives us the real distribution, and is what we want to match.

Auto-Decoded.

Finally, another method for comparison is to take the reference text, encode it using SONAR, and then decode it again using SONAR. As SONAR is not perfectly deterministic, it should give some idea of what the limitations might be.

 

Results of Evaluation of ParaScopes

There were a few main ways of evaluating the ParaScopes of the residual stream:

  1. Cosine similarity between text embeddings of the decoded and reference paragraphs (plus BLEURT/BLEU/ROUGE).
  2. Rubric scoring of different aspects of the text with an LLM (GPT-4o-mini).

Each has its pros and cons. None of them really gives a complete picture, but I think the LLM rubric scoring works the best of them.

 

Scoring with Cosine Similarity using Text-Embed models

The simplest way to compare the texts is to use a text-embedding model, and compare the embedding vectors using cosine similarity against the reference text (we used all-mpnet-base-v2, a relatively small text-embedding model).
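Concretely, the scoring is roughly the following (a sketch using the sentence-transformers package; names are illustrative):

```python
# Sketch of the cosine-similarity scoring with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-mpnet-base-v2")

def embed_score(decoded_text: str, reference_text: str) -> float:
    vecs = embedder.encode([decoded_text, reference_text], convert_to_tensor=True)
    return util.cos_sim(vecs[0], vecs[1]).item()
```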

Here, the Linear and MLP ParaScope models seem to work almost identically well, with perhaps a tiny edge to the MLP model. Both seem better than the Continuation ParaScope by about ~0.2 in cosine similarity on average. They seem comparable to the "cheat-5" baseline.

Comparison of Cosine Similarity between different methods and baselines with the original reference generated text.
|  | mlp | linear | cont. | base. | c-1 | c-5 | c-10 | regen. | auto. |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 0.51 | 0.50 | 0.33 | 0.08 | 0.12 | 0.44 | 0.63 | 0.82 | 0.95 |
| StDev | ±0.20 | ±0.20 | ±0.20 | ±0.08 | ±0.14 | ±0.23 | ±0.20 | ±0.19 | ±0.13 |

This is fine and all, but I don't find these scores particularly easy to interpret. It seems like maybe ParaScope Linear/MLP are better, but it is hard to tell from cosine similarity alone.

 

Additionally, I try the BLEURT score, which is also an imperfect metric, but better than the BLEU or ROUGE scores (see Appendix for those[3]). It still shows results mostly consistent with "around the same level as cheat-5, worse than regeneration". We see regeneration suffers here more compared to auto-decoded. I still think these results are not that meaningful.

Comparison of BLEURT scores compared to the original reference generation.

| Metric | mlp | linear | cont. | base | c-1 | c-5 | c-10 | regen. | auto |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 0.404 | 0.395 | 0.364 | 0.192 | 0.263 | 0.396 | 0.477 | 0.619 | 0.825 |
| StdDev | ±0.144 | ±0.138 | ±0.135 | ±0.094 | ±0.107 | ±0.102 | ±0.093 | ±0.216 | ±0.145 |

 

Rubric Scoring

To do a more fine-grained evaluation, I also scored the outputs using GPT-4o-mini with a rubric for different aspects of the text. This procedure was somewhat tuned until it was mostly consistent at giving scores that match my intuitions.

In particular, one thing that helped get reliable scoring was to ask the model to first give brief reasoning about the scoring, in a format like: "[reasoning about text], Thus: [explicit name of score which applies]: Thus: out of [max score], score X".
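As an illustration, the scoring call looks roughly like this; the rubric text below is a hypothetical abbreviation of the real one (which is in the repository), and the parsing of the reply into a numeric score happens downstream:

```python
# Illustrative sketch of the rubric-scoring call; the rubric wording is hypothetical.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Compare the CANDIDATE paragraph to the REFERENCE paragraph.
For 'subject', answer in the format:
[brief reasoning], Thus: [name of the score which applies]: Thus: out of 4, score X
Scores: -1 no subjects, 0 unrelated, 1 similar field, 2 related domain,
3 same subject, 4 identical focus."""

def rubric_score(candidate: str, reference: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"REFERENCE:\n{reference}\n\nCANDIDATE:\n{candidate}"},
        ],
        temperature=0.0,
    )
    return resp.choices[0].message.content  # parsed into a numeric score downstream
```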

A TLDR of the axes which showed the most interesting results: coherence, subject maintenance, entity preservation, and detail preservation (each covered below).

For a full breakdown of the rubric, as well as additional metrics collected, see the github.

What we mostly see is that the MLP and Linear SONAR ParaScope methods work almost identically well, and that there are different tradeoffs between the SONAR ParaScopes and the Continuation ParaScope.

 

Coherence Comparison

Coherence of text. 0: Incoherent (nonsense/repetition) | 1: Partially coherent (formatting issues) | 2: Mostly coherent (minor errors) | 3: Perfect flow (logical progression)

SONAR ParaScope: Achieves 72.2% partial coherence (Score 1+) but only 1.9% mostly coherent (Score 2), performing just slightly above cheat-1 baseline (68.4% Score 1+) and far below other baselines.

Continuation ParaScope: Reaches 96.7% partial coherence (Score 1+) and 53.0% mostly coherent (Score 2), performing similarly to cheat-10 (98.9% Score 1+, 50.2% Score 2).

The regenerated baseline achieves 98.3% partial coherence and 81.2% mostly coherent, showing room for improvement particularly at higher coherence levels.

 

Subject Maintenance

-1: No subjects | 0: Unrelated ("law" vs "physics") | 1: Similar field (biology/physics) | 2: Related domain (history/archaeology) | 3: Same subject (ancient mayans) | 4: Identical focus (mayan architecture)

SONAR ParaScope: Shows strong topic relevance with 83.2% similar field or better (Score 1+) and 78.3% related domain or better (Score 2+), outperforming even cheat-10 (84.0% Score 1+, 51.3% Score 2+).

Continuation ParaScope: Achieves 52.5% similar field or better (Score 1+) and 42.7% related domain or better (Score 2+), performing slightly below cheat-5 (62.2% Score 1+, 55.3% Score 2+).

The regenerated baseline almost always achieves similar field (99.0% Score 1+) and related domain (98.65% Score 2+), indicating there is significant room for improvement, though the ParaScopes still significantly outperform the random baseline, which only gets 5.0% Score 1+.

 

Entity Preservation

-1: No entities | 0: Unrelated (Norway/smartphone) | 1: Similar category (countries/people) | 2: Related entities (similar countries) | 3: Partial match (some identical) | 4: Complete match (all key entities identical)

SONAR ParaScope: Maintains 57.1% similar category or better (Score 1+) and 19.7% related entities or better (Score 2+), performing between cheat-5 (44.4% Score 1+) and cheat-10 (75.5% Score 1+).

Continuation ParaScope: Shows 29.7% similar category or better (Score 1+) and 6.4% related entities or better (Score 2+), falling well below cheat-5 performance (44.4% Score 1+).

 

Detail Preservation

-1: No details | 0: Different details | 1: Basic overlap ("anti-inflammatory") | 2: Moderate match (benefits + facts) | 3: Specific match (exact percentages)

SONAR ParaScope: Reaches 59.2% basic overlap or better (Score 1+), similar to cheat-5 (54.9%), but drops to 3.6% moderate match or better (Score 2+), well below cheat-5 (17.2%).

Continuation ParaScope: Achieves 52.3% basic overlap or better (Score 1+) and 15.4% moderate match or better (Score 2+), performing close to cheat-5 levels (54.9% and 17.2%).

 

 

Key Insights from Scoring

The methods show clear complementary strengths: SONAR better preserves broad semantic relationships despite training mismatch, while continuation maintains better coherence and specific details. Performance against baselines reveals significant room for improvement, particularly in entity and detail preservation.

Overall, the ParaScopes seem to be similar in quality to cheat-5 generations.

SONAR ParaScope coherence issues likely stem from its training objective (MSE) and could be improved through better loss functions (using CE Loss of decoded outputs) or fine-tuning SONAR to better handle the noise we have here. There are likely also other improvements one could do for the residual-to-sonar maps.

Continuation ParaScope issues seem to mostly stem from not being able to hold onto the information that it has in its context. It's possible that re-scaling the transferred activations to be more dominant, or allowing multiple generations and doing some kind of "best of" selection, would show better results.


Other Evaluations

Some questions remain. In particular, one question to try to understand is how far ahead this "context building" or "planning" might be occurring. I try to find out by testing the Continuation ParaScope on different tokens close to the transition token "\n\n", and also try to see in which layers of the model this "planning" might be occurring.

To do these more specific evaluations, I used a narrower dataset, as was used in the original "Extracting Paragraphs" paper. That is, we had 20 types of outputs, which were specifically of the form "Write a paragraph about X, then write a paragraph about Y. Do not mention X or Y explicitly.". This makes it easier to control that the next paragraph is different from the previous one, and also prevents the model from over-fitting on the first token of the output typically naming X or Y.

Additionally, I ran the tests on Gemma-2-9B-it instead of Llama-3.2-3B-Instruct. This is because the smaller 3B model usually struggled to do the task of talking about X without naming X.

 

Which layers contribute the most?

Illustration of Layer Scrubbing Method to determine contribution of different layers

I try doing a basic form of "Layer Scrubbing" to find what layers are used. For this part, I only used a single comparison example, A vs B, where A and B are two prompts whose residual streams lead to different outputs.

The "Layer Scrubbing" then involved replacing the Whole residual stream activations at different layers (not just the differences):

I found that generating a single output was quite noisy, and so to ensure that we are measuring "information available" rather than "probing quality", I ran each continuation 64 times, and got the best result when comparing against A and against B.
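My reading of this procedure, as a sketch built on the Continuation ParaScope snippet above (the names and the exact mixing details are my own reconstruction, not the repository's code):

```python
# Sketch of "layer scrubbing": build a mixed residual stream for the "\n\n" token,
# with the embedding output plus layers [0, cutoff_k) taken from prompt A and the
# remaining layers from prompt B, then decode it with the Continuation ParaScope.
def layer_scrub(resid_A, resid_B, cutoff_k, n_samples=64):
    """resid_A, resid_B: [n_layers+1, d_model] activations at each prompt's "\n\n" token."""
    mixed = resid_B.clone()
    mixed[: cutoff_k + 1] = resid_A[: cutoff_k + 1]  # index 0 is the embedding output
    # Generate several continuations to reduce sampling noise, as in the post.
    return [continuation_parascope(mixed) for _ in range(n_samples)]
```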

Below, we see the results, comparing the cosine similarity of the outputs with different layer scrubbing:

"Layer Scrubbing", comparison of best-case generation similarity between outputs when layers [0-index] are from one prompt residual stream, and [index-n_layers] are from the other prompt residual stream.

From this, we can see that the first 25 layers out of 42 have little effect on generation, whereas the layers between 25 and 35 make a huge difference to the output. We also see a significant jump at the final layer: when this is replaced, the "output embeds" get replaced and the first generated token changes.

I will need to more thoroughly test this by using more than a single pair of examples, but overall, this seems somewhat in line with other research showing middle layers having more of the impact on "high level" thinking. 

 

Is the"\n\n" token unique? Or do all the tokens contain future contextual information?

We use the Continuation ParaScope on the residual streams of various tokens near the transition, to see how much of the upcoming context each contains.

How much do different tokens include information about the upcoming paragraph? How does it differ before vs after the paragraph begins generating? We try using the "Continuation ParaScope" on the residual streams of tokens within a range of the transition "\n\n" token. More specifically, the procedure is: for each token position in a window around the "\n\n" token, take that token's residual stream activations, run the Continuation ParaScope on them, and score the generated text's cosine similarity against both the first paragraph and the second paragraph.

 

Here is an example between two specific texts (blockchain vs ancient mayan architecture):

We can see that before the "\n\n" token, the model basically never outputs anything even vaguely similar to "ancient Mayan architecture", and often outputs things that are somewhat similar to the blockchain paragraph. After the newline, it consistently outputs things that are judged to be very similar to ancient Mayan architecture and dissimilar to blockchain.

 

How does this fare if we average across the 20 text examples that we have? We see the results below:

This seems to indicate that the "\n\n" token is when the model starts thinking about the next paragraph, but not when it stops. The similarity to topic 1 is not that high, even before the "\n\n", likely indicating that the metric against topic 1 is poor, but the results for topic 2 show a clear transition.

From these basic experiments, it seems that it is not exclusively the "\n\n" token that has enough contextual information to get a good grasp of the output. It seems more like the "\n\n" token is simply the first of many tokens to contain information about the upcoming paragraph. Tokens before it seem to contain very little, whereas tokens after it seem to contain the same or more information about the paragraph.

This suggests it may be challenging to observe what future outputs might look like before we actually reach them.

 

Manipulating the residual stream by replacing "\n\n" tokens.

We can also try to see how the model deals with scenarios where we insert a "\n\n" token randomly before the end of the generation. That is, consider we have a text with two paragraphs, that start and end like so:

We can instead insert a "\n\n" at some point mid sentence. For example:

One might expect two main things to happen. Either the model notices the early "\n\n" and tries to recover, getting back to the original text and continuing it. Alternatively, the model might think that the paragraph is done, and continue on as if starting the next one.

If we look at a single text, we can see that it is very token dependent what the model decides to focus on internally.

How well does transfer generation work if we replace the token we get the residual stream from with "\n\n" at different positions? We see that it depends on the token.

 

If we average across 20 different texts, we can see that the "highest" similarity is near the transition: both just before and just after the original position of the double-newline token, the model begins generating what it was originally going to generate for the second paragraph, but further away it errs towards something else.

Averaging the "replace token with \n\n before doing Continuation ParaScope" across 20 texts, we see that the cosine similarity to paragraph 2 peaks at token = 0, with decline either side of \n\n. 

I will likely need better metrics before I can say more specifically what is going on, since it may just be an oddity of the cosine similarity metric. Overall, we can see that the "\n\n" token does have some significance, but it is not uniquely significant, and the results are somewhat noisy.


Further Analysis of SONAR ParaScopes

An additional concern might be: how robust are these SONAR ParaScopes? What layers do they use for predictions? Might the SONAR ParaScopes be overfitting?

 

Which layers do SONAR Maps pay attention to?

I try splitting the Linear SONAR ParaScope's weight matrix into per-layer blocks, and take the Frobenius norm of each resulting layer matrix. This should give some information about which layers the linear map finds most useful.

Frobenius Norm of Linear Map weights, when we split the contributions by residual stream layer. We see that weights mapping from Attention Out to SONAR are slightly higher than weights mapping from MLP Out to SONAR 

We see that, in general, the SONAR maps tend to put more emphasis on the Attention layers over the MLP layers. The difference is noticeable but not huge (3.56 Attention vs 3.39 MLP in layer 16). 

I suspect having better weight decay or something similar would probably result in more fine-grained predictions, or possibly there are better metrics than just Frobenius norm.
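As a sketch of the norm computation described above (reusing the LinearSonarMap layout from the earlier snippet; the reshape assumes the residual features are concatenated block-by-block per layer):

```python
# Sketch: split the linear map's weights into per-layer blocks and take Frobenius norms.
import torch

def per_layer_frobenius(linear_map, n_blocks: int, d_model: int = 3072):
    W = linear_map.proj.weight                       # [d_sonar, n_blocks * d_model]
    blocks = W.reshape(W.shape[0], n_blocks, d_model)
    return torch.linalg.norm(blocks, dim=(0, 2))     # one Frobenius norm per block
```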

 

Quality of Scoring - Correlational Analysis

To what degree are the different Residual Stream Decoders, SONAR ParaScope and Continuation ParaScope, correlated or uncorrelated at getting the right answer? There are various correlation metrics (Pearson, Spearman, Kendall-Tau), each with their uses, but they mostly gave similar results; I show the Kendall correlation below.
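The correlation computation itself is straightforward; a sketch with scipy (the score dictionaries here are illustrative):

```python
# Sketch: correlate per-example scores between two methods.
from scipy.stats import kendalltau

def score_correlation(scores_a, scores_b) -> float:
    tau, _p_value = kendalltau(scores_a, scores_b)
    return tau

# e.g. score_correlation(subject_scores["mlp"], subject_scores["linear"])
```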

How correlated is the same score for different methods?

The simplest "score" we can correlate things between, is to compare lengths of generation. We also compare against scoring along "subject match". We note that for the methods that are any good, the lengths are quite correlated. This at least shows things are OK.

We can generally see that the MLP and Linear residual-to-sonar maps have very correlated scoring, more so than most other pairs of methods. For other scoring metrics, such as subject match, the correlations are much lower: the MLP and Linear maps still have high correlation, but other methods not so much. In particular, regenerated and auto-decoded have low correlation, likely because they both score almost perfectly almost all the time.

 

How correlated are different scores for the same method?

We also check whether the different scores are capturing different things, and whether there are correlated failures that do not exist in the underlying data.

Spearman correlation between different metrics for Linear ParaScope, Continuation ParaScope, and Full-context Regenerated Baseline.  

It seems like, overall, the Linear map has higher inter-score correlations, likely because there are more cases where the output is completely incoherent, and thus all the scores drop together. The score correlations seem similar-ish between the Continuation ParaScope and the Regenerated baseline.

 

To me, these results seem to show that the Continuation ParaScope and SONAR ParaScope work pretty independently of each other, and we also see that the SONAR ParaScopes seem to mostly fail due to many incoherent outputs.


Discussion and Limitations

Overall, we can see there is clear evidence that the model, Llama-3.2-3B-Instruct, is doing at least some "planning" for the immediately upcoming text, and this makes sense. The two Residual Stream Decoders that we propose: Continuation ParaScope and SONAR ParaScope, both seem to be able to sometimes, but not always, capture what the upcoming paragraph is likely to look like. 

The simplicity of these methods suggests that they are not likely to be over-fitting on the residual stream data, showing a proof of concept, and additional optimization would likely lead to far better Residual Stream Decoders.

There is much work that could be done, both for immediate ParaScope Residual Stream Decoders, and for longer-horizon Residual Stream Decoders.

Potential future work on ParaScopes:

Future work on broader Residual Stream Decoders:

Overall, I am hopeful that there are some interesting results to be had via the Residual Stream Decoders approach, which could be used for better testing and monitoring AI systems in the future.

If you think this work is interesting, and would like to learn more or potentially collaborate, feel free to reach out by messaging me on LessWrong or scheduling a short call.  

Acknowledgements 

 Thanks to my various collaborators: Eloise Benito-Rodriguez, Angelo Benoit, Lovkush Agarwal, Zainab Ali Majid, Lucile Ter-Minassian, Einar Urdshals [LW · GW], Jasmina Urdshals [LW · GW], Mikolaj

Appendix

  1. ^

    HyperParameters for SONAR Maps

    For the MLP and Linear SONAR Maps, I did a quick sweep over hyper-parameters with wandb. I found that skipping the first half of layers (0-15) and only using the second half (layers 16-27) worked almost identically well as the full model, or slightly better, but trained a good bit faster. This is in line with the section above: "Which layers contribute the most?"

    Linear Model

    Normalize( 61,440 ) → 1,024

    Train on a batch size of 1,024. Use the last 50% of residual diff layer activations. The learning rate is set to 2e-5, learning rate decay of 0.8 per epoch for 10 epochs. Weight decay of 1e-7 also applied. Achieved a training loss of 0.750 and validation loss of 0.784. 

    MLP Model

    Normalize( 61,440 ) → GeLU( 8,192 ) → GeLU( 8,192 ) → 1,024

    Train on a batch size of 1,024 with two MLP hidden layers of dimension 8,192. The learning rate is set to 2e-5, with a decay of 0.8 per epoch over 10 epochs. A higher weight decay of 2e-5 and a dropout of 0.05 are applied for regularization. The model achieved a training loss of 0.724 and a validation loss of 0.758, showing consistent performance with slight overfitting.

  2. ^

    ParaScope Output Comparisons

    Original: Black seed oil is a rich source of essential fatty acids, particularly oleic acid, linoleic acid, and alpha-linolenic acid. The oil is also a good source of vitamins, minerals, and antioxidants. Accor...
    Linear: Black pepper oil mixture is a nutrient rich oil of black pepper and acetaldehyde, which contains an average of 50 grams of vitamin A. The black pepper oil mixture is characterized by oil richness and...
    Original: * **Rocketman Sing-A-Long** - July 12th at 7pm and 9pm at Regal Cinemas and AMC Theatres Join us for a special sing-along screening of the critically-acclaimed biopic Rocketman, starring Taron Egerton
    Auto-decoded:* "Rockin' Man" 12:00pm - 10:30pm * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
    Original: According to Cawley, the outage was caused by a combination of factors, including worn-out equipment and a power surge that affected multiple substations in the area. "Our team worked diligently to re...
    Auto-decoded:According to Cawley, the failure was the result of a combination of equipment, including obsolete equipment and overloading of the area with multiple power stations. "Our team worked diligently to con...
    Original: At just 21 years old, Sigrid Lid Æ Pain, known professionally as Sigrid, has already made a significant impact on the music industry. The Norwegian pop sensation's rise to fame has been nothing short...
    Auto-decoded:With only 21 years old, Sigrid Liddig Seer, known professionally as Sigrid, has already had a significant impact on the music industry. The rise of the Norwegian pop sensation has been nothing short o...

    I don't want to clutter the post by adding a bunch of examples here, so see the GitHub for more.

  3. ^

    BLEU and ROUGE scores

    Here are the BLEU and ROUGE scores (and similar). The task here is not really translation, and I wanted to have more room for fuzzy scoring, so I didn't think these would be super relevant here. Regardless, they are considered standard, so I include them.

    | Method | SacreBLEU (Mean ± StDev) | ROUGE-1 (F1) | ROUGE-2 (Precision) | ROUGE-L (F1) |
    |---|---|---|---|---|
    | mlp | 2.92 ± 2.81 | 0.27 ± 0.12 | 0.07 ± 0.07 | 0.19 ± 0.08 |
    | linear | 2.81 ± 2.82 | 0.27 ± 0.12 | 0.07 ± 0.07 | 0.19 ± 0.08 |
    | continued | 4.42 ± 4.92 | 0.27 ± 0.10 | 0.06 ± 0.07 | 0.18 ± 0.07 |
    | baseline | 0.60 ± 0.67 | 0.10 ± 0.08 | 0.01 ± 0.02 | 0.07 ± 0.06 |
    | cheat-1 | 1.41 ± 1.53 | 0.16 ± 0.08 | 0.02 ± 0.04 | 0.12 ± 0.06 |
    | cheat-5 | 7.58 ± 5.35 | 0.29 ± 0.09 | 0.14 ± 0.14 | 0.23 ± 0.07 |
    | cheat-10 | 15.91 ± 8.57 | 0.38 ± 0.10 | 0.25 ± 0.15 | 0.32 ± 0.09 |
    | regenerated | 22.65 ± 19.64 | 0.50 ± 0.17 | 0.28 ± 0.21 | 0.38 ± 0.18 |
    | auto-decoded | 58.27 ± 24.32 | 0.81 ± 0.18 | 0.69 ± 0.21 | 0.78 ± 0.20 |
