Mechanistic Exploration of Gemma 2 List Generation

post by Gerard Boxo (gerard-boxo) · 2024-10-14T17:04:57.010Z

This is a link post for https://gboxo.github.io/2024/10/03/Gemma2-Lists.html

Contents

  Summary
      Preface
  Data Generation
      We ask GPT-4o to provide a list of topics to create lists about.
      We create several templates for Gemma to generate lists:
      For each topic, we sample 5 Gemma Completions with top-k = 0.9 and temperature=0.8
    The mysterious filler token 
    Metric 
    Attribution of the Proxy Metric in the Feature Basis
    Visualization of Features
  Causal Ablation Experiments
      Method
      Metrics
    Results
    Conclusions and Future Directions
    Future Work

Summary

Preface

This post is a follow-up to a 10-hour research sprint for a failed MATS application; the structure and methods are similar, but the content has been substantially expanded.

 

In this post, we explore the mechanisms behind Gemma 2 2B's ability to create lists of items when prompted to do so.

Specifically, we are interested in the mechanism by which Gemma knows when to end a list.

To do so, we leverage SAEs (sparse autoencoders) together with techniques like gradient-based attribution and causal ablation.

 

 

Data Generation

To investigate the behavior of Gemma when asked for a list, we create a synthetic dataset of model responses to several templates.

We ask GPT-4o to provide a list of topics to create lists about.

This results in 23 final topics that we ask Gemma 2 to create lists about.

Some examples of these are: Vegetables, Countries, Movies, Colors, etc.

We create several templates for Gemma to generate lists:

Template 1 (Base)

Provide me a with short list of {topic}. Just provide the names, no need for any other information.

Template 2 (Contrastive)

Provide me a with long list of {topic}. Just provide the names, no need for any other information.

Template 3

Provide me a with list of {topic}. Just provide the names, no need for any other information.

Unless otherwise indicated, all the analysis and exploration were done with Template 1.

For each topic, we sample 5 Gemma completions with top-k = 0.9 and temperature = 0.8.
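Below is a minimal sketch of this generation step using Hugging Face transformers. The checkpoint name (google/gemma-2-2b-it), max_new_tokens, and the use of top_p = 0.9 (the sampling parameter reported above as "top-k = 0.9") are assumptions for illustration, not details taken from the original write-up.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint: the instruction-tuned Gemma 2 2B model.
model_id = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Template 1 (Base), filled with a few of the 23 topics.
template = "Provide me a with short list of {topic}. Just provide the names, no need for any other information."
topics = ["Vegetables", "Countries", "Movies", "Colors"]

completions = {}
for topic in topics:
    chat = [{"role": "user", "content": template.format(topic=topic)}]
    input_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)
    outputs = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,               # assumed; the post reports "top-k = 0.9"
        num_return_sequences=5,  # 5 completions per topic
        max_new_tokens=128,      # assumed cap on the length of the generated list
    )
    completions[topic] = [
        tokenizer.decode(seq[input_ids.shape[1]:], skip_special_tokens=True) for seq in outputs
    ]
```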

 

The generated outputs look like the following, colored by token/position type.

[Figure: structure of a generated list, with tokens colored by token/position type (Entropy, Item, Last Token, Long, Clean); some of the tokens are enclosed in tags for clarity.]

The mysterious filler token 

By far the most interesting behavior that we have observed across different topics and temperatures is the model's habit of including blank tokens near the end of the list. The model includes whitespace tokens at the very last item position (as in the Celery example), or in some cases at the last few token positions.

Metric 

We use a proxy metric to capture the "list-ending behavior": concretely, the logit difference between the blank token and the line-break token at the last token position of the last item in the list (the Celery position in the example above).

Full explanation in the link post.
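As a rough illustration, the proxy metric could be computed as follows. This is a sketch assuming a Hugging Face model and tokenizer for Gemma 2 2B; the helper name and the exact token ids used for the blank and line-break tokens are assumptions.

```python
import torch

def list_ending_metric(model, tokenizer, prompt_ids):
    """Logit difference between the blank token and the line-break token.

    `prompt_ids` is assumed to end at the last token of the last list item,
    so the final position is the one where the list-ending decision is made.
    """
    blank_id = tokenizer.encode(" ", add_special_tokens=False)[0]     # assumed id of the blank token
    newline_id = tokenizer.encode("\n", add_special_tokens=False)[0]  # assumed id of the line-break token
    with torch.no_grad():
        logits = model(prompt_ids).logits   # (batch, seq_len, vocab)
    last = logits[0, -1]                    # logits at the last item's final token position
    return (last[blank_id] - last[newline_id]).item()
```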

 

For this investigation we selected 5 layers for residual-stream (RS) SAEs and 5 layers for attention (Attn) SAEs.

| SAE  | Layers           |
|------|------------------|
| Attn | 2, 7, 14, 18, 22 |
| RS   | 0, 5, 10, 15, 20 |

All the SAEs correspond to the 16k-width series and have the same sparsity as the ones available on Neuronpedia.

All the experiments with SAEs are done on a one-by-one basis (a single SAE at a time).
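For concreteness, the SAEs could be loaded with SAE Lens roughly as follows. The release and SAE-id strings (Gemma Scope, 16k width, canonical sparsity) are assumptions and should be checked against the SAE Lens pretrained directory / Neuronpedia.

```python
from sae_lens import SAE

res_layers = [0, 5, 10, 15, 20]
attn_layers = [2, 7, 14, 18, 22]

# SAE.from_pretrained returns (sae, cfg_dict, sparsity); we keep only the SAE.
res_saes = {
    layer: SAE.from_pretrained(
        release="gemma-scope-2b-pt-res-canonical",     # assumed release name
        sae_id=f"layer_{layer}/width_16k/canonical",   # assumed id format
    )[0]
    for layer in res_layers
}
attn_saes = {
    layer: SAE.from_pretrained(
        release="gemma-scope-2b-pt-att-canonical",     # assumed release name
        sae_id=f"layer_{layer}/width_16k/canonical",
    )[0]
    for layer in attn_layers
}
```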

Attribution of the Proxy Metric in the Feature Basis

Given the vast number of features that activate across the dataset, it is important to pin down the features most relevant to the behavior we are investigating; to do so, we use attribution techniques.

Concretely, we want to obtain the most important features that explain a given metric. This metric should reflect the behavior we are interested in.

In this case, given that we are interested in the "list-ending behavior", a suitable metric is the logit difference between the blank token and the line-break token, at the position just before the end of the list.

We run attribution experiments for all the attention and RS SAEs (one by one) across the whole dataset, and keep the top (position, feature) tuples.
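A minimal sketch of the attribution step is given below, assuming a TransformerLens HookedTransformer for Gemma 2 2B and an SAE Lens sae with encode/decode methods and a cfg.hook_name; the first-order attribution of each (position, feature) pair is taken as activation × gradient of the metric with respect to that feature activation.

```python
import torch

def feature_attribution(model, sae, tokens, blank_id, newline_id):
    """Linear attribution of the list-ending metric to each (position, feature) pair."""
    # Cache the activation the SAE reads from, then re-encode it into SAE features.
    _, cache = model.run_with_cache(tokens)
    acts = cache[sae.cfg.hook_name]                                # (batch, pos, d_in)
    feature_acts = sae.encode(acts).detach().requires_grad_(True)  # (batch, pos, d_sae)
    recon = sae.decode(feature_acts)

    # Splice the SAE reconstruction back into the forward pass and recompute the metric.
    def splice_hook(value, hook):
        return recon

    logits = model.run_with_hooks(tokens, fwd_hooks=[(sae.cfg.hook_name, splice_hook)])
    metric = logits[0, -1, blank_id] - logits[0, -1, newline_id]
    metric.backward()

    # attribution(pos, feature) ≈ feature activation * d(metric)/d(feature activation)
    return (feature_acts * feature_acts.grad).detach()[0]          # (pos, d_sae)
```

The top (position, feature) tuples can then be taken from the largest entries of this matrix, e.g. with torch.topk on its flattened absolute values.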

Visualization of Features

We introduce a kind of plot that resembles gene-expression matrices, which enables us to detect local communities of features that are especially present in a subset of examples.

 

Feature Expression for important features in RS 5, across the whole dataset.

  • In the bottom-left corner, we can see a local community of features.
  • The large horizontal band corresponds to a subset of features that are important across the whole dataset.
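A minimal sketch of such a feature-expression plot is below; the hierarchical-clustering row/column ordering used to make local communities stand out is an assumption about how the figure was produced, and the array shape is a placeholder.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list

# Placeholder: per-example attribution scores for the top features kept earlier,
# shape (n_examples, n_features). In practice this would come from feature_attribution.
rng = np.random.default_rng(0)
expression = rng.random((115, 40))   # e.g. 23 topics x 5 completions, 40 features

# Reorder rows (examples) and columns (features) by hierarchical clustering so that
# local communities appear as contiguous blocks, as in gene-expression matrices.
row_order = leaves_list(linkage(expression, method="average"))
col_order = leaves_list(linkage(expression.T, method="average"))

plt.imshow(expression[np.ix_(row_order, col_order)], aspect="auto", cmap="viridis")
plt.xlabel("feature")
plt.ylabel("example")
plt.colorbar(label="attribution / activation")
plt.show()
```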

 

Causal Ablation Experiments

Experiments must be performed to empirically test whether the top features affect the behavior of interest in the expected way.

To this end, we performed causal ablation experiments that consisted of setting the top 5 most important (position, feature) pairs to 0 and then generating completions. This allows us to investigate whether the list-ending behavior is maintained.
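A minimal sketch of the ablated generation, again assuming a TransformerLens HookedTransformer and an SAE Lens sae. The hook zeroes the selected (position, feature) pairs in the SAE basis and adds back only the resulting change, so the SAE's reconstruction error is left untouched; that last detail is a design choice in this sketch, not necessarily the original setup.

```python
def make_ablation_hook(sae, top_pairs):
    """Zero the chosen (position, feature) pairs in the SAE feature basis."""
    def hook(acts, hook_point):
        feats = sae.encode(acts)
        ablated = feats.clone()
        for pos, feat in top_pairs:
            ablated[:, pos, feat] = 0.0
        # Apply only the change, leaving the SAE reconstruction error untouched.
        return acts + sae.decode(ablated) - sae.decode(feats)
    return hook

def generate_with_ablation(model, sae, prompt_tokens, top_pairs, max_new_tokens=128):
    with model.hooks(fwd_hooks=[(sae.cfg.hook_name, make_ablation_hook(sae, top_pairs))]):
        # use_past_kv_cache=False keeps absolute prompt positions valid inside the hook.
        return model.generate(prompt_tokens, max_new_tokens=max_new_tokens,
                              do_sample=True, temperature=0.8,
                              use_past_kv_cache=False)
```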

Method

Metrics

 

Results

 

We report the average difference in key metrics between the Ablated Generation and the Base Generation.

The full results are in tables in the link post.

If we average the ablation effects on generation across topics, we get the following key metrics.

 

| SAE           | diversity_mean | diversity_variance | n_items_mean | n_items_variance |
|---------------|----------------|--------------------|--------------|------------------|
| attn_layer_14 | 0.111443       | 0.048028           | 1.963043     | 6.213043         |
| attn_layer_18 | 0.065051       | 0.028595           | 0.718116     | 1.526087         |
| attn_layer_2  | 0.221969       | 0.064955           | 3.265217     | 8.373913         |
| attn_layer_22 | 0.069707       | 0.035557           | 0.989855     | 2.765217         |
| attn_layer_7  | 0.177179       | 0.043048           | 1.983333     | 3.073913         |
| res_layer_0   | 0.628397       | 0.325129           | 22.221739    | 82.069565        |
| res_layer_7   | 0.329185       | 0.080516           | 7.098551     | 21.591304        |
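The aggregation behind this table could be produced with a small pandas group-by; the dataframe layout (one row per SAE × topic, with base and ablated values for each metric) and the file name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical results file: one row per (sae, topic) with metrics for the base
# and ablated generations, e.g. columns "diversity_base", "diversity_ablated", ...
df = pd.read_csv("ablation_results.csv")

for metric in ["diversity", "n_items"]:
    df[f"{metric}_diff"] = df[f"{metric}_ablated"] - df[f"{metric}_base"]

summary = df.groupby("sae").agg(
    diversity_mean=("diversity_diff", "mean"),
    diversity_variance=("diversity_diff", "var"),
    n_items_mean=("n_items_diff", "mean"),
    n_items_variance=("n_items_diff", "var"),
)
print(summary)
```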

 

One impressive example of this is the RS SAE at layer 0: upon close inspection, the top-5 features for this SAE were closely related to the "short" token, and after ablation the average number of items is close to that obtained with Template 2.

For the other SAEs, the resulting number of items was close to that obtained with Template 3.

The SAEs with less than 1 extra item on average can be classified as not effective at suppressing the list-ending behavior.

Conclusions and Future Directions

In summary, applying mechanistic interpretability techniques to real-world problems is not straightforward, and many compromises and assumptions must be made. This work is intended as a first approximation to using MI techniques in scenarios that more closely resemble the real world.

Even though some progress has been made in attacking this problem, this investigation falls short of the goals of mechanistic interpretability.

While the empirical results are encouraging (full elimination of the list-ending behavior with just 5 edits), they are not fully satisfactory.

Even though ablating some key features prevents the list-ending behavior, other simple methods, such as restricted sampling or steering vectors, can achieve the same effect.

Future Work

Several areas for future research have been identified:

  1. Benchmarking different heuristics for selecting layers based on memory versus circuit faithfulness.
  2. Applying established techniques for feature explanation to the relevant dataset.
  3. Utilizing Transcoders and hierarchical attribution to develop complete linear circuits.
