Posts

Characterizing stable regions in the residual stream of LLMs 2024-09-26T13:44:58.792Z
Evaluating Synthetic Activations composed of SAE Latents in GPT-2 2024-09-25T20:37:48.227Z
AISC project: TinyEvals 2023-11-22T20:47:32.376Z
Polysemantic Attention Head in a 4-Layer Transformer 2023-11-09T16:16:35.132Z
An adversarial example for Direct Logit Attribution: memory management in gelu-4l 2023-08-30T17:36:59.034Z
A circuit for Python docstrings in a 4-layer attention-only transformer 2023-02-20T19:35:14.027Z

Comments

Comment by Jett Janiak (jett) on Characterizing stable regions in the residual stream of LLMs · 2024-09-26T13:49:52.388Z · LW · GW

I believe there are two phenomena happening during training:

  1. Predictions corresponding to the same stable region become more similar, i.e. stable regions become more stable. We can observe this in the animations.
  2. Existing regions split, resulting in more regions.

I hypothesize that:

  1. The first could be some kind of error correction: models learn to rectify errors coming from superposition interference or other kinds of noise.
  2. The second could be interpreted as more capable models picking up on subtler differences between prompts and adjusting their predictions accordingly.

Comment by Jett Janiak (jett) on AIS terminology proposal: standardize terms for probability ranges · 2024-08-30T17:55:15.339Z · LW · GW

See Scott's "In Continued Defense Of Non-Frequentist Probabilities".

Comment by Jett Janiak (jett) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-05-17T10:08:29.713Z · LW · GW

This is such a cool result! I tried to reproduce it in this notebook.

Comment by Jett Janiak (jett) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-05-17T09:47:28.554Z · LW · GW

For the two sets of mess3 parameters I checked, the stationary distribution was uniform.
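
A check like this can be sketched with plain power iteration. This is a minimal illustration, not the code used for the check: the function name `stationary_distribution` is hypothetical, and the example matrix is a made-up symmetric chain, not actual mess3 parameters. It assumes the hidden-state transition matrix is already available as a row-stochastic matrix (for an HMM whose transitions emit tokens, that would typically be the sum of the token-labeled transition matrices).

```python
def stationary_distribution(transition, iterations=1000, tol=1e-12):
    """Approximate the stationary distribution of a row-stochastic matrix.

    Uses power iteration: repeatedly apply dist <- dist @ transition until
    the distribution stops changing. Assumes the chain is irreducible and
    aperiodic, so that the iteration converges to a unique distribution.
    """
    n = len(transition)
    # Start from a point mass on state 0 so that a uniform result is not
    # simply an artifact of a uniform starting point.
    dist = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        new = [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    return dist


# Hypothetical 3-state example (NOT real mess3 parameters): the matrix is
# symmetric, hence doubly stochastic, so its stationary distribution is uniform.
example = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
pi = stationary_distribution(example)
is_uniform = all(abs(p - 1.0 / len(pi)) < 1e-9 for p in pi)
```

Any doubly stochastic transition matrix (rows and columns each summing to 1) has a uniform stationary distribution, so a symmetric parameterization would give a uniform result for every parameter setting rather than only the ones checked.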

Comment by Jett Janiak (jett) on A Comprehensive Mechanistic Interpretability Explainer & Glossary · 2023-10-09T08:41:10.458Z · LW · GW

The activation patching, causal tracing and resample ablation terms seem to be out of date, compared to how you define them in your post on attribution patching.