Posts
Characterizing stable regions in the residual stream of LLMs
2024-09-26T13:44:58.792Z
Evaluating Synthetic Activations composed of SAE Latents in GPT-2
2024-09-25T20:37:48.227Z
AISC project: TinyEvals
2023-11-22T20:47:32.376Z
Polysemantic Attention Head in a 4-Layer Transformer
2023-11-09T16:16:35.132Z
An adversarial example for Direct Logit Attribution: memory management in gelu-4l
2023-08-30T17:36:59.034Z
A circuit for Python docstrings in a 4-layer attention-only transformer
2023-02-20T19:35:14.027Z
Comments
Comment by
Jett Janiak (jett) on
Characterizing stable regions in the residual stream of LLMs ·
2024-09-26T13:49:52.388Z ·
LW ·
GW
I believe there are two phenomena happening during training:
- Predictions corresponding to the same stable region become more similar, i.e. stable regions become more stable. We can observe this in the animations.
- Existing regions split, resulting in more regions.
I hypothesize that:
- The first could be some kind of error correction: models learn to rectify errors coming from superposition interference or another kind of noise.
- The second could be interpreted as more capable models picking up on subtler differences between the prompts and adjusting their predictions accordingly.
Comment by
Jett Janiak (jett) on
AIS terminology proposal: standardize terms for probability ranges ·
2024-08-30T17:55:15.339Z ·
LW ·
GW
Comment by
Jett Janiak (jett) on
Transformers Represent Belief State Geometry in their Residual Stream ·
2024-05-17T10:08:29.713Z ·
LW ·
GW
This is such a cool result! I tried to reproduce it in this notebook.
Comment by
Jett Janiak (jett) on
Transformers Represent Belief State Geometry in their Residual Stream ·
2024-05-17T09:47:28.554Z ·
LW ·
GW
For the two sets of mess3 parameters I checked, the stationary distribution was uniform.
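For anyone who wants to repeat this kind of check, here is a minimal sketch, assuming the process is summarized by a row-stochastic transition matrix over its hidden states. The function name `stationary_distribution` and the matrix `T` are illustrative placeholders, not the actual mess3 parameters and not code from the notebook.

```python
def stationary_distribution(T, tol=1e-12, max_iters=10_000):
    """Power iteration for pi satisfying pi = pi @ T, with T row-stochastic."""
    n = len(T)
    pi = [1.0] + [0.0] * (n - 1)  # deliberately non-uniform starting point
    for _ in range(max_iters):
        # One step of the chain: new_pi[j] = sum_i pi[i] * T[i][j]
        new_pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi


# Hypothetical symmetric 3-state matrix (an assumption, NOT the mess3 parameters).
T = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]

pi = stationary_distribution(T)
is_uniform = all(abs(p - 1.0 / len(T)) < 1e-9 for p in pi)
print(pi, is_uniform)
```

Power iteration is used here only because it needs no dependencies. It converges for ergodic, aperiodic chains, and any matrix whose columns also sum to 1 will give a uniform result.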
Comment by
Jett Janiak (jett) on
A Comprehensive Mechanistic Interpretability Explainer & Glossary ·
2023-10-09T08:41:10.458Z ·
LW ·
GW
The activation patching, causal tracing, and resample ablation terms seem to be out of date compared to how you define them in your post on attribution patching.