Sparse Coding, for Mechanistic Interpretability and Activation Engineering

post by David Udell · 2023-09-23T19:16:31.772Z · LW · GW · 7 comments

Contents

  Introduction
  Technical Argument from Sparse Coding Theory
  Autoencoder Interpretability
    Pythia 70M
    Llama-2 7B
    Neuron Interpretability Baseline
  Path to Impact: Learning Windows into Models?
  Conclusion
  Pythia 70M Autoencoder Data
    Layer 1
    Layer 2
    Layer 3
    Layer 4
    Layer 5
  Llama-2 7B Autoencoder Data
    Layer 13

Especial thanks to Logan Riggs and Monte MacDiarmid, for pointing me towards this whole research direction and for code discussion, respectively. Thanks to Alex Turner for project feedback and for orienting me towards scaling activation engineering up to larger models. Thanks to Adrià Garriga-Alonso, Daniel Kokotajlo, Hoagy Cunningham, Nina Rimsky, and Garrett Baker for discussion and/or draft comments. And thanks to anyone I discussed this with!

TL;DR: To separate out superimposed features represented by model neurons, train a sparse autoencoder on a layer's activations. Once you've learned a sparse autoencoding of those activations, this autoencoder's neurons can now be readily interpreted.

Introduction

All code hosted at this repository: activation_additions/sparse_coder

A bit ago, I became interested in scaling activation engineering to the largest language models I could. I was initially surprised at how effective the technique was for being such a naive approach, which made me much more enthusiastic about simple manipulations of model activation spaces.

Yudkowsky says that we cannot expect to survive without a mathematical understanding, a guiding mathematical framework, of the AI. One hunch you might have is that a linear feature combination theorem could be the root of such a guiding theory. If so, we might learn a lot about the internal learned mechanisms of models by playing with their activation spaces. I feel like tuned lens and activation additions are some evidence for this hypothesis.

One major problem I experienced as I scaled up activation engineering to the largest models I could get my hands on (the new open-source Llama-2 models) was that it's hard to guess ahead of time which additions will work and which won't. You generate a new addition and stick it into a forward pass. Then, you get a few bits back observing how well the addition worked. "It would have been great," I thought, "to get a window into which concepts the model represents internally, and at which layer it does so."[1]

Sparse coding excited me at this point, because it suggested a way to learn a function from uninterpretable activations to represented, interpretable concepts! Paired with activation engineering's function from interpretable concepts to model internal activations, it sounded like a promising alignment scheme. Now, many things sound promising ahead of time. But seeing the MATS 4 Lee Sharkey team get extremely clean, concrete results on Pythia drove my confidence in this path way up.

This is the writeup of that research path. I still think this is an extremely promising interpretability path, about as important as activation engineering is.

What I do is:

  1. collect model activations at a layer,
  2. train an autoencoder on those activations with an L1 sparsity penalty, and
  3. interpret the neurons of the trained autoencoder.

The neurons in the autoencoder then appear meaningful under top-token visualizations!

Technical Argument from Sparse Coding Theory

Epistemic status: Theoretical argument.

Say you collect a bunch of activation vectors from a particular layer of a trained model, during some task. These activation vectors are generally not natively interpretable. They're vectors in some space... but we have no real understanding of the meanings of that space's basis dimensions. We only know that all those activation spaces, passed through in sequence, yield coherent English speech. English concepts are being represented in there, internally, somewhere. But we don't really know how.

The problem is that there is no privileged basis in a transformer's activation space. The model was incentivized during training to learn every classifier it needed to mirror its training distribution. But there was no training incentive for each classifier to correspond to a single neuron. The training distribution is sparse: you don't need to be ready to represent each concept independently of every other concept. The training incentive actually weighed against the one-to-one neuron solution, then, as that's wasteful in weights. So there's plenty of mechanistic reason for a model's neuron activations to look like jumbled messes to us. To exploit a sparse world, learn densely compacted features.

And the solution we empirically see learned is indeed superimposed features! Don't dedicate a neuron to each feature. Have each neuron represent a linear combination of features. For this reason, all the directions in an activation space will tend to be polysemantic. If you just run PCA on an activation space, the resulting directions will often be frustratingly polysemantic.[2]

Sparse coding[3] is a solution to this superposition-of-features problem. You train autoencoders with an L1 sparsity penalty on the activations collected from a model layer. The autoencoder can be as simple as a tied matrix, then a ReLU, then the tied matrix transpose. The learned matrix together with the ReLU maps to a larger projection space. An L1 penalty is applied during training to autoencoder activations in this large projection space. The autoencoder is trained to reproduce the input activations while simultaneously respecting the L1 internal representation penalty.

We're interested in particular solutions to this formal problem: learn to give each feature a neuron, i.e., have features fall along the standard basis. This way, the L1 penalty gives good values: most of your autoencoder activation values will be precisely zero. (An L1 penalty yields a constant negative gradient to the extent that there are non-zero elements in the autoencoder's activations.) If the activation vectors are just linearly superimposed feature dimensions, then separating them out and squeezing them back together in this way should reproduce the original vectors. That will satisfy the reproduction loss, too.
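
Written out, the objective described above looks roughly like the following. This is a sketch under assumptions: the symbols ($x$ for an activation vector, $W$ for the tied matrix, $h$ for the autoencoder activations, $\lambda$ for the penalty weight) are labels chosen here for illustration, biases are omitted, and a squared-error reconstruction term is assumed.

```latex
\begin{aligned}
h &= \mathrm{ReLU}(x W), \qquad W \in \mathbb{R}^{d \times m},\; m > d \\
\hat{x} &= h W^{\top} \\
\mathcal{L}(W) &= \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert h \rVert_1 \\
\frac{\partial}{\partial h_i}\, \lambda \lVert h \rVert_1 &= \lambda \quad \text{for } h_i > 0
\end{aligned}
```

Because $h \ge 0$ after the ReLU, the last line says every active unit is pushed towards zero at the same rate $\lambda$, however large or small its activation is. That is the constant gradient referred to above, and it is what drives most entries of $h$ to exactly zero.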

We train such an autoencoder to convergence, driving towards an  value of between  (in smaller models) and  (in larger models). We save the trained autoencoder and examine its standard basis. Empirically, these neuronal directions appear quite semantically meaningful!
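
To make the training step concrete, here is a minimal NumPy sketch of a tied-weight ReLU autoencoder trained with an L1 penalty. It is not the code from the repository: the function name, the hyperparameters, the hand-written full-batch gradient descent, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def train_sparse_autoencoder(acts, proj_dim, l1_coeff=1e-3, lr=0.05, steps=300, seed=0):
    """Full-batch gradient descent on a tied-weight ReLU autoencoder.

    acts: (n, d_model) array of collected activations.
    Returns the learned (d_model, proj_dim) matrix and the per-step losses.
    """
    rng = np.random.default_rng(seed)
    d_model = acts.shape[1]
    W = 0.1 * rng.standard_normal((d_model, proj_dim))
    losses = []
    for _ in range(steps):
        pre = acts @ W                    # project up: (n, proj_dim)
        codes = np.maximum(pre, 0.0)      # ReLU
        recon = codes @ W.T               # tied decoder: (n, d_model)
        resid = recon - acts
        loss = np.mean(resid ** 2) + l1_coeff * np.mean(np.abs(codes))
        losses.append(float(loss))
        # Backward pass written out by hand.
        g_recon = 2.0 * resid / resid.size
        g_codes = g_recon @ W + l1_coeff * np.sign(codes) / codes.size
        g_pre = g_codes * (pre > 0)       # ReLU passes gradient only where active
        grad_W = acts.T @ g_pre + g_recon.T @ codes   # encoder path + decoder path
        W -= lr * grad_W
    return W, losses

# Toy data (hypothetical): 16 sparse features superimposed into 8 dimensions.
rng = np.random.default_rng(1)
true_dict = rng.standard_normal((16, 8))
sparse_feats = rng.random((256, 16)) * (rng.random((256, 16)) < 0.1)
toy_acts = sparse_feats @ true_dict

W, losses = train_sparse_autoencoder(toy_acts, proj_dim=32)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In practice you would swap the hand-written gradients for an autodiff framework and an optimizer, and tune `l1_coeff` until the sparsity of the autoencoder activations lands where you want it.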

Autoencoder Interpretability

Epistemic status: Experimental observations. There's a robust effect here... but my code could absolutely still contain meaningful bugs.

Pythia 70M

Let's examine autoencoders trained at each of Pythia 70M's layers. Our interpretability technique is checking which tokens in the prompt most activate a given autoencoder neuronal direction.
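
As a rough sketch of that top-token check, assuming you already have the prompt's tokens and the matching autoencoder activations: the function name, the max-per-token aggregation, and the toy values below are illustrative assumptions, not the repository's implementation.

```python
import numpy as np

def top_tokens_per_dimension(tokens, codes, k=6):
    """Map each autoencoder dimension to its k most-activating distinct tokens.

    tokens: list of n token strings, one per prompt position.
    codes:  (n, proj_dim) array of autoencoder activations at those positions.
    Dimensions with no positive activation are dropped entirely.
    """
    results = {}
    for dim in range(codes.shape[1]):
        best = {}  # strongest positive activation seen for each token
        for tok, act in zip(tokens, codes[:, dim]):
            if act > 0 and act > best.get(tok, 0.0):
                best[tok] = float(act)
        if best:
            ranked = sorted(best, key=best.get, reverse=True)
            results[dim] = ranked[:k]
    return results

# Toy example with made-up activations for three dimensions.
tokens = [" Paris", " Madrid", " what", " Paris", " the"]
codes = np.array([
    [0.9, 0.0, -0.2],
    [0.7, 0.0, -0.1],
    [0.0, 1.2, -0.3],
    [0.4, 0.0, -0.5],
    [0.0, 0.1, -0.4],
])
top = top_tokens_per_dimension(tokens, codes, k=2)
print(top)
```

Here dimension 2 never goes positive, so it is dropped, which mirrors how all-negative directions are excluded from the tables below.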

For each Pythia autoencoder, here are ten unsorted non-zero directions and their favorite tokens:[4]

Layer 1
[Dimension] [Top Input Tokens]
2  holding,  speak,  remember,  read,  learn,  hears
11 :, )?
76  commissioned,  gear,  generate,  mixed,  conclude,  credit
124  what, What,  What, what
133  equally,  most,  deeply,  relatively,  greater, more
166  civil,  loan
183  because,  still, although,  Because,  since,  although
191  Cl,  Sn,  L,  Le,  Mes,  Mon
206 New,  New,  popular,  ',  old,  handsome
236  L,  l,  O, .,  unl,  Fl
Layer 2
[Dimension] [Top Input Tokens]
26 !", ", ...", "., '.
88  Yes, clusively, iably,  vertically,  right
96  What, What,  How, what,  what,  how
154  US,  Americas,  Netherlands,  Massachusetts,  States, bourg
158  presidents,  pilots,  Scholars,  founders,  Ts,  Doctors
171  you, 'll, ),  will,  we,  if
185  They,  they,  she,  he
243 iless,  prohibiting,  custody,  needs,  permission
269  impressive,  vast,  cultural,  sports,  musical,  great
461  sites,  facilities, une,  board,  School,  Jo
Layer 3
[Dimension] [Top Input Tokens]
79  Nik,  Ir,  Two,  Poland,  Pol,  spectacular
153  biological, iga
156  attracted,  rescued,  confined,  trouble,  provided,  avoided
167 ft,  Lis, bo, ifer,  Loren
244 (, 6, 5, 3, 7, 4
349 Ċ, ard, ifer, ruct, ively, stra
507 32,  1950,  Pole, ple, isation, number
714  Anto, controll,  along, ri, waters, rans
779  Cro, stra,  Cron,  Bar,  Knowledge,  Crick
811 bar, lang, rio,  McC, oph, off
Layer 4
[Dimension] [Top Input Tokens]
114 Q,  unequal,  Gulf,  Tenn,  extr,  GDP
171  ours,  various,  instantly,  exact,  technically, Ċ
213 och,  Walt,  corner,  length,  composition,  dose
229 och,  Little,  mention, ot, af, /
266 A,  15,  atomic, Ċ,  official,  My
386  Dec,  Rod,  send,  Cron,  catar,  tou
408  grant,  Priv,  genuine,  absolute,  typically,  legally
472  smell,  Jupiter,  auditory,  thinkers,  Venus,  razor
547 Dec
647 och,  length,  dose
Layer 5
[Dimension] [Top Input Tokens]
83  penetrate, ensory,  breathe,  bites,  distract, end
291  fats,  sequences, ats, who,  miracles, isions
367  deepest,  official,  perfect,  atomic,  presidential, digit
444 2, 1, 6, 3, 4, 7
556  Cash,  Hillary, Q,  Bond, go,  Tea
560 becoming
567 2, 3, 1, 4, 6, 5
587 Return,  atomic,  Person,  official,  composed, room
594  stayed,  although,  lacks, although,  poorer,  It
646  Be, &,  che,  Che

Full model results in footnote.[5]

In theory, these are all of the features represented in Pythia 70M's residual streams when these activations were collected. If the technique were extended to a representative dataset and to every Pythia sublayer, you'd in principle enumerate every single concept in Pythia.

Empirically, layers 1 and 2 (the two residual spaces right after the embedding layer) are the most interpretable of the bunch. Later layers are more garbled, though some clearly meaningful dimensions exist there too.[6]

Note that the interpretability method used on the autoencoders (top-k tokens in the prompt) is relatively naive. I have code for activation heatmaps and direction ablations[7], and those interpretability techniques may capture meaning that top-k tokens miss. Any interpretability technique you have for model neurons... can be applied to sparse autoencoder neurons too.

Llama-2 7B

The above results are my independent replication of the MATS 4 Lee Sharkey team's Pythia sparse coding. What if we scale the technique? Targeting a layer similarly early in the model, we train an autoencoder on Llama-2 7B:

Layer 13
[Dimension] [Top Input Tokens]
34 ▁All
109 2, 3, 2004
120 <s>
127 ▁England, ▁dollars, ▁Italian
206 ▁means, ▁refers, ▁composed, ▁learned, ▁hid, ▁she
207 ▁society, ▁portal, ati, unker, ▁Order, ▁mission
253 ▁said, ▁wrote, ▁designed, ▁statement, ▁directed, elled
277 ▁dan, ▁po, ▁dess, ▁Know, ▁conce, ▁Har
328 <s>
331 ▁program, ▁intelligence, ▁computer, ▁artificial, I, ▁Rob

Full layer results in footnote.[8]

 seems too low for the autoencoders trained on Llama-2 7B. These Llama-2 results are instead at .[9] Still better interpretability results could be obtained if this range of sparsity values was better explored.

Neuron Interpretability Baseline

If you directly interpret model neurons on Llama-2 7B using the top-k technique, your results look like this:

Layer 13
[Neuron] [Top Input Tokens]
0 ▁Rafael, ▁animation, ovo, ▁beneath, ▁commun, ▁Cross
1 ▁Hero, emor, action, ▁Indones, ▁expedition, immer
2 ▁bus, ▁Sund, ▁top, ▁marriage, ander, ▁breakfast
3 ▁predict, ▁Ald, ▁phase, ▁overcome, rin, ▁Joy
4 related, ▁lazy, round, ▁Nev, UI, ▁atmosphere
5 ▁trans, gu, isted, ▁portal, ▁tiny, laimed
6 ija, ▁Chief, ▁measures, ▁valuable, space, ▁testing
7 ond, ▁lazy, ▁Virgin, tes, ▁conquer, ▁uniform
8 ▁Valley, ctions, round, ▁measures, ▁facilities, ▁variable
9 ▁ways, ▁definitely, isation, ▁elements, enta, ▁expl

Path to Impact: Learning Windows into Models?

Epistemic status: Wild speculation.

The above suggests that we can train windows into each layer of a model. Each autoencoder window tells you what's going on at that layer, in human-comprehensible terms. The underlying forward pass is unaltered, but we know what concepts each layer contains.

Because you know how those concepts are mapped out of the model into the autoencoder, they are also ready to be added in through activation engineering! So you already have some interpretability and steering control.
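
One way this could look, as a speculative sketch: take the autoencoder's direction for a chosen feature and add a scaled copy of it to the layer's activations. The helper name, the unit-normalization, and the coefficient are hypothetical choices made for illustration, and the random matrices stand in for a trained autoencoder and real activations.

```python
import numpy as np

def add_feature(resid_acts, W, dim, coeff=5.0):
    """Add a scaled copy of one autoencoder feature direction to activations.

    resid_acts: (n, d_model) activations at the layer the autoencoder saw.
    W:          (d_model, proj_dim) tied autoencoder matrix.
    dim:        index of the feature to steer towards.
    """
    direction = W[:, dim] / np.linalg.norm(W[:, dim])  # unit vector in model space
    return resid_acts + coeff * direction              # broadcast over positions

# Stand-in values; a real run would use a trained W and hooked activations.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
resid = rng.standard_normal((4, 3))
steered = add_feature(resid, W, dim=2, coeff=5.0)
```

The steered activations would then be written back into the forward pass at that layer, in the same way as any other activation addition.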

More ambitiously, we can now try to reconstruct comprehensible model circuits. With ablations, see which features at layer n affect which features at layer n + 1. Measuring the impact of features on downstream features lets you build up an interpretable "directed semantic graph" of the model's computations.
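
A toy sketch of how such an ablation sweep might be organized, assuming one tied autoencoder per layer and some callable that maps one layer's activations to the next. Every name here is hypothetical; `layer_fn` stands in for the real model block, and the identity matrices stand in for trained autoencoders.

```python
import numpy as np

def feature_effects(acts, W_here, W_next, layer_fn):
    """Estimate how ablating each upstream feature shifts downstream features.

    acts:     (batch, d_model) activations at the upstream layer.
    W_here:   (d_model, m_here) tied autoencoder matrix for the upstream layer.
    W_next:   (d_model, m_next) tied autoencoder matrix for the downstream layer.
    layer_fn: callable mapping upstream activations to downstream activations.
    Returns an (m_here, m_next) array of mean absolute changes.
    """
    codes = np.maximum(acts @ W_here, 0.0)
    # Baseline uses the reconstruction, so ablation is the only difference.
    baseline = np.maximum(layer_fn(codes @ W_here.T) @ W_next, 0.0)
    effects = np.zeros((W_here.shape[1], W_next.shape[1]))
    for i in range(W_here.shape[1]):
        ablated = codes.copy()
        ablated[:, i] = 0.0  # knock out upstream feature i
        downstream = np.maximum(layer_fn(ablated @ W_here.T) @ W_next, 0.0)
        effects[i] = np.abs(downstream - baseline).mean(axis=0)
    return effects

# Toy check: identity autoencoders and an identity "layer".
acts = np.array([[1.0, 2.0], [3.0, 0.0]])
effects = feature_effects(acts, np.eye(2), np.eye(2), lambda a: a)
edges = np.argwhere(effects > 0.5)  # (upstream, downstream) pairs above a threshold
```

Thresholding the effects matrix gives candidate edges for the graph. With real models, each upstream feature would only influence a handful of downstream features if the picture in this section holds.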

This especially is really good stuff. If you can reconstruct the circuits, you can understand the model and retarget its search algorithms. If you can understand and align powerful models, you can use those models as assistants in yet more powerful model alignment.

Conclusion

I've replicated prior sparse coding work and extended it to Llama-2 7B. I'm hoping to keep at it and get results for Llama-2 70B, the best model that I have access to.

Generally, I feel pretty excited about simple modifications to model activation spaces as interpretability and steering techniques! I think these are worth putting points into, as an alignment bet independent of RLHF.

  1. ^

    I was specifically hunting for a "truthiness" activation addition to move around TruthfulQA benchmarks. (I am unsure whether the techniques covered in the post are, in practice, up to programmatically isolating the "truthiness" vector.)

  2. ^

    Or to an AI assistant helping you interpret neurons in a model.

  3. ^

    Also known as "sparse dictionary learning."

  4. ^

    Underlying Pythia activations were collected during six-shot TruthfulQA. (Six shot is standard in the literature.) This is a far smaller dataset than The Pile, so this was also an experiment in small dataset sparse coding.

    I project to a -dimensional space from Pythia's -dimensional activation space. Negative token activations are excluded, since the ReLU would zero all of those out—destroying any information negative values might contain.

    So, directions with all negative values are dropped—notice that that's most directions! Only about  in  are kept.

  5. ^

    Pythia 70M Autoencoder Data

    Layer 1

    [Dimension] [Top Input Tokens]
    2  holding,  speak,  remember,  read,  learn,  hears
    11 :, )?
    76  commissioned,  gear,  generate,  mixed,  conclude,  credit
    124  what, What,  What, what
    133  equally,  most,  deeply,  relatively,  greater, more
    166  civil,  loan
    183  because,  still, although,  Because,  since,  although
    191  Cl,  Sn,  L,  Le,  Mes,  Mon
    206 New,  New,  popular,  ',  old,  handsome
    236  L,  l,  O, .,  unl,  Fl
    254  been,  be
    286 The,  The,  decreased,  unclear,  higher,  decrease
    313  month,  months
    393  stunt,  pylori,  psychological,  penal,  methodology,  punished
    455  You, kur, You,  you, (
    509 vis, kur, bron,  butter, ater, iele
    612 that
    641  Marc,  Justin,  Jonathan,  Milton,  Jeff,  Moz
    675 high
    708 Ċ,  over,  getting,  taking,  pushing,  coming
    728 into
    733 ll,  cliff,  course,  should
    816  University,  City,  universities,  Airport,  Harvard,  campus
    859  Milky,  gum
    861  The, The,  Three,  Two,  Our,  His
    986  ign,  dens,  gy,  acupuncture,  undergraduate
    989  handsome,  Johann,  deeply,  originated,  disguised,  hungry
    1051  knew,  worked,  tells,  get,  knows,  won
    1138  Millenn,  UFO,  Gandhi,  Herman,  Disney,  Smith
    1148  Real,  vamp,  Mant,  Ch,  real,  mat
    1176  salts,  pesticides,  mushrooms,  spiders,  fluids,  fertil
    1182 fra,  schizophren,  Jedi, kur,  catar,  Hitler
    1201  deeply,  career,  critically,  psychological
    1210  Can,  can,  did,  Do,  Did,  could
    1229 st, gs,  never, bra, 't, ieri
    1339  makes,  make,  making,  made,  How,  how
    1387  I, I
    1423  who, who,  Who,  which,  where
    1452 Q,  U
    1472 orc, father,  Dam,  Neumann,  Auto, arth
    1484  horses,  has,  have,  adolescence,  burgers,  ribs
    1540  Algebra,  databases
    1595  Toronto,  Madrid,  Munich,  Dublin,  Paris,  Barcelona
    1612 Nobel
    1647  what,  something
    1652  family,  US,  their,  national,  parents,  mothers
    1699  are,  aren,  Are,  is,  were,  Were
    1724  Iceland,  Finland,  Ireland,  Poland,  Switzerland,  Italy
    1725  turkey,  hunting,  salad,  nausea,  meat,  transportation
    1861 2,  2,  Trans,  Cre, bra,  Two
    1864 "., '., "?, !", '?, ?"
    1868  DNA,  hair,  monkey,  gun,  palm,  doll
    1878 6, 7, 9, 8, 12, 13
    1965 What, what,  What,  what
    1997  all,  turn,  turned,  All, All,  both
    2000  lead,  La,  Flight,  passenger, ke, ib
    2024  that,  a
    2125  On,  on,  P, on,  R,  Ch
    2136  getting,  takes,  get,  taking
    2144  1970,  1950,  1990, 1990,  II, clusive
    2165 lying
    2233  detox,  patrol,  extras,  dishon,  massacre,  purge
    2247  your,  Your,  my,  our,  his,  Our
    2352  What,  Nobel, What,  How,  what,  Why
    2427 ?, '?, )?, "?, .?, ?"
    2438 ensory, rugu, N, 3, rist, carb
    2505  comment,  specify,  like
    2509  humans,  Canadians,  Australians,  Americans,  Iceland,  Europe
    2568 A,  A
    2580  people,  thinkers,  everyone,  participants,  People,  Americans
    2610  chili,  purple,  pink,  pepper,  Yes,  dessert
    2679 What
    2719  traditional,  legal,  organized,  alloc,  accessible,  legally
    2728  90,  86,  twenty,  13,  heavens,  12
    2729 shouldn
    2764  Swift,  Harvard,  York, Mind,  Mex, fat
    2765  everyone,  every,  Everyone,  across,  many,  Many
    2814  whether,  or,  where,  unless,  and,  When
    2825 imps, igm, ringer, dig, recogn, uj
    2941 Big,  mathematical,  Neural, New
    2955  It,  it, it,  All, All,  all
    2995  more,  less, more,  More,  fewer,  harder
    3021  used,  summoned,  displayed,  removed,  accessible,  useful
    3071  doesn,  don,  didn,  shouldn,  Barack,  Ad
    3104  Economics,  Knowledge,  Diet,  Med,  Psych, iology
    3131  Does,  Did, 's,  Do,  does,  Which
    3132  Jacksonville,  Indianapolis,  University,  Paso,  Angeles,  Carolina
    3149  humid,  directly,  criminal,  penal,  dishon,  bankruptcy
    3160  San, San,  New, New,  Sant,  Carn
    3163  other,  Which
    3175 Earth
    3202  What, What,  How,  what, what,  McC
    3296 :,  ",  all,  your
    3316  ",  '
    3365  illegal,  legal,  legally,  human,  Legal
    3379 S
    3380  Marl,  boil,  If,  melt,  struck, If
    3458 5,  5,  five
    3468  immune,  unequiv,  payment,  proportion,  millions,  billions
    3476  who, who
    3578  actor,  scientist,  lawyer,  engineer,  sailor,  artist
    3584 Q,  bl,  I,  Black, If,  if
    3600 ll,  will,  Will,  would,  By,  by
    3634 and
    3650  weather,  sun,  snow, Snow,  cold,  rain
    3658  ancestor,  father,  kidnapped,  witch,  husband,  assassination
    3685 In,  in,  In,  During,  during,  along
    3769 away,  away, atorium, work
    3808  home,  father,  US,  childhood,  house,  parents
    3826  swim,  rib,  tie,  doll,  wave,  stretch
    3968  Barack,  Bill,  Hillary, Bill,  George,  Michael
    3987  the, the, The,  The,  vom, 2
    4057  than,  Than,  like,  1960,  1961,  as
    4094  shown,  showed,  demonstrated,  showing,  show,  shows
    4226 )?, '?, ?, ifer, inc, itable
    4236  taking,  take
    4251  there,  There
    4265  consistently,  wildly
    4302  gluten,  steak,  salmon,  burger,  chicken,  straw
    4315  Way,  Tw, ),  mist,  Witch,  lying
    4334  Jenny, recogn, rico,  Jonathan, uj, ima
    4359  know,  knows,  knew
    4368  Prize,  word,  phrase,  result,  periods,  period
    4383  Gal, S
    4442 :
    4512  1990,  1960,  1950, 1990,  1970,  2000
    4554 rugu,  Denver,  Miami,  Washington,  Vancouver,  Luis
    4726 (
    4729  literally,  only,  just, clusively,  secretly, Only
    4762  smallest,  brightest,  best,  richest,  largest
    4789  flawed,  tiny,  burned,  impressive,  harder,  excessive
    4804 ,,  not, 't, not,  originally
    4808  The, The,  the,  That, the
    4842  always,  commonly,  remains, inally, cos, inc
    4865 Ċ
    4887  Ital,  Az, Ins,  intellig,  Mex,  Hind
    4902  No,  no,  Not, no,  Nothing,  Little
    4954  do,  Do,  does,  Did,  numbers,  real
    4966  best,  take,  taking,  good
    4996  decades,  Gates,  Way,  II,  Clinton,  years
    5025 (,  (,  Alban,  How,  Cran,  Massachusetts
    5036  oil,  breastfeeding,  alive,  smoke,  living,  women
    5041 ve
    5050  have, 've,  had,  Have,  has
    5052  a,  an,  What, What,  You,  With
    5062  you,  You,  only, You,  just,  Only
    5106  cl,  fro,  fl,  gr,  ch,  merc

    Layer 2

    [Dimension] [Top Input Tokens]
    26 !", ", ...", "., '.
    88  Yes, clusively, iably,  vertically,  right
    96  What, What,  How, what,  what,  how
    154  US,  Americas,  Netherlands,  Massachusetts,  States, bourg
    158  presidents,  pilots,  Scholars,  founders,  Ts,  Doctors
    171  you, 'll, ),  will,  we,  if
    185  They,  they,  she,  he
    243 iless,  prohibiting,  custody,  needs,  permission
    269  impressive,  vast,  cultural,  sports,  musical,  great
    461  sites,  facilities, une,  board,  School,  Jo
    463 ll,  will,  would,  might,  should,  could
    574  I,  i, I
    592  nothing,  Nothing
    593  heart,  world,  COVID,  cancer,  Christ,  body
    594 In,  In,  in
    665  stunt,  rule,  block,  triggered,  notice,  transform
    705  hasn,  continue,  won,  keeps,  doesn,  stops
    760  and,  but,  then,  eventually, although,  various
    808 that
    812 tal,  impressive, ,,  notable,  asking,  its
    858  blood,  Av,  birth,  University,  healthcare,  uterus
    870  reading, Fe,  Tele, Ind,  pre,  From
    958 1961
    987 19
    1050  People,  Pres,  Men,  people,  gu,  Humans
    1190  multid,  purpos,  carn,  catar,  incon,  unl
    1228  ancestry,  founder,  alumni,  citizens,  father,  personalities
    1230  col,  South,  Mill,  Ital,  Ge,  College
    1243 izen, ija, pro, &, bron,  1996
    1246  scholars,  citizens,  Democratic,  prosecutor,  personal,  community
    1256  Sugar,  cuisine,  Iron,  Fire,  Food,  Light
    1297 icides
    1303  Allied,  national,  domestic, rious,  Democratic, ied
    1332  your,  you,  Your,  You,  yourself, You
    1371 Qu,  Qu, uff,  Pink, IK, inj
    1419  sing,  ducks,  dancing,  golf,  rugby,  chocolate
    1421 acting, Qu, aupt, acking, fat, lim
    1428  most,  largest,  best,  Most,  closest,  biggest
    1441 ), 9,  (, 6, 5, 8
    1474  terrible, iless,  someone,  coworkers,  crimes,  a
    1488 ?, '?, "?, )?, ?", .?
    1510 onna, rico, clamation, anca,  Auto, oston
    1562 etics, icks, Mind, ences,  thinkers, ens
    1568 ois, ais, ela,  Amy, aqu, au
    1571  nothing,  effort,  consensus,  obligation,  Species
    1721 ails,  bites,  razor,  rifle,  tricks,  strikes
    1771 that
    1822  camel,  wolf,  Canadian,  witch,  tar,  lawyer
    1831 1,  used,  1
    1905  Yes,  What,  Can, Q,  No,  Prize
    1911  your,  my,  her
    1926  mankind,  crimes, space,  mentally,  officers,  brain
    1985 3, 2, 13, 7, 6, 9
    2052 (, '?, 10,  Yes,  not, 8
    2088 S, K
    2102 A,  A, An,  E,  An, E
    2120  ost,  nine,  80,  ra,  330,  yards
    2129  there,  There, ucer,  covered,  series,  coming
    2170  road,  sky, seat,  attractions,  film,  pavement
    2255  Puerto, 6,  Denver,  Vancouver,  Luxem,  Miami
    2259  Fund,  cost,  restrictions,  costs,  powers,  batteries
    2340  Way,  Valley, ),  Massachusetts,  Nevada,  Angeles
    2347  Declaration,  International,  Commonwealth,  national,  Cre,  The
    2411 5, 7, 6, 9, 8, 13
    2415  pillar,  object,  Angel,  Area,  circle,  Venus
    2471  band,  solo, rans,  canon,  penal,  electrical
    2519  living,  nearly,  unanimously,  expecting,  original,  just
    2548  ',  ", -,  per
    2621  Friedman,  labor, riminal,  republic,  politician,  Witch
    2642  no,  No, Ċ, rit, 't,  unlikely
    2749  be,  a,  unusually,  necessarily,  is,  an
    3020 :
    3049  spiritual,  Black,  Arab,  Hindu,  Ital,  biological
    3052 akes, idden, sea,  Go,  Dreams,  dream
    3068  there,  All, All,  Everyone,  There,  Have
    3097  67,  330,  94,  58,  44,  variable
    3117 if,  if,  If, If
    3123  You, You,  you
    3129  there,  There,  happens
    3149  (, U,  originally,  Future,  plans
    3211  used,  intended,  structured,  learn,  transported,  marry
    3275 estion,  Breast,  Honey,  infection, isexual,  Nut
    3324 If,  If,  if,  unless, if,  When
    3357  Kn,  Sn,  Tar,  Cr,  Sha,  Sant
    3422 orc, tec, amation,  injured, inkle,  evil
    3448  Norris,  valuable,  Bob,  Be,  no,  Col
    3459 ro,  third, one,  Most,  characteristic,  hypothetical
    3508  It,  it
    3514  blind,  hood, the,  gun,  tort,  cat
    3522  Space, iology,  speech,  Economics,  bricks, waters
    3563  than,  Than
    3602 rapy, izers, cards, uncture, illation,  Way
    3611  modeled,  achieved,  grant,  asking,  led,  lets
    3686 won
    3830  Brian, Bill,  Jeff,  David,  Robert,  James
    3838  Prize,  Nobel
    3851 istic, ormal,  unequal, izers, otion,  analogous
    3874  happen,  happens,  happened
    3894  By, uously,  No
    3936  All, All,  all,  Everyone,  everything,  everyone
    3939  has,  have,  claimed,  hasn, iably, 've
    3945  1990,  1970,  weeks,  1981,  2001,  1950
    3960  Way, inkle, well, inally, hr, beth
    3979  Az,  reign,  plan,  Pon,  tar,  ra
    4051  can,  could,  cannot,  may,  Can,  must
    4056  analogous, istic,  impressive, ined, rious,  affordable
    4066  analogous,  ideal,  am,  devoted,  unlikely
    4082  qual,  silver,  gall,  chocolate,  olive,  chess
    4152  boo,  try,  agree,  speak,  love,  notice
    4270  on,  onto,  On,  across,  against,  via
    4305 ",  New, !", New,  inaugural, "-
    4368 Q
    4382  mean,  estimated,  demonstrate, probably,  describes,  unlikely
    4498 ?, )?, "?, '?, .?, '.
    4628 What,  insulin,  salmon,  What,  butter,  oil
    4673 Q, What,  What,  did, Qu,  does
    4782 graph, Cap, :, ham,  (, rop
    4788  Sn,  Bl,  P,  p,  T,  pe
    4885 ,
    4892 by
    4907 Q
    4911  Quebec,  Massachusetts,  Toronto,  Dublin,  Paris, ivia
    4925 hasn
    4928 ?,  optimization, '?
    4941 &, illary, ering,  Dec
    4943  Sundays,  weekends,  minute,  evening,  sky,  midnight
    4987  obligated, istic,  impossible,  unlikely,  problem,  idea
    5104 An,  These,  Their,  They,  Only,  involves
    5106  Who, What,  What, what,  what,  Where
    5107  No,  impossible,  no,  unlikely,  Nothing, t

    Layer 3

    [Dimension] [Top Input Tokens]
    79  Nik,  Ir,  Two,  Poland,  Pol,  spectacular
    153  biological, iga
    156  attracted,  rescued,  confined,  trouble,  provided,  avoided
    167 ft,  Lis, bo, ifer,  Loren
    244 (, 6, 5, 3, 7, 4
    349 Ċ, ard, ifer, ruct, ively, stra
    507 32,  1950,  Pole, ple, isation, number
    714  Anto, controll,  along, ri, waters, rans
    779  Cro, stra,  Cron,  Bar,  Knowledge,  Crick
    811 bar, lang, rio,  McC, oph, off
    905  order,  Little,  exagger, U,  atmosphere,  sand
    932  there,  There,  lots,  You,  no,  It
    946  spoken,  Bach, Mar, Cap,  Dec,  modeled
    976  Cra,  Ber, aff,  Bach, ign,  Er
    1119 Q, izers, pen, bar, fe, oux
    1140 1, 2, ak,  unusual,  Story,  upon
    1176 Q
    1217 cards,  biological
    1230  moment,  career,  normal,  position,  condition,  tendencies
    1247  cooler,  II,  taxed,  sad,  bars,  decrease
    1408 :
    1605 ney, we, vere, ana,  Loren, rio
    1632 uther, ind, onna, lock,  Declaration, ler
    1637  colonial,  Hollywood,  Asian,  Indonesia,  Portuguese,  Florida
    1750  abundant,  useful, known,  affordable,  basic, cards
    1774 20, Ċ,  Related,  Rum,  thirty,  unequiv
    2008  million,  anymore
    2016  leg,  mist,  mant, ble,  watches, suit
    2031 immer, yl, away,  across, iz, wart
    2192 lang, bron
    2451  bites,  brush
    2455  Golden,  Elvis,  Solar, Steve,  ice,  chocolate
    2541  blocked,  parchment,  cocaine,  permission
    2588  highest,  position,  Declaration,  not
    2604  arms,  swe,  boss,  alcohol,  gum,  chairs
    2610  abdomen,  heavens,  mankind, sts, ano,  further
    2636 ll,  will,  should
    2657 S, rugu,  Belgium,  Italy,  Greece, K
    2658  Yellow, oused,  preferred
    2689  Yes,  No,  If,  Unknown,  There,  Only
    2720 Q,  Can, (,  eats,  tons,  love
    2728  Happ, of, law, app, war,  lots
    2780 rain,  head,  score
    2803  trans, path,  officers,  ceremonies, rop,  pilots
    2943  accept,  interrog,  teach,  predict,  inflict,  save
    2978 Lis
    3019 aro, iga,  Cape, asses, more
    3095 C, Tw, enth, na, ch, ISA
    3128  stove,  River,  cord,  investor,  bird,  Tri
    3213  Er,  biological,  Europeans,  AI,  em, brates
    3231  Rec,  Bel,  Ac,  Sch,  Ad, uther
    3281  inflation, ports,  yellow,  pneumonia,  video,  thirty
    3421 ual,  Col,  collected,  credited, und,  obligated
    3556  Can, Q,  Antar,  Have,  shower,  reb
    3632 aff, rapper
    3675 otics, ella, uit, ilis, icorn, ija
    3717 pp, ater,  ent,  responsible, aro,  refer
    3723  lazy, dig, cre,  talent,  skilled,  confined
    3744  kind,  weeks,  thirty, 1000,  backwards,  happens
    3799  Carl,  Bryant,  Holmes,  Freud,  Cunningham,  Curry
    3820  strikes,  such, rans,  strike,  hid,  when
    3873  less,  more,  decreased,  stayed
    3881  Tiger,  Pink,  Fire,  Sugar,  Birds,  Rich
    3901  than,  Than
    3956  U, tw,  Ang, 1990,  Lanc, Po
    4000  Pole,  atmosphere,  Building,  disorders,  pregnancy,  yours
    4054  unusual,  approved, named,  thousand,  several,  getting
    4140 A,  Allen,  An,  A,  Di,  com
    4157  describes,  whether,  third,  Because,  statements, ey
    4181  How,  Where, bla,  Known,  reduce, what
    4310 &
    4393 13, 3, 12, 5, 1, 4
    4426 aqu,  Ber, Mar, enn, oge
    4435  Queens, Real,  1961,  NY,  2003,  Trans
    4443  The,  Kon, The
    4469 ), :, illi,  Viol,  Spot, lim
    4530  relatively,  accessible
    4589  the,  your
    4616  recorded,  notable,  existing,  basic,  several
    4670  deprive,  historically,  recently, FK, shit, Bill
    4702  (, oph,  Rh,  Dec
    4709 bla, iga
    4724 ?, "?, .?, ?", '?,  upset
    4767  Despite,  Does,  unusual,  Do,  Did,  What
    4881  Sea,  grasp, Cap,  record, angle
    4956  tells, &,  contributes,  hasn,  comes,  came
    5021 Q,  Can
    5076  separately,  action,  grid,  lasts,  cleans,  plot
    5081  drops,  keeps,  reduces,  improves,  provides,  increases
    5108  minute,  average

    Layer 4

    [Dimension] [Top Input Tokens]
    114 Q,  unequal,  Gulf,  Tenn,  extr,  GDP
    171  ours,  various,  instantly,  exact,  technically, Ċ
    213 och,  Walt,  corner,  length,  composition,  dose
    229 och,  Little,  mention, ot, af, /
    266 A,  15,  atomic, Ċ,  official,  My
    386  Dec,  Rod,  send,  Cron,  catar,  tou
    408  grant,  Priv,  genuine,  absolute,  typically,  legally
    472  smell,  Jupiter,  auditory,  thinkers,  Venus,  razor
    547 Dec
    647 och,  length,  dose
    946 Q, ais, (,  smash, pir, iele
    1158 28
    1607 6, 5, 7, 3, 1, 4
    1635 length
    2327 digit,  deepest,  abundant,  official,  perfect, icking
    2448 ais, St,  arriv,  even,  pushing, ous
    2747 ais,  navig, ag,  Sov,  Kore,  Y
    2989 och, ala,  Knowledge,  participants
    3048  15,  atomic,  tells,  obese,  undercover,  Yellow
    3265 Q
    3379 (
    3655 âĢĻ
    3829  unanimously,  miserable,  Fox,  absolute,  Imm,  deepest
    3870  smash,  prick,  learned,  lets,  extend,  imagine
    4061 ),  Pink,  Rich, ali,  Most,  Carolina
    4083  lots,  length, och, af,  corner,  dose
    4090  directed,  Franklin, elson, ek,  Fleming,  Auckland
    4279 ais, St,  La,  Lav,  Gal,  Ost
    4524  absolute,  Little
    4624 af, ath
    5079 och, 12, 9, 8, 6, 10

    Layer 5

    [Dimension] [Top Input Tokens]
    83  penetrate, ensory,  breathe,  bites,  distract, end
    291  fats,  sequences, ats, who,  miracles, isions
    367  deepest,  official,  perfect,  atomic,  presidential, digit
    444 2, 1, 6, 3, 4, 7
    556  Cash,  Hillary, Q,  Bond, go,  Tea
    560 becoming
    567 2, 3, 1, 4, 6, 5
    587 Return,  atomic,  Person,  official,  composed, room
    594  stayed,  although,  lacks, although,  poorer,  It
    646  Be, &,  che,  Che
    674  P,  p
    733  Did,  Can,  Should,  Does,  Was,  Is
    758 Q,  jet
    790  pent,  Miranda,  Middle, St,  Ex,  Pil
    982  accumulated, asses
    985  approximately,  below,  Wait,  lasts,  wait,  7
    1081  several,  seven,  13,  5,  hundred,  six
    1090  Theorem,  founder, root,  Wizard,  root, cil
    1258  deprive,  warn,  invoke,  leaving,  causes,  discovered
    1418  What, what,  what,  nothing,  7, What
    1492  ind, San
    1592 Tar
    1644  F,  ch,  C, H, F,  sc
    1665 pp, abl,  Clock,  ey, erv,  text
    1695 :,  How, aqu, icking,  Ger, )
    1893 extr
    1963 (,  (,  cl,  entr,  November,  Gal
    1996  undergo,  temporary,  high,  permanent,  lacks
    2007  What,  How,  highest, All,  various, icking
    2185  temporary,  permanent,  invoke,  feel,  simply,  uterus
    2262 Ċ, Q, "., '., .", !"
    2340  except,  United,  Great,  visiting,  refers,  Luxem
    2361 aqu, St,  acc,  Be,  Puerto,  mentally
    2445  instantly,  slowly,  should,  could,  immediately,  drank
    2453 izen,  becoming,  Be,  Qu,  NY,  gy
    2495 ayan,  calorie, asses,  powdered,  soft,  accumulated
    2730  prep,  meat,  jail, inner,  Yoga,  Toast
    2755 delicious
    2803  presidential,  national,  Federal,  conservative,  Cod, fly
    2810  laughter,  helium,  Arizona,  atmosphere,  extinct,  lungs
    2858 iss, iv, urop,  gr, isc, ruct
    3001 ?, ?, "?, .?, )?, '.
    3195 ented,  Has,  deprive,  Did, acting,  warn
    3233  analyzing,  When,  receive,  How, :,  Where
    3428  U,  I, I,  O,  s,  i
    3477 rico,  Blake,  Justin,  Albert,  Jeff,  Charles
    3493 1, 2, 3, 4, 6, 5
    3504 Q,  able,  accurately, 2, iele,  refers
    3528  speaks,  visiting,  except,  conquered,  In,  vs
    3564  Person, Return,  end,  rest,  Theorem,  list
    3576  navig, Ins, ais,  Pri,  Priv, icking
    3688  ch,  Ch
    3767 Ċ
    3944 ais,  Tar,  absolute,  Gal, Ins,  Ther
    4072  banned,  outlaw,  cars,  accepted,  originated,  scores
    4085 2, 3, 1, 4, 6, (
    4114  Asian,  oldest,  cultural,  Trans, New,  aster
    4228 work
    4274  No,  comment,  Yes,  unclear,  definite,  conclusive
    4560 ais,  Priv, Ins,  Tar,  aster
    4569 (,  (, Q,  Type,  Gen, Dr
    4597  highly,  openly,  becoming,  necessarily,  Ger,  unusually
    4649  necessarily,  highly,  totally,  entirely,  relatively,  unusually
    4688 Return,  Person,  official,  Theorem,  lots,  rest
    4807 including, .?, El,  Qu, Real,  Mer
    4827  organized, uffs, fed,  meat,  decreased, ella
    4854  Person, âĢĻ,  rest,  Theorem
    4874 phants, ats,  fear,  girls,  Scientists,  pigs
    4937  rabbit,  delicious,  living,  praying,  electric,  official
    4983 fat,  Democratic,  Most,  conservative,  educational,  Ger
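A minimal sketch for loading rows in the layout above into a dictionary. It assumes tokens are joined by a comma plus one space, and that any extra leading space belongs to the token itself (as with the space-prefixed Pythia tokens). `parse_row` and `parse_table` are hypothetical helper names for illustration, not functions from the repository.

```python
def parse_row(line):
    """Parse one table row into (dimension, tokens), or None for non-data lines.

    Assumes tokens are joined by ", " and that any extra leading space
    is part of the token itself (as with space-prefixed tokens).
    """
    parts = line.rstrip("\n").lstrip().split(" ", 1)
    if len(parts) < 2 or not parts[0].isdigit():
        return None  # headers, blank lines, footnote text
    return int(parts[0]), parts[1].split(", ")


def parse_table(text):
    """Map each autoencoder dimension to its list of top input tokens."""
    table = {}
    for line in text.splitlines():
        row = parse_row(line)
        if row is not None:
            dimension, tokens = row
            table[dimension] = tokens
    return table


# A few rows copied from the Layer 4 table above.
sample = """
    Layer 4

    [Dimension] [Top Input Tokens]
    547 Dec
    647 och,  length,  dose
    1607 6, 5, 7, 3, 1, 4
"""
layer_4 = parse_table(sample)
print(layer_4[647])
```

Dimension indices repeat across layers, so under this sketch each layer's table should be parsed separately; otherwise later rows would overwrite earlier ones with the same index.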
  6. ^

    My experience with the bigger models suggests that those other layers might plausibly give better results at different sparsity values. That is, there may be no single best sparsity for all layers of a model.

  7. ^

    Heatmap code courtesy of Alan Cooney's CircuitsVis library.

  8. ^

    Llama-2 7B Autoencoder Data

    Layer 13

    [Dimension] [Top Input Tokens]
    34 ▁All
    109 2, 3, 2004
    120 <s>
    127 ▁England, ▁dollars, ▁Italian
    206 ▁means, ▁refers, ▁composed, ▁learned, ▁hid, ▁she
    207 ▁society, ▁portal, ati, unker, ▁Order, ▁mission
    253 ▁said, ▁wrote, ▁designed, ▁statement, ▁directed, elled
    277 ▁dan, ▁po, ▁dess, ▁Know, ▁conce, ▁Har
    328 <s>
    331 ▁program, ▁intelligence, ▁computer, ▁artificial, I, ▁Rob
    336 ▁foot
    392 ▁nin, ▁did, ris, ▁ugly, ▁differ, ▁por
    416 ▁except, ▁and, aria, ries, ▁Bel, ▁vs
    444 ▁few, ▁Many, ▁Very, ▁unlikely, ▁fewer, ▁Most
    527 ▁high, ▁college, ▁graduated, ▁school, ▁finish, ▁teachers
    629 ▁grown, ▁presented, ▁without, ▁moved
    666 ▁stayed, recogn, ▁consist, ▁same, ▁stay, ▁equally
    667 A
    703 ▁entrepr, XT
    774 ▁diam, ▁ugly, ▁por, ▁vision, ▁artists, ▁news
    820 ▁pushing, ▁hide, ▁hid, ▁lying, ▁telling, ▁inform
    823 ads, won, urus, ▁boys, ▁Sib, ws
    842 ▁particular, ▁Nothing, ▁nothing, ▁happens, ▁happen, ▁anything
    863 ▁happy, ▁prosper, ▁hun, ▁experience, ▁stub, ▁will
    867 ▁levels, ▁accum, ▁blocked, block, ulated, ▁waves
    904 ▁expect, ancy, ▁extend, ▁growth, arter, ▁gain
    941 <s>, <0x0A>
    1114 ▁after, After, ▁August, atra, ▁War, ▁began
    1146 ▁position, ▁rate, ▁link, ▁phrase, ▁sound, ▁purpose
    1200 ▁its, ▁tries, fr, ▁Its, ▁Har, ▁national
    1221 ▁In, ▁in
    1354 ▁planet, ▁solar, ky, ▁Earth, ▁Sol, ▁System
    1408 ▁name, ▁named, ▁called, ▁height, amed, ▁friend
    1522 ▁shorter
    1705 ▁Diet, ken, olate, father, orie, can
    1728 ording, ▁marry, aking, ▁accept, hing, ▁hitting
    1730 rial
    1735 anned, wed, ▁cens, ▁still, ▁remain, ▁ban
    1739 ▁figures
    1787 <s>, Q, 7, 6, 8, ▁Why
    1804 ▁Philadelphia, ▁Paris, HT, ▁ha, ▁Rome, ion
    1834 ▁examples, ▁example, ▁some, ▁Notable, ▁characteristic, ▁cases
    1940 ▁Theorem
    1949 ▁no
    2063 <s>
    2100 pie, ▁single, enta, ▁orange, ▁minute, ruit
    2128 ▁leave, ▁stick, ▁suspect, ▁draw, ▁sees, ▁disturb
    2233 ▁among, ▁case, ▁if, ▁aid, ▁contribute, ails
    2252 ▁speak, ▁wore, ▁am, ▁accept, ▁holding, ▁recommend
    2268 ▁interesting, ▁Person, ▁Time, ▁Year
    2443 ▁hid, ▁film, ▁Grand, Cast, ▁Cost, To
    2455 ▁foot, ters, ▁horn, iums, ▁scales
    2511 rac, hard, umann, ems, hner, fe
    2527 ▁dogs, ▁positive, ▁verte, ▁prime, br, ▁Christians
    2612 ▁Stars
    2648 ▁It, ▁it, ▁They, ▁him, ▁dist, ▁they
    2708 ▁Sm, ▁Video, ▁Crit, ▁Organ, ▁Disc, ▁Le
    2792 ▁eyes, ight, ▁battery, ▁fingers, ▁damage, rain
    2856 ining, ▁stops, ▁always, ▁forever, ▁never, ible
    2976 ▁purchase, ▁obtain, ▁add
    3020 ▁Why, ▁Who, ▁Where, ▁What, ▁Which, ▁How
    3029 ices
    3114 ▁used, ▁crashes, ▁spent
    3227 avia, ▁Dutch, oa, ians, ▁Indians, ests
    3258 ▁phen, FO, ormal, ▁ESP, ition, ▁medium
    3324 ▁well, ▁add, uent, ▁numbers, ▁talk, ▁accomplished
    3342 ▁Nobel, riz, ▁Prize, ure, ▁Theorem, ▁Olympics
    3354 %., cer, ya, ., )., ▁determ
    3490 5, ▁entrepr
    3516 ▁player, ▁greatest, ▁basketball, ▁popular, ▁desert
    3598 ll, ▁ticket, ▁would, ▁license, ▁need, ▁must
    3599 ▁only, ▁located, ▁lets, ▁refuse, ▁contain, Only
    3611 ▁fans, ▁Christians, ▁Only, ▁good, ies, ons
    3671 ▁designed, ▁started, ▁Who, ▁founder, ▁invent, ▁first
    3756 digit, ▁atomic, ▁double, ▁risk, ▁prime, ▁official
    3823 ▁destroyed, ▁ax, pped, ▁cho, ▁lifted, ▁attacked
    3843 ▁restart
    4000 <s>
    4027 ▁particular, ▁happens, ▁happen, ▁ways, ▁aspects, ▁injured
    4061 <s>
    4065 ▁best, ▁favorite, icious, imate, ite, ▁greatest
    4087 ▁eight, ▁five, ▁thirty, ▁several, ▁seven, ▁three
    4106 <s>
    4309 XT
    4426 ▁creation, ▁board, ▁campaign, den, ▁move
    4452 ector, activity, ▁meters, ▁skills, can, una
    4460 ▁as, ▁well
    4478 can, ▁convention, cy, ests, las, ucha
    4483 ▁smaller, ▁larger, ▁rich, ▁Rich, ▁poor, ▁pover
    4573 enda
    4576 ▁right, board, ▁Last, ▁girls, ▁Rem, ▁Er
    4593 ▁(, A, ▁easiest, ▁tells, ▁personally, Q
    4617 <s>, ▁proofs, ▁varied, ▁accessible, ene, ▁distinct
    4671 ▁stayed, ▁keeps, ▁stay, ▁keep, ▁continue, ▁consist
    4743 aked, ▁flat, olen, aged
    4748 ▁seat, at, ▁back, ▁side, ▁lap, ▁bus
    4766 isons, ▁Greece, uto, ▁contribute, ▁twenty, ▁sing
    4908 ▁marry, ▁Your, ▁your, ▁my, ▁My, ▁their
    4981 ▁Person, ▁magic, af, ▁Mal, imal, ▁Notre
    4984 amp, ead, ylvan, itch, ires, ▁drag
    5055 %., '., )., ., "., ."
    5057 <0x0A>, 3, 4, 5
    5171 ▁illegal, ▁legal, ▁ban, ▁allowed, ▁prohib, law
    5309 ▁in, ▁among, ▁In, ▁across, ▁during, wed
    5310 ▁phrase, ▁term, ", word, OS, ingo
    5330 ▁exact, ▁precise, ▁reliable
    5413 ▁foot, s
    5465 :, ▁Is, ▁Are, ▁Does, ▁Was, ▁How
    5517 ▁Sydney, ka, ington, apolis, ▁Chicago
    5557 ▁remain, ▁yours, ▁activities, ▁films, ▁subjects, ▁song
    5565 ▁', ▁", ▁word, E, ▁phrase, but
    5624 ▁planet, ▁systems, ▁potential, ▁unique, ▁similar, ▁phase
    5639 ▁shares, ▁gets, ▁got, ▁smoke, umes, ▁produces
    5687 )
    5704 question, ▁prompt, ▁fact, ▁question, ▁shared, ▁instruction
    5852 ▁doubt, ▁seen, ▁told, ▁sure, ▁shown, ▁personally
    5862 <s>
    5890 och
    5922 ▁a, ▁A, ▁an, An
    5942 <s>
    5968 ▁Why
    6003 key, it, ▁rabb, ▁mouse, ▁husband, aker
    6009 vis, ▁Steve, ary, ▁baby, ▁Boston, ▁Scottish
    6019 ▁Pot, ▁Harry, ▁Row, iz, arts, w
    6066 ▁learned
    6147 ▁Uruguay, ▁Chile, ▁sib, ▁Luxemb, ▁Sib, ▁Pakistan
    6215 ▁There, ▁Nothing, here, ▁no, ▁nothing, ▁Now
    6319 inos, avia, ▁descent, enders, ▁third, ▁budget
    6348 <s>, <0x0A>
    6374 etes, ama, ▁cookies, rio, esa, ▁Light
    6420 ▁The, ▁Break, ▁Si, ▁Sig
    6660 :, ▁Despite
    6661 ▁element, ▁animal, ▁desert, ▁factor, ▁university, ▁sport
    6702 ▁the
    6739 ▁U, ▁Des, ▁Cur, ▁Sy, ▁Diet, ▁fam
    6744 ▁dawn, ey, working, ulf, XT, ▁saf
    6778 ▁mount, ▁identify, ▁specific, ▁onto, ▁predict, ▁let
    6997 ER, ▁Independ, ▁Little, ▁navig, ellow, anst
    7031 ▁Chile, ▁Venezuela, ▁China, ourg, ▁Canad, ▁Switzerland
    7082 ▁Some, ▁some, ▁sometimes, ometimes, ▁kinds, ▁Many
    7094 <s>
    7104 ▁Q
    7130 <s>, round, ▁Goth, ctions, ▁attra, ▁architecture
    7154 we
    7216 ▁varied, rane
    7224 ▁No, ▁non, ▁Every, ▁Near, ▁Non, ▁Last
    7231 working, orf, round, itte, XT, ▁dawn
    7257 ometimes, Only, ▁whole, ought, You, ▁peace
    7271 ▁tin, il, ▁silver, ▁wooden, ▁hat, ▁fo
    7297 ▁Mus, ▁Mun, ▁bos, ▁mus, ▁Lis, ▁Cur
    7306 ▁United, ▁Republic, ▁Council, ▁Middle, ▁Great, Un
    7372 ▁improve, ▁helps, ▁causes, ▁extend, ▁boost, ▁affect
    7381 ▁can, ▁Can, ann, ▁canon, ▁cannot, ▁could
    7415 ▁largest, ▁animal, ▁giant, ▁living, ark, ▁large
    7441 ▁Star, ▁Little, le, ▁Dragon, Tw, AS
    7448 )
    7490 ▁knows, ▁know, ▁knew, ▁agree, ▁admit, ▁learned
    7511 ▁tum, cin, ▁aut, ▁cancer, etes, ism
    7602 ▁composer, ▁unknown, ▁specify, ▁unclear, ▁individual, ▁recorded
    7624 ▁UK, ▁Florida, ▁Bible, ▁US, ▁estimated
    7653 :
    7673 ▁someone, etal, ▁baby, ▁determined, ▁determine, ▁sex
    7716 2, 3
    7787 ▁Sym, ▁Ult, ▁kin, ▁cart, ▁Linear, ▁Ge
    7831 <s>, ▁entrepr, pr, rane, ▁Q, ord
    7833 ▁Orange, father, acre, ust, ye, ▁List
    7883 <s>
    7907 <s>
    7980 DP, ▁terms, ▁per, ▁median, ita, ▁income
    8023 ▁extr, rial, ▁over, ▁origin, ▁root
    8026 ▁Pennsylvania, ▁Carolina, ota, las, ▁Alabama, hner
    8078 ▁am, ▁I, ▁My, m, I, ▁my
    8091 ▁Three
    8095 ▁, ▁$, 9, ▁last, /, ▁War
    8117 ▁without
    8135 ▁countries, ▁cities, ▁country, ▁nation, ▁county, ▁city
    8144 4, ▁four
    8181 ▁pos
    8206 what, ▁what, ▁which, )?, ▁situations, ...
    8221 ▁root, imate
    8227 ▁reflection, ▁stick, ▁while, ically, ▁dropped, ▁inform
    8287 ▁soon
    8314 ▁You, You, ▁Your, ▁They, ▁you, ▁We
    8367 ▁easiest, iest, ▁biggest, ▁largest, ▁favorite, ▁interesting
    8376 ▁depends, ▁corner, ▁distinct, ▁Because
    8481 ▁mention, ▁discuss, ▁use, ▁accept, ▁change, ▁hid
    8484 ▁Q, ▁All, ▁Every, here, ▁Part, ▁Near
    8515 ▁similar, ▁valuable, ied, ▁properties, ▁systems, ▁notable
    8536 ▁November, ▁August, ▁July, ▁pm, /, ▁May
    8547 ▁wall, ror, ▁mirror, ▁beautiful, ▁anymore, ▁Little
    8635 ener, ▁grow, ▁back, ▁reg, ▁grows, ▁two
    8675 XT
    8737 pan, ▁Muslim, ▁Korean, ▁Asian, ▁Lat, ▁Chinese
    8761 ▁smoke, ▁consume, umes, ▁drink, ▁shares, ▁work
    8812 ▁list, ment, ▁Way, ies, ames, ancy
    8842 ▁new
    8877 ?, "?, )?, ?", ▁compared, ▁compare
    8925 atic, edy, ▁reserved, ▁curious, ▁earnest, ▁friendly
    8954 ack, ▁Ob, ardo, ▁Mitt, ▁president, ille
    8960 ▁either, ▁could, ▁may, ▁fall, iety, ▁possibly
    8978 <s>
    9008 ▁United
    9036 <s>
    9069 ▁most, ▁else, ▁least, ▁highest, ▁priority, ▁Most
    9270 XT
    9288 ▁fact, ▁factor, ▁truth, ▁factors, ▁principle, ▁belief
    9384 <s>
    9447 aten, ▁treatment, ▁shows, ▁where, ▁contribute, ▁guarantee
    9487 ▁twenty
    9526 ▁nearly, ▁where
    9535 ulf, ▁cultural, ▁divers, ▁looks, ouses, round
    9546 <s>, 1, 2, ▁(, 3, 4
    9566 aking, ▁rub, hing, ▁tie, ▁touch, ▁disturb
    9592 ▁than, ▁near, qual, ▁require, ▁Among, aller
    9648 ▁six, ▁days, ▁created, ▁gradually, ▁create, ▁Adam
    9660 ▁passenger, ▁produces
    9729 ▁The
    9765 <0x0A>, ▁strik, ▁Chart, aret, ▁mic
    9785 ▁location, ▁ambigu, ▁depends, ▁treated, ▁circumstances, ▁position
    9796 ▁add, ▁extend, ▁shares, ▁numbers, ▁smoke, ▁modify
    9814 ▁helps, ▁turns, ▁determine, ▁soon, ▁showed, ▁cle
    9837 ▁years, ▁minute, ▁year, ▁ten, ▁enough, pm
    9866 ▁Council
    9888 ▁then, ▁welcome, ▁nothing, ▁knock, ▁will, ▁instantly
    9909 ▁hard, ▁worker, ▁harder, ▁effort, ▁efforts, ▁lazy
    9973 ▁outside, ors, ▁weather, ▁out, ▁paths, ▁selected
    9981 ▁Why
    10045 <s>
    10066 ▁England, ▁Great, ▁EU, ▁English, ▁Italian, ▁Britain
    10136 ▁and, ▁or, ▁while
    10138 ▁shown, ▁demonstrated, ▁proven, ▁accepted, ▁confirmed, ▁displayed
    10175 ▁visited, ▁set
    10183 ▁mother, ▁cord, ▁them, ▁they
    10207 inking, ▁moder, ▁quantities, ▁too, ▁dos, ▁consumption
    10213 ▁audience, ▁causes, ▁cause, ▁ru, ▁creates, ▁play
    10253 ling, opy, ten, ▁Bow, iele, ool
    10331 ▁asc, ▁commission, gu, ▁struct, fl, ▁transport
    10348 S, ▁US, ▁USA, .,, ▁States, ▁American
    10458 ights
    10512 ▁visible, ▁jump, ▁sink, ▁lifted, ▁painted, iled
    10519 ▁biggest, ▁highest, ▁largest, ▁smallest, ties, ▁city
    10523 Q, ▁question, ▁questions, q
    10593 ▁Albums, ▁Records, ▁records, ▁Earth, ▁Songs, ▁albums
    10616 ▁dollars, ▁much, qual, ▁year, ▁average
    10639 7, 8, 9, ▁seven, ▁Seven
    10656 ▁located, ▁host, ▁contain, ▁selected, ▁love, ▁spent
    10710 ▁increased, ▁decl, ▁harder, ▁expensive, ▁stayed, ▁less
    10738 ▁video, ▁record, ures, ▁Video, ▁end, ▁substitute
    10796 ▁Sydney, ▁Dublin, ▁Chicago, ▁Toronto, ington, ways
    10798 2
    10866 gate, win, so, XT, uru, ▁Columb
    11069 <s>, ▁Dom, ▁dawn, ▁fran, board, fe
    11083 <s>, ray, <0x0A>, ▁Found, eu, clam
    11120 <s>, qual, all, erves, ▁players, wed
    11218 ▁score, ▁plants, ▁incident, pper, ▁success, market
    11229 ▁tower, ▁diverse, ▁vast, enth, ▁varied, XT
    11251 ▁Burn, ▁burning, ▁burn, une, ▁fortune, ec
    11269 aw, ains, work, uda, ▁Mass, mouth
    11287 Real, XT
    11297 ames, ▁sometimes, ▁great, ment, ▁top, ▁lets
    11302 ▁player, ▁president
    11354 ▁entrepr, pr, rane, ord, ▁able
    11411 <s>
    11549 ▁round, ▁flat, ▁shape, ▁particle, ▁float, ▁forward
    11560 fo, ef, ▁tea, nab, ▁lung, ung
    11584 ▁leader, ▁released, ▁plays, ▁singer, ▁monarch, ▁achieved
    11662 ▁else, ▁anywhere, ▁other, ▁source, ▁places, ▁countries
    11704 <s>
    11827 :, ▁How, ▁Pay, ▁What, ▁Rel, Q
    11841 <s>
    11856 ), :, 3, 4, ▁Yes, ▁No
    11941 ▁stand, ▁stood, ▁stands, ▁refers, ▁refer, ▁mean
    11943 <s>
    11947 ese, MI, Is, ▁pover, ▁ob, ▁inequality
    11970 ▁gives, ▁Their, ▁wore, ▁provides, ▁stood, ▁should
    12038 ▁wall, eth, ▁finger, ror
    12046 ▁yours
    12097 ▁produces, ▁led, ▁directed, ▁wrote, ▁gets, ▁makes
    12156 ▁suffer, ▁suff, ▁damage, ▁experience, ▁receive, ode
    12199 ▁measure, ▁players, ▁cars, ▁oil, ▁results, ades
    12215 ▁shared, )?, ▁composition, ?, ▁characteristic, ▁song
    12357 burg, ija, ellers, ▁Garden, alem, named
    12434 ▁Type, ▁Pow, ▁Bl, ▁Altern, ▁Crit, ▁Sm
    12453 ▁further, ▁feet, ▁closer, ▁or
    12490 ▁twenty, ▁next, ▁tries, night, ▁years, ▁threatened
    12496 ▁sink, rown, ode, ▁shoot, ▁kick, ▁lifted
    12599 ▁required, ▁always, ▁typically, ▁enjoy
    12812 ▁turns, ▁turned, ▁into, ▁new, ▁generate, ▁teach
    12814 ▁Egypt, ▁Austria, plane, ▁River, ▁Africa, ears
    12931 ▁hour, ▁minutes, ▁wait, ▁Wait, ▁before, ▁weeks
    13058 <0x0A>
    13133 ▁among, ▁since, ▁twenty, ▁terms, ▁decl, ▁today
    13152 ▁no
    13201 ▁hitting, ank, itting, ▁child, ▁hit, ▁domestic
    13221 ▁produces, ▁stands, ▁Science, ▁stood, ▁gets, ways
    13327 ▁Q
    13352 ating, iders
    13360 ▁Sig, ▁Claud
    13366 ▁aren, ▁doesn, ▁hasn, ▁isn, of, ▁strik
    13371 ▁relative, ▁forb, ▁subjects, ▁equipment, ▁unusual, ▁brand
    13412 <s>
    13414 ▁cookie, ▁lamp, ▁television, ▁foot, ▁hat, ▁score
    13447 ros, ▁eu, ▁Eu, cs, ▁kr, ▁fran
    13462 ▁rice, ave, omy, ▁passenger, ▁VIII, imming
    13463 ey, ▁Q, LS
    13567 erson, enberg, we
    13644 ▁entrepr, pr, rane, ord
    13682 ▁optimization, ey, <s>, ue
    13701 %.
    13710 ▁bars, ▁hit, ▁partner, ▁gun, ▁defense, ▁purposes
    13745 ▁further, ▁feet, ▁or
    13767 ▁vo, ▁kar, ▁contract, ▁por, ▁ing, ▁ant
    13776 ▁full, ▁perfect, ▁absolute, ▁perfectly, ature, oked
    13779 <s>
    13814 ll, ▁will, ▁would, ▁Will, ▁notice, ▁instantly
    13847 ▁Montreal, ▁Amsterdam, ▁Seattle, ▁Boston, ▁Philadelphia, ▁Virginia
    13867 ▁Science, ▁scientific, ally, ▁Scient, ▁scient, ▁experiments
    13920 question, ▁word, ▁words, ▁once, hand, ▁individual
    14013 ▁Books, ▁records, ▁books, ▁Albums, ▁Records, ▁films
    14154 place, ▁afternoon, ▁evening, ▁corner, ▁outside, ▁lit
    14165 ▁shouldn, ▁acknow, ▁mod
    14216 ▁today
    14222 ▁No
    14307 alt, ril, ina, icole, ardo, ifer
    14393 ries, ▁Books, ▁People, ▁places, ▁group, ips
    14447 ▁pushing, anim
    14462 ▁Fl, ▁AT, ▁Sil, ▁ver, ▁Th, ▁bill
    14622 ▁Yes, ▁No, ▁Nothing, ▁depends, here, ,
    14646 .", '., ▁purposes, ests, cy, ways
    14660 ellow
    14684 ▁circle
    14703 ads, ise, ises, urus, arks, igs
    14708 ▁Theorem, laration, ws, ▁Independ, clam, amental
    14737 ▁bos, ▁grasp, ▁overcome, ▁purpose, ▁am, ▁move
    14767 ▁cig, ar, igare, ▁anymore, ▁watched, ▁Kansas
    14775 1
    14807 <s>
    14823 ▁returns, %, ▁mile, ▁year, ▁every, ▁scores
    14854 ▁Q, <0x0A>, 2, 3, )., )
    14932 ▁Joe, ▁Benjamin, ▁Adolf, ▁Christopher, ▁Larry, ▁Michael
    14974 ▁ideas, iration, ▁insp, ative, ision, ▁cre
    15063 ▁Nick, ▁Pay, ▁Ul, ▁Son, ▁Non, ▁reads
    15067 ▁winter, ▁summer, ▁February, ▁Sunday, ▁afternoon, ▁villa
    15080 ▁modern, ▁buildings, ▁dawn
    15231 ▁Montreal, ▁Indians, ▁Amsterdam, ▁har, icans, ▁Rus
    15237 inf, ▁rain, ining, ▁snow, all, so
    15318 ."
    15350 ▁USA, ▁Video, ▁Records, ▁Sm, ▁Crit, ector
    15438 ▁purpose, vention, ▁invent, ▁origin, ▁evol, ▁precise
    15467 ▁Hill, ▁El, ▁Bern, ▁Fund, amental, ▁Jenn
    15506 iju, nab, itution, ▁Dru, ▁burning, rooms
    15518 och, ky, ▁Notre, ▁Lanc, ess
    15552 ▁Only
    15563 ▁entrepr, pr, rane, ord, ▁able, orf
    15591 ▁been, ▁turns, ▁helps, ▁unsafe, ▁had, ▁spent
    15635 ▁Mount, ▁Saint
    15636 ▁examples, ▁example, ▁characteristic, ▁por, ▁Are, ▁some
    15707 ▁gets, ▁produces, umes, ▁consume, odia, ▁slightly
    15721 ways, xygen, ▁Bush, ▁Columbia, ▁carbon, ▁Jordan
    15751 (, <0x0A>, Q, 5, 6, 4
    15763 ▁won, ▁win, aten
    15835 <s>
    15849 ▁rest, ▁criminal, ▁face, ▁trial, ▁tries, ▁Little
    15864 ▁weeks, ▁across, ▁miles, ▁months, ▁million, ▁drive
    15872 ▁values, ▁prices, ▁rates, ▁costs, comes, ▁price
    15987 ▁involve
    15999 XT, ▁dawn, LS
    16028 ▁university, ▁city, ▁mode, ▁island, ▁Saint
    16029 ▁average, verage, ▁median, ▁approximately, ▁typically, ▁estimated
    16030 ▁event, ▁activities, ▁subjects, icas, ▁trait, ▁date
    16041 ▁Afr, ▁Indians, ▁Spanish, ▁Japanese, ables, ▁Portuguese
    16101 ▁organized, ▁shed, ▁playing, cial, ▁passenger, ographic
    16110 ▁transform, ▁knock, ▁invoke, ▁fall, ▁join, ▁lifted
    16129 A
    16138 <s>
    16219 ▁grown, ▁necessarily, ▁food, ▁bread, ▁consumption, ier
    16413 ▁plants, ables, ▁Asia, ▁veget, pes, ▁science
    16574 ▁finish, ▁graduated, unk, ▁college, ▁school, ▁high
    16578 ▁lines, ▁position, ▁positions
    16617 lam, po, augh, inden, enberg, ait
    16637 ▁letters, ym, ▁word, ▁letter, ▁abbre, ▁phrase
    16649 ▁exact, ▁composition, ▁song, our
    16659 ▁located, ▁selected, ▁further, ▁host, ▁official, ▁contain
    16786 ▁letter, ▁named, ▁phrase, ▁word, ▁type, ▁color
    16798 ici, icy, ili, pper, eds, ▁pe
    16806 ▁dess
    16811 ▁entrepr
    16828 ▁further, ▁closer, ▁feet
    16837 ▁fortune, une, oo, gly, ▁Iron, iger
    16884 FI, ▁Time, ▁Scient
    16907 ▁pover, MI, ▁rib, ▁hours, ▁income, DP
    16973 ▁studied, ▁study, ▁imagine, ▁prep, ▁hard, cis
    17013 ▁(, A, :, 5, 4, 6
    17061 ▁than, ▁Mount, iders, ▁Sam, ▁Bel, ▁San
    17141 ▁independent, ization, ▁joined, ▁conquer, ▁colonial, ony
    17224 ▁(, (, 1
    17232 ▁entrepr, pr, rane, ord, ▁able
    17397 ▁particular, ▁individual, ▁specify, ▁correlation, ▁normally, ▁necessarily
    17466 <s>, uda, ellow, ky, /, fr
    17481 ▁Nothing, ▁nothing, ▁everything, ▁soon
    17621 ▁build, force, ▁en, ▁via, ▁perform, lict
    17683 ices, ice
    17739 ▁if, ▁unless, ▁when, ▁because, iting, ▁case
    17818 ▁among, ▁involve
    17841 ▁because, ▁Because, ▁then, although
    17850 ▁have, ▁Have, ▁I
    17865 <s>
    17916 las, enz, Out, enberg, lain, umann
    17944 ▁resources, alem, LS, ▁varied, ouses, stal
    17968 arter, ▁smart, ▁minds, ▁performance, ent, ▁intellig
    18023 ▁Austria, ▁Wales, sh, ▁differently, ▁they, ▁pay
    18108 erves, all, ▁well, fall, ▁level, ▁big
    18155 <s>, ulf, ▁Q, agger, itan, ▁oldest
    18162 ▁more, ▁fewer, ▁less, ▁lower, ▁bigger, ▁greater
    18199 ym, National, All, ▁abbre, Sh, For
    18207 inc
    18208 ▁If
    18226 ▁try, ▁news, ▁currently, inos, ▁similarly, ▁fans
    18426 unk, ▁terrible, ▁student, ▁teachers, ▁physics, ▁graduated
    18497 ▁lie, ▁lies, ▁lying, ▁false, ▁li, ▁statements
    18530 ▁VIII, omy, ▁passenger
    18844 ▁Light, rio, ▁Cruz, ▁Egypt, ▁aircraft, feld
    18849 ▁What, ▁How, ▁Who, ▁Which, ▁Where, ▁Why
    18873 ▁entrepr, ▁Q, pr, rane, ord, lang
    18903 ys, ▁bu, age, ▁loan, ▁marks, ▁purchase
    18967 ▁top, ▁recent
    19018 <s>, ▁remains, ▁still, ames, ▁Diet, ▁totally
    19039 ▁definite
    19147 ▁How, ▁Where, ▁What, ▁Who, ▁how, ▁Which
    19234 ▁position, ▁lines, ▁pow, ▁positions, der, ▁liquid
    19326 ▁F, ▁C, ▁Bal, ▁RA, ▁Bor, ▁L
    19352 ▁USA, ays, ▁Scottish, ▁approximately, ▁Fund, ley
    19394 ▁Chart, ▁Bl, ▁rein, ▁Deep, ▁Rain, ▁wat
    19457 ▁tells, ▁keeps
    19564 ▁You, ▁I, You, ▁It, ▁We, ▁My
    19588 <s>, ouses, ▁overcome, cket, ▁comedy, ▁fame
    19600 ▁Brit, ▁USA, ▁Americans, ▁Men, ▁Napoleon, ▁Rich
    19604 <s>
    19620 ▁equally, ▁similar, ▁as, ▁well, ▁same, ▁similarly
    19672 enda, ▁Order, ma, ▁yards, ters, ingo
    19717 imal, ay, an, angol, ino, cel
    19741 ▁split, ▁handles, ▁shape, ▁officers, ▁answers, vention
    19799 very, ▁restored, uses, icted, inction, itable
    19838 laimed, ▁ranking, ▁hub, ▁capital, ▁attra, ▁facilities
    19839 ▁Catholic, ▁doctrine
    19947 here, ▁Nothing, ▁nothing, ▁Many, ▁Albums, ▁few
    19953 icks, ▁stuck, ically, ▁stick, lla, raw
    19966 isons
    20119 ort, ▁Bl, ▁Sib, ▁ap, ▁Ste, ▁Brun
    20147 ▁depends
    20184 ▁purpose, ▁easiest, ▁useful, ▁risk
    20293 ▁gen, ▁shares, ▁share, ▁percentage, ells, ▁neur
    20407 ▁third, %, ▁significantly, ▁proportion, ▁percent, ▁budget
    20427 ▁Tal, ib, ovi, ▁Afghan, ▁Pers, ▁Confeder
    20441 ▁released, ▁achieved, ▁leader, ▁studied, ▁gained, ▁later
    20646 ▁further, ▁feet, ▁closer, ▁close, ▁closest, ▁miles
    20659 ear, aring, foot, ▁wear, othing, ▁wrap
    20673 ▁Prize, riz, laimed, ▁star, ▁attempt, ▁professional
    20681 ▁said, ▁Great, ▁started, ▁wrote, ▁part, ▁behind
    20687 ▁winter, ▁February, ▁Sund, ays, ▁breakfast, ▁cold
    20698 ▁Kingdom, K, ▁UK
    20784 <s>
    20800 enn, pan, ni, ▁Muslim, olog, ▁Lat
    20945 ▁entrepr, pr, rane
    21027 ▁construction, ▁development, vention, ▁existence, ▁founder, ▁approach
    21034 <s>
    21231 ties, ▁guns, ates, ▁players, ▁scores, aches
    21236 ▁scheme, agers, ▁then, ▁work, ▁working, ▁running
    21249 ▁new, ▁some
    21250 orney, ▁lawyer, estic, uses, iot, ▁caught
    21328 ▁right, ▁while, ▁non, board, ▁style, ▁Rem
    21463 ▁calling, ▁asking, ▁searching, ▁testing, ▁hot, ▁contact
    21518 ▁The, ▁Their, ▁Your, ▁These, ▁Our, ▁My
    21577 ▁been, ▁seen, ▁sometimes, ▁shown, ▁now, ▁Have
    21682 ▁nin, ▁song, ▁vo, ▁ing
    21683 ax, ional, aged, usion, ▁fict, ▁plot
    21728 <s>
    21851 ▁world, ties, ▁universe, ▁Way, ▁Asia, ▁sky
    21856 XT, ▁dawn
    21867 ▁characteristic, ▁activities, ▁yours
    21947 etal, ▁minute, ▁below, ▁heart, ▁vary, ▁rate
    21963 3, 2, 1, (, 4, ▁produces
    21979 ▁Hun, ▁Per, ▁Tr, ▁Ro, ▁Ban, ▁Ts
    21987 ▁poor, ▁separately, ▁Rich, ▁rich, ▁husband, ▁differently
    22052 ▁Sydney, ▁Columbia, eton, ▁Tr, ▁Manchester, ▁Pr
    22055 )
    22060 arts, ▁this, ▁This, ▁students, ▁These, ule
    22070 <s>, ▁Q
    22073 ?, )?, "?, ?", ▁compared, ▁differ
    22167 ▁entrepr
    22201 ▁sales, ▁trick
    22216 ▁happens, ▁happened, ▁happen, ▁action, ▁occurs, ▁occur
    22242 ▁across, ▁disturb, ▁stick, ▁draw, ▁containing
    22311 ▁dro, ▁ver, ▁pro, ▁bill
    22353 ancy, ▁extend
    22357 ▁ways, ▁suffer, ▁negative, ▁overcome, acles, ▁suffering
    22367 ▁struck
    22388 ▁include, ▁although, ▁-
    22444 ▁thirty, ▁square, ▁twenty, ▁nine, ▁ten, ▁seven
    22457 ▁consume, umes, ▁produces, rank, ▁designed, ▁eat
    22471 ▁figures, ▁element, ▁player
    22624 ▁leader, ▁released, ▁tou, who, ▁achieved, ▁Name
    22722 ▁varied, ▁tower, ▁diverse, ied, ▁stor, ▁cultural
    22730 ▁Mex, ▁Mexican, ▁Mexico, ▁Afr, ▁Puerto, ▁Afghan
    22822 ▁million, ▁evening, ▁AM, ▁billion, time, ▁inches
    22951 ▁exhib, Os, ▁leadership, going, ▁leaders, like
    22967 ▁Fre, ▁Mel, ▁Or, ▁Ald, ▁Abd, ▁Hay
    23119 ▁sand, ▁Christmas, ▁Kansas, ▁Santa, den, pop
    23123 ▁There, ▁there, ▁reliable, ▁currently, ▁various, ▁strong
    23143 2, 3, 2004
    23256 date, <0x0A>, number, )., ., ▁greater
    23307 ▁incident, ▁Way
    23440 nake, ras, urn, ▁Year, ▁Lib, ▁Sat
    23604 <s>, <0x0A>
    23609 ▁happens, ▁happen, ▁contribute, ▁ban, ▁factors, aten
    23629 ears, igs, ogs, rows, ▁Fox, xes
    23630 A, 5
    23661 ▁(
    23725 ▁luck, ucky, ▁sorrow, ▁prosper, ▁visitors, ▁welcome
    23765 ▁mod
    23790 ▁extr, tras, ▁verte, ▁prec, ▁extend, ▁vide
    23807 ey, igger
    23850 uten
    23933 ▁Why, ▁Way, ▁What, ▁Who, ▁How, ▁Which
    24077 ▁Time, ▁selected, ▁Person, ▁list, FI, icious
    24086 ▁occurs
    24141 ▁humans, ▁human, ▁male, human, ▁mascul, ▁professional
    24204 anned, wed
    24283 ▁feet
    24326 ▁agree, ▁definite
    24329 ▁exc, ▁talk, ▁cle, ▁rule, ▁determine, ▁check
    24355 ▁cry, ▁sad, ▁died, ▁sorrow, ▁die, ▁laugh
    24367 <s>
    24458 <s>, pl, ▁Mars, ▁Circ, cc, mar
    24460 alt, ▁Black, ▁Deep, ack, icole, ▁Rain
    24471 ▁varied, enth, ▁valuable, ▁TV
    24522 ▁letter, C, ▁contain, ▁', ▁letters
    24664 ▁low, ▁left, ▁Bl, ▁port, ▁Low, ▁boys
    24719 ▁Only, Only, ▁required, ▁only, ▁allowed, ▁need
    24746 ▁showing, ▁shows, ▁That, ▁suggests, ▁that, ▁showed
    24909 ▁Name, ▁title, ▁name, ▁Last, ▁named, ▁called
    24926 <s>
    24969 ▁orange, enta, ▁blue, ▁red, ▁yellow, ▁Black
    24994 ▁Nobel, ▁Prize, ▁Nations, ▁won, ▁EU, amental
    24996 A, af, ▁Am, imal, 3, ▁mod
    25107 ▁Americans, avia, ▁descent, ians, ▁USA, ▁Dutch
    25132 ▁vs, ▁differently, ▁compared, ▁greater, all, ▁per
    25134 ▁I, ▁My, I, ▁personally
    25204 ▁containing, ▁playing, oked, ▁your, ▁electric, ▁'
    25267 ▁requires, ▁variable, ▁attributed
    25336 ▁north, pie, ▁Building, pm
    25346 ▁Rich, olog, cover, ▁Brit, pan, ▁Fire
    25494 ., .", ible, ▁device, %., !
    25582 ▁All, ▁everyone, anim, ▁Every, ▁all, ▁always
    25613 ▁If
    25640 ▁contain, ▁gone
    25676 ▁called, ▁stood, ▁stands, elled, ▁connected, ▁comes
    25793 .", '., "., apy, ▁device, ada
    25838 ▁University, ▁Airlines, ▁Burg, rand, ▁City, ▁university
    25896 ▁kept, edy, ▁trick, sters, ▁confident, ▁gre
    25969 ishes, laimed, ▁accomplished, fs, airs, ▁cook
    26119 ▁entrepr
    26164 ▁letter, ▁located, ▁mile, digit, ▁double, ▁host
    26235 ▁long, ▁length
    26266 ▁USA, ▁Kingdom, ▁Pennsylvania, ▁States, ▁Poland, ▁Israel
    26339 ▁only, Only, ▁Only
    26351 ▁does, ▁do, ▁Do, ▁Does, ▁did, ▁Dor
    26400 Q, A, :, ▁doesn, ▁Q, ▁hasn
    26424 po, pshire, ▁Jose, aven, ▁Luis, yth
    26429 co, ina, ril, ifer, ardo, icole
    26544 fr, ming, ▁war, ▁climate, ▁global, imal
    26571 ▁time, ▁among, ▁histor, ▁ancient, ges, ▁gradually
    26578 <s>
    26635 ▁you, ▁stick, ▁leave, ▁walk, ▁put, ▁draw
    26649 ▁There, ▁I, ▁It, ▁She, ▁We, ▁They
    26706 ▁creates, ▁will, ▁experience, ▁receive, ▁welcome, ▁determine
    26746 ▁rice, ▁fan, allow, lla, ▁cookie, ave
    26848 ▁town, ▁road, ▁miles, ▁country, ▁club, ▁side
    26862 ▁ambigu, ", ▁stands, ous
    26922 round, ▁Democratic, ▁historic, ▁educational, ▁solo, ouses
    26998 <s>, <0x0A>
    27095 ▁husband
    27187 ▁Science
    27218 ▁lots, ▁slightly, apes, ▁spect, ▁daily, ▁historic
    27234 qual, ancy, ifies, ays, erves, all
    27240 ▁real, ▁exist, Real, ▁happening, ▁Real, ▁true
    27245 ▁What, ▁How, ▁Which, ▁Why, ▁what
    27267 <s>
    27386 key, ino, it, ▁rabb, aro, el
    27415 If, ▁If, ▁if, ▁unless
    27485 1
    27558 ▁non
    27616 ce, ▁Joy, ss, ▁Allen, ▁Or, ▁Hem
    27628 able
    27644 ▁Asian, ni, ▁Austral, ▁Kore, ▁Asia, ▁Austria
    27676 ▁loan
    27710 ▁Any, ▁inf, ▁refer, ▁uses, ▁individuals, Re
    27755 ▁viewed, ▁feet, ▁threatened, ▁stop, ▁across, ▁disturb
    27802 ▁you, ▁they, ▁viewed, ▁should, ▁go, ▁she
    27864 ym, ▁correl, ▁related, ▁establish, ▁vary, ▁ac
    27888 action, ▁comment
    27944 ▁because, ▁powerful, vement
    27956 inton, ley, k, ner, ▁Trump, ▁Pres
    27968 elling
    27992 izers, ▁strong, orous, ▁leading, ▁substitute, ▁or
    28012 <s>, num, ▁entrepr, ▁Belg, ouses, cc
    28106 ▁alternative
    28217 ▁Where
    28239 <s>
    28242 LS
    28248 <0x0A>
    28435 ▁gained, ▁get, ▁got, ▁gets, ▁won, ▁getting
    28441 <0x0A>
    28467 ▁Fire, ▁Iron, ▁Black, ▁Rich, oo, ▁Rock
    28497 ▁Zealand
    28501 ▁since, ▁moment, ▁after, ▁past, ▁November, ▁recent
    28518 <0x0A>
    28524 free, organ, Out, ic, unct, iki
    28546 ▁eight, ▁five
    28616 ▁afford, ability, able, ▁expensive, roll, ys
    28648 ating
    28672 olen, ference, ud, ged, ▁fra, ▁rig
    28678 ▁Dom, opy, stal, ey, sf, esp
    28701 ▁Yes, ▁No
    28758 <s>
    28790 ▁entrepr, pr, rane
    28802 <s>
    28876 ▁strik, aret, ▁lov, ▁prem, ▁cas, ▁amb
    28953 C, digit, ▁risk, ▁letter, ▁dollars, ▁bias
    28991 ▁NY, ▁York, ▁Los, ▁Angeles, Los, New
    29067 ▁extend, ancy, ifies
    29168 ▁entrepr, rane, pr
    29218 ▁due, ▁Because, ▁because, ▁refers, ▁composed, ▁comm
    29256 ▁lamp, ▁bed, ▁bird, pie, ▁Building, ▁north
    29283 ▁add
    29307 anim, ▁All, ▁perfectly, ▁nearly, ▁all, ▁guarantee
    29362 ▁than, ▁among, ▁little, ▁since, ▁else
    29389 ▁All, ▁third, ▁Most, ▁some, ▁all, ▁majority
    29446 A
    29510 ▁rising, ▁rise, ▁value, ▁up, ▁ranking, ▁stock
    29542 ices, ▁goods, comes
    29573 ▁requirements, ▁correlation, ▁specify, ▁no, ▁capable, ▁not
    29656 !
    29683 <s>, <0x0A>
    29702 ▁, ▁$
    29715 ag, ▁Patri, ▁So, elt, &, ige
    29742 ▁Who, ▁Where, ▁date, ▁handles, ▁age, ▁event
    29754 ▁among, ▁target, ▁avoided, ▁For, ▁against, ▁specific
    29767 ▁Are, ▁Can, ▁Was, ▁Is, ▁Does, ▁Did
    29803 ▁independent, ▁efforts, inct, ▁weak, issues, ▁clouds
    29924 <s>
    29936 <s>, ▁Zealand, ▁Netherlands, ▁Florida, ▁Singapore, ▁Australia
    30006 bra, ined, ▁domin, blo, ▁brain, ▁bra
    30022 odia, ▁Bulg, ▁Hong, ▁Poland, ▁Camb, ait
    30123 ial, ▁election, ▁pres, ▁president
    30153 ,, ▁-
    30193 ▁cycles, ▁experience, ▁receive, ▁revert, ▁sync, stru
    30231 )
    30382 ▁next, ▁future, ▁last, ▁previous, ▁Future, ▁current
    30386 <s>
    30462 ▁Baby, ▁Sib, ▁Bl, ▁Organ, ▁Newton, ▁Crit
    30516 ▁n, ▁sk, ▁l, ▁po, ▁s, ▁y
    30523 ▁no
    30579 ▁list, fr, ▁while, ▁Orange, LS, ▁who
    30599 ▁official, ▁letter, ▁contain, ▁host, ▁located, ▁navig
    30608 ining
    30799 XT
    30927 ▁precise, ▁Because, ▁Five, ▁For, ▁These, ▁Far
    31031 ▁legal, ▁illegal, riminal, wed, anned, ▁allowed
    31120 5, ▁Five, ▁five, ▁percentage, ▁min
    31199 <s>
    31253 ▁big, ▁sometimes, ▁double, ▁mod, anim, ▁personally
    31254 ▁compared, ▁vs
    31263 ▁factors, ▁greater, date, ▁substitute, ▁marks, ▁variable
    31291 aking, <s>, ▁mention, ▁imagine, ▁buy, gly
    31337 ▁navig, able, itable, hab, ▁tender, ▁Har
    31381 <s>, A, <0x0A>, '., (, .
    31441 ▁today, ▁twenty, ▁now, ▁Now, ▁here, ▁thirty
    31585 although, reason, ▁unless, ▁provided, ▁although, ▁except
    31762 ▁star, ▁Story, ▁face, ▁Ryan, ▁Ra, Fri
    31817 ▁last, ▁value, ▁returns, ▁gone, ▁years, ▁every
    31862 ▁teach, ▁charge, ▁draw, ▁clean, ▁tie, ▁count
    31892 ▁she, bian, ▁recently, ▁means, ▁Because, ▁experienced
    31910 <s>, <0x0A>, ?", ▁prominent, ▁linked, ied
    31948 ▁mirror, raw, ▁backwards, ▁arms, um, lla
    31981 ▁No, ▁Yes
    31984 ▁Its, ▁Her, ▁His, ▁Has, ries, ▁She
    32001 <s>
    32050 ▁Which, ▁which, ▁This, ▁each, ▁various, ▁specific
    32128 XT
    32145 ▁lives, ▁Drive, ▁beneath, ▁live, ▁Street, ▁Baker
    32223 ▁said, ▁", ▁reads, ▁saying, ▁That, ▁says
    32330 ▁The
    32436 ▁entirely, ▁equally, ▁kinds, idents, ▁only, ▁all
    32448 ▁tie, ▁rub, ▁sees, aking, ▁wear, ▁touch
    32458 ▁sure, ▁doubt, ▁shared, ▁conclude, ▁established, ▁differ
    32487 <s>, '., <0x0A>, %., ▁produces, ▁occurs
    32531 ▁should, ▁shouldn, ▁need, ▁seek, ▁Should, ▁required
    32548 ▁incident, ties, market, ▁Egypt, apping, pper
    32549 hing, ▁entirely, ▁simply, acks, ▁capable, ▁spiritual
    32551 <s>
    32590 ▁police, ▁cop, ops, oss, utor, actor
    32678 ▁Chicago, ▁Houston, ▁Pennsylvania, ▁Toronto, ▁Miami, ▁Jersey
    32835 ▁face, cy, ▁file, ▁trial, ▁criminal, ▁charges
    32852 <0x0A>, 2, 3, ▁How, ▁What, 4
    32990 ▁Amsterdam, ▁Philadelphia, ▁Paris, ▁York, ▁har, ettes
    33047 <s>
    33058 ▁onto
    33112 <s>, <0x0A>
    33118 ▁American, ▁basketball, ▁European, ▁Jewish, ▁living, ▁Federal
    33179 ▁formed, ▁moved
    33191 ▁not, t, ▁cannot, not, ▁never, like
    33228 ▁fam
    33238 ▁increased, ▁gone, ▁rise, ▁stayed, ▁decrease, ▁rising
    33252 ▁Mil, pr, ▁exer, ▁came, ord, ▁Tw
    33291 ▁who, ▁where, who, ▁that, ▁containing
    33316 %., '.
    33332 ▁turns, ▁helps
    33361 ▁still, ▁remains, ▁currently, ▁now, ▁originally, ▁current
    33385 ▁helps, ▁unsafe, ▁occurs, ▁decrease, ▁turns, ▁moves
    33417 rest, ien, craft, iens, FO, cer
    33490 ▁disag, ree, ▁win, ▁variable, ▁depends, ▁fict
    33524 ▁suic, ▁streets, ▁ran, icked, ▁nest, ▁jump
    33607 acc, ▁divor, feed, rupt, ▁abort, MR
    33642 ▁again, ▁expecting, ▁results, ▁thing, ▁doing, ▁fear
    33658 ▁restored, zy, cer, ▁ticket, ▁Jack, ppets
    33727 erson, iro, ▁Franklin, ela, umann, mart
    33832 ▁owner
    33876 eton, ale, ▁Columbia, ▁Harvard, keley, inc
    33902 ▁than, date, number, aten, ▁beat, ▁vs
    33988 ▁vs, ▁compared, ▁than
    33996 ▁declared, ▁king, ▁rule, ▁kingdom, ▁considered, ▁prince
    34063 4, here, ▁Blood, ▁Wait, ▁Blo, ▁No
    34075 ▁playing, ▁involve, ▁shed, ▁per, ▁total, ▁tries
    34111 <s>, <0x0A>
    34136 eth, orney, ▁lawyer, ▁television, ▁mention, ▁cookie
    34159 aller
    34190 ,
    34245 ▁Are, ▁Have, ▁Was, ▁Does, ▁Did, ▁Do
    34267 <s>, <0x0A>
    34355 ▁formed, ▁moved
    34488 ▁cousin, ▁relative, ▁sib, ▁marriage, ▁grand, lings
    34504 licate, ▁establish, ▁experiments, ▁method, ▁rep, ▁showed
    34550 ▁passenger, imming, omy, ▁metric, ades, ime
    34565 (
    34600 ▁Part, ▁wave, ▁stretch, ▁idea, ▁den, ▁part
    34605 aten, ▁fatal
    34656 ▁easiest, ▁smallest, ▁closest, iest, ▁favorite, est
    34776 ▁by, ▁By, ▁via, by, ▁using
    34847 )
    34885 ▁Poland, ▁Pennsylvania, ▁Israel, ▁Brazil, ▁Oregon, ▁Jersey
    34891 1
    34931 ▁are, ▁were, ▁was, ▁is, ▁am, ▁entirely
    34955 ▁If, If, ▁if, ▁Because, ▁When, ▁unless
    34961 <s>
    34972 keys, ige, ▁Tennis, ▁Butler, ▁Arizona, ▁carbon
    35030 ests, ka, ole, au, oa, apolis
    35073 ices, adors, aking, utor, oth, ▁pos
    35170 <s>
    35190 FI, ister, ▁Time, amental, dy, af
    35197 <s>, <0x0A>, ▁reserved, ▁marriage, HD, ▁representative
    35222 rate
    35270 ▁drive, ▁driving, ▁vote, ▁purchase, UI, ▁marry
    35282 iger, ellow
    35319 ▁mind, ▁thinking, ▁composed, ▁changed, ▁learned, ▁ideas
    35355 ▁Cl, inton, ▁Pitt, ▁Moore, ▁Campbell, immer
    35512 ▁between, ▁distinction, ▁mixed, ▁Among, ▁behind, ▁change
    35637 inos
    35643 ▁reflection
    35697 <s>
    35726 ▁cool, der, ▁shorter, ▁mil, ▁smaller, mer
    35727 orney, anned, wed, ▁lawyer, ▁illegal, ▁allowed
    35813 ▁Q, ▁Lanc, ▁Spart, ▁nearly, ▁sand, ▁Fre
    35815 ▁believe, ▁seen, ▁learned, ▁knows, ▁admit, ▁suspect
    35823 ▁November, night, ▁Sund, /, ellow, oles
    35880 ., %., .", )., "., .
    35896 here, ▁Nothing, ▁nothing, ▁no, ▁anything, ▁comment
    35903 ▁evolution, vement, aked, ▁Order, ▁God, ▁controlled
    36029 ▁of, ▁processes, ▁costs, ▁Of
    36054 ▁anyone, ▁individuals, ▁owner, ▁carry, ▁tries, ▁holder
    36078 <s>
    36114 ▁Greek
    36119 ▁involve, ▁becoming, ▁being, ▁be, ▁identify, ▁represent
    36240 XT
    36340 All, ▁That
    36360 ▁over, ▁Over, ▁since, ▁stayed, ▁among
    36361 1
    36450 ▁days, ▁hours, ▁week, ▁longer
    36498 <s>
    36564 some, ively, elling, ously, ately, ▁great
    36600 <s>
    36660 ▁parents, ents
    36677 ▁occurs, ▁holding, ▁About, ▁built, ▁near
    36705 ▁eu, ▁Eu, ▁verte, ▁ap, ▁Lanc, ▁AT
    36806 ▁follow, ▁pushing, ▁share, ▁visited, ▁speak, ▁treated
    36818 <s>
    36927 ., ▁Hun, ▁Ro, eld, ▁Per, ▁Rum
    36955 ▁president, iden, ▁election
    36993 ▁reflection
    37024 ▁turns, ▁Part, ▁sometimes, ▁where, ▁led, ▁Mother
    37027 <0x0A>, ▁hasn, ▁aren, ▁isn
    37090 rew, ▁spoken, ▁Portuguese, ▁speak, ▁language, ▁Spanish
    37103 ▁aircraft, ils, ▁sky, ▁left, plane, ▁liquid
    37155 <s>
    37178 /
    37331 /, ▁attacks, ▁attempt, eda, ▁terror, ▁attacked
    37357 ▁returns, ▁Building, ▁bed, ▁television, lla, ▁pushing
    37405 <s>
    37454 <s>, <0x0A>
    37656 ll, re, s, m, ▁lots, ▁grad
    37722 ▁All, ▁Rain, ▁Only, ▁stretch, ▁Type, ▁mention
    37829 aten, ▁near, ▁onto, ▁among, ▁beat, ▁against
    37836 <s>
    37876 ▁Seven
    37879 ▁attacks, ▁cars
    37889 4, 5, 3, 6, 7, 8
    37957 ▁evidence, ▁demonstrate, ▁suggests, ▁shows, ▁showing, ▁weak
    37981 :
    38011 ▁teacher, ▁Hero, ▁Arizona, zen, ige, ▁reserved
    38041 ▁allows
    38070 ▁earlier, date, ▁useful, ▁faster, ▁win, ▁domin
    38076 ▁formed, ▁comes, ▁originally, ▁began, ▁origin, ▁unknown
    38141 6, 5, 7, 4, 3, 8
    38188 men
    38222 ved
    38245 ▁least, ▁mile, ▁square, ▁approximately, ▁thirty, ▁below
    38249 ▁With, ▁During, ▁By, ▁In, ▁On, ▁Among
    38297 well, burg, ell, mann, alem, bla
    38394 ▁comment, ▁unclear, ▁unknown, ▁depends, ▁specify, ▁ambigu
    38411 <s>
    38427 ▁fans, inos, ▁dogs, ▁artists, ▁news, ▁Jews
    38443 ze, ▁sink, rown, itable, ▁float, iled
    38513 )
    38546 ▁entrepr, pr, rane
    38609 ▁holder, ▁owner, ▁blood, ▁HT, ▁Blood, ▁type
    38621 ▁pow, der
    38632 ▁navig, ▁Tw, ▁expect, za, ▁ash, ▁cas
    38643 <s>
    38658 ▁Yes
    38681 ▁in, ▁crit, ▁inside, ▁Natural, here
    38765 ▁Way, S, ▁world, ▁city, ties
    38832 cs, ▁Greece, ▁Uruguay, ray, ▁Holland, ▁Argentina
    38841 <s>
    38849 ▁marks, ▁phase
    38865 ▁mode, ▁university, ▁Name, ▁city, ▁team, elf
    38867 ▁millions, ▁weeks, ▁orange, ▁injured, ▁wounded, ible
    38894 ▁ago, 9, ▁since, hood, ges, ieval
    39033 rows, keys, ds, pes, eras, ads
    39049 A
    39169 athol, ▁Afr, ali, ▁Bulg, ▁Polish, ▁Italian
    39188 ▁People, ▁people, Men, ▁populated, ▁someone, ▁population
    39346 edy
    39414 XT
    39489 3, 2, 4, 5, 6, Q
    39492 ▁Wel, sh, ael, rew, ▁Scottish, ▁Heb
    39545 rial, rest, ▁extr, osex, ▁prime, ▁ment
    39629 ▁Steve, vis, ▁Bern, ▁Donald, ary, ▁Hill
    39635 <0x0A>
    39741 )., ., '., .", %., ".
    39763 :, ▁(, A, ▁shares, ▁remember, ancy
    39804 ▁Associ, ▁Form, ▁Rel, ▁With, ▁During, ▁Pay
    39843 <s>
    39880 ▁moon, rin, ▁land, ▁Space, strong, ▁landing
    39922 ▁University
    40019 pected, ▁recent, ▁current, ▁election, ▁expedition, ▁purchase
    40031 ?", "?, "., ',, '.
    40053 ▁Zealand, ▁Netherlands, ▁Spain, otion, ▁fingers, ▁Sports
    40055 ▁Burg, ▁pos, rapper, ▁Dom, nt, ey
    40063 acc, MR, estic, ination, iot, ▁abort
    40101 ▁rise, ▁keeps, ▁continue, ▁keep, ▁rising, ▁going
    40220 ▁Yes, ▁No
    40237 ros, ▁Manh, ▁eu, ▁Venezuela, ▁Chile, ▁Eu
    40253 ▁disag, ree, ▁agree, anim, ▁differ, ▁distinction
    40289 (
    40339 ▁harder, ▁stayed, ▁got, ▁consist, ▁become, ▁became
    40381 .,, ',, ,, ▁so, ▁then, ▁requires
    40410 ▁your, ▁my, ▁yours, ▁us, ▁Your, ▁husband
    40444 ▁originally
    40474 ▁average, verage, ita, ▁total, ▁per
    40537 rian, ▁Aust, ▁Australian, ▁European, ▁Scottish, otion
    40608 ▁tea, ▁coffee, ef, od, ▁guns, ations
    40645 )?, ?", "?, ?, ▁tou, ',
    40670 <s>
    40740 ▁Zealand, ▁Australia, ▁Canada, ▁Netherlands, ▁Singapore, ▁Britain
    40858 ▁entrepr, pr, rane
    40860 ▁named, ▁born, ▁contract, ▁inside, ▁through, ▁in
    40947 4, 5, 3, 6, 7, 2
    40950 ▁earlier
  9. ^

    I've noticed that as you push sparsity too low on GPT-2 or Llama-2 7B autoencoders, the autoencoders tend to increasingly fixate on particular tokens. With GPT-2, that token happens to be esthetic. With Llama-2 7B, the token is <s> (the beginning-of-sequence special token).

    As an example, this .csv contains logged results for a Llama-2 7B layer 7 autoencoder with .
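
One rough way to put a number on this kind of fixation, given top-token rows in the format listed above, is to count the share of listed dimensions whose top tokens are exactly one token. This is only a sketch, not the analysis used for this post: the helper names `parse_row` and `token_fixation` are hypothetical, and the sample rows are copied from the listing above.

```python
def parse_row(line):
    """Split a row like '28497 ▁Zealand' into (28497, ['▁Zealand'])."""
    index, _, rest = line.strip().partition(" ")
    # Tokens are separated by ', ', so a bare ',' token survives the split.
    return int(index), rest.split(", ")


def token_fixation(rows, token="<s>"):
    """Fraction of listed dimensions whose top tokens are exactly [token]."""
    parsed = [parse_row(row) for row in rows if row.strip()]
    fixated = sum(1 for _, tokens in parsed if tokens == [token])
    return fixated / len(parsed)


# Sample rows copied from the listing above.
rows = [
    "    28758 <s>",
    "    28790 ▁entrepr, pr, rane",
    "    28802 <s>",
    "    29683 <s>, <0x0A>",
    "    30153 ,, ▁-",
]

print(token_fixation(rows))  # 0.4
```

Rows where `<s>` appears alongside other tokens are deliberately not counted here; loosening that to "contains `<s>`" is an equally defensible choice.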

7 comments

comment by LawrenceC (LawChan) · 2023-09-23T19:19:14.855Z · LW(p) · GW(p)

We train such an autoencoder to convergence, driving towards an 

This is a typo, right? It should say L^1.

Replies from: David Udell
comment by David Udell · 2023-09-23T19:20:15.673Z · LW(p) · GW(p)

No, towards an L^0 value. L^1 is the training proxy for that, though.

Replies from: LawChan
comment by LawrenceC (LawChan) · 2023-09-23T19:21:48.098Z · LW(p) · GW(p)

Oh, okay, makes sense.

comment by Aidan Ewart (baidicoot) · 2023-09-27T17:20:28.297Z · LW(p) · GW(p)

Hi David, co-author of the 'Sparse Autoencoders Find Highly Interpretable Directions in Language Models [LW · GW]' paper here,
I think this might be of interest to you:
We are currently in the process of re-framing section 4 of the paper to focus more on model steering & activation editing; in line with what you hypothesise, we find that editing a small number of relevant features on e.g. the IOI task can steer the model from its predictions on one token to its predictions on a counterfactual token.

comment by Charlie Steiner · 2023-09-23T23:22:47.624Z · LW(p) · GW(p)

I'm not very enlightened by what tokens most excite the component directions in a vacuum. Interpreting text models is hard.

Maybe something like network dissection could work? What I'd want is a dataset of text samples labeled by properties that you want to find features to track.

E.g. suppose you want features that track "calm text" vs. "upset text." Then you want each snippet labeled as either calm or upset - or even better, you could collect a squiggly curve for how "calm" vs. "upset" labelers think the text is around any given token (maybe by showing them shorter snippets and then combining them into longer ones, or maybe by giving them a UI that lets them change levels of different features as changes happen in the text). And then you look for features that track that coarse-grained property of the text - that vary on a long timescale, in ways correlated with the variation of how calm/upset the text seems to humans.

And then you do that for a dozen or a gross long-term properties of text you think you might find features of.
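
A minimal sketch of the correlation step described above, assuming you already have per-token feature activations and a per-token human label curve. Everything here is illustrative: `smooth` and `rank_features_by_label` are hypothetical names, the moving-average window stands in for "long timescale," and the data is synthetic.

```python
import numpy as np


def smooth(x, window):
    """Moving average along the token axis (axis 0), same length as input."""
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, x
    )


def rank_features_by_label(activations, labels, window=8):
    """Correlate smoothed feature activations with a per-token label curve.

    activations: (n_tokens, n_features) array; labels: (n_tokens,) array.
    Returns (feature indices sorted by |Pearson r|, the r value per feature).
    """
    acts = smooth(activations, window)
    acts = acts - acts.mean(axis=0)
    labs = labels - labels.mean()
    denom = np.linalg.norm(acts, axis=0) * np.linalg.norm(labs)
    r = (acts.T @ labs) / np.where(denom == 0, 1.0, denom)
    return np.argsort(-np.abs(r)), r


# Synthetic stand-in data: a slow calm (0) / upset (1) square wave,
# with feature 3 constructed to track it and the rest pure noise.
rng = np.random.default_rng(0)
labels = (np.arange(400) // 50 % 2).astype(float)
activations = rng.normal(size=(400, 6))
activations[:, 3] += 3 * labels

order, r = rank_features_by_label(activations, labels)
print(order[0])
```

A high correlation only says the feature co-varies with the label on this dataset; it does not by itself say what the feature is doing.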

Replies from: David Udell
comment by David Udell · 2023-09-26T00:47:17.588Z · LW(p) · GW(p)

I agree that stronger, more nuanced interpretability techniques should tell you more. But, when you see something like, e.g.,

25132 ▁vs, ▁differently, ▁compared, ▁greater, all, ▁per
25134 ▁I, ▁My, I, ▁personally

isn't it pretty obvious what those two autoencoder neurons were each doing?

Replies from: Charlie Steiner
comment by Charlie Steiner · 2023-09-26T01:23:33.212Z · LW(p) · GW(p)

It does seem obvious[1], but I think this can easily be misleading. Are these activation directions always looking for these tokens regardless of context, or are they detecting the human-obvious theme they seem to be gesturing towards, or are they playing a more complicated functional role that merely happens to be activated by those tokens in the first position?

E.g. Is the "▁vs, ▁differently, ▁compared" direction just a brute detector for those tokens? Or is it a more general detector for comparison and counting that would have rich but still human-obvious behavior on longer snippets? Or is it part of a circuit that needs to detect comparison words but is actually doing something totally different like completing discussions about shopping lists?
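
One cheap check on the first of these possibilities, sketched under the assumption that you have a feature's activation at every token position along with the token ids: measure how much of the activation's variance is explained by token identity alone. A value near 1 is what a brute token detector would give; a low value means context matters. The function name `token_identity_r2` and the toy data are hypothetical.

```python
import numpy as np


def token_identity_r2(token_ids, activation):
    """Share of a feature's activation variance explained by token identity."""
    token_ids = np.asarray(token_ids)
    activation = np.asarray(activation, dtype=float)
    total = np.sum((activation - activation.mean()) ** 2)
    if total == 0:
        return 1.0  # a constant feature is trivially "explained"
    # Predict each activation by the mean activation of its token.
    _, inverse = np.unique(token_ids, return_inverse=True)
    sums = np.bincount(inverse, weights=activation)
    counts = np.bincount(inverse)
    predicted = (sums / counts)[inverse]
    residual = np.sum((activation - predicted) ** 2)
    return 1.0 - residual / total


token_ids = [0, 0, 1, 1, 2, 2]
brute = [1, 1, 0, 0, 0, 0]        # fires on token 0, always
contextual = [1, 0, 1, 0, 1, 0]   # fires regardless of which token it is

print(token_identity_r2(token_ids, brute))
print(token_identity_r2(token_ids, contextual))
```

This cannot separate the second and third possibilities from each other; it only flags when "brute detector" is a sufficient description.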

  1. ^

     certainly more so than 

    31892 ▁she, bian, ▁recently, ▁means, ▁Because, ▁experienced