Posts

JumpReLU SAEs + Early Access to Gemma 2 SAEs 2024-07-19T16:10:54.664Z
Improving Dictionary Learning with Gated Sparse Autoencoders 2024-04-25T18:43:47.003Z
[Full Post] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:59.185Z
[Summary] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:17.755Z
AtP*: An efficient and scalable method for localizing LLM behaviour to components 2024-03-18T17:28:37.513Z
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla 2023-07-20T10:50:58.611Z
A Mechanistic Interpretability Analysis of Grokking 2022-08-15T02:41:36.245Z
Investigating causal understanding in LLMs 2022-06-14T13:57:59.430Z
Thoughts on Formalizing Composition 2022-06-07T07:51:21.199Z
Understanding the tensor product formulation in Transformer Circuits 2021-12-24T18:05:53.697Z
How should my timelines influence my career choice? 2021-08-03T10:14:33.722Z

Comments

Comment by Tom Lieberum (Frederik) on JumpReLU SAEs + Early Access to Gemma 2 SAEs · 2024-07-25T15:56:45.782Z · LW · GW

We use 1024, though often article snippets are shorter than that so they are separated by BOS.

Comment by Tom Lieberum (Frederik) on Decomposing the QK circuit with Bilinear Sparse Dictionary Learning · 2024-07-03T15:40:02.625Z · LW · GW

Cool work!

Did you run an ablation on the auxiliary losses for  and  , how important was that to stabilize training?

Did you compare to training separate Q and K SAEs via typical reconstruction loss? Would be cool to see a side-by-side comparison, i.e. how large the benefit of this scheme is. 

Comment by Tom Lieberum (Frederik) on Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla · 2023-07-21T11:39:38.433Z · LW · GW

During parts of the project I had the hunch that some letter-specialized heads are more like proto-correct-letter-heads (see paper for details), based on their attention pattern. We never investigated this, and I think it could go either way. The "it becomes cleaner" intuition basically relies on stuff like the grokking work and other work showing representations being refined late during training (by Thisby et al., I believe, and maybe others). However, some of this would probably require randomising e.g. the labels the model sees during training. See e.g. Cammarata et al. Understanding RL Vision: If you only ever see the second choice be labeled with B you don't have an incentive to distinguish between "look for B" and "look for the second choice". Lastly, even in the limit of infinite training data you still have limited model capacity and so will likely use a distributed representation in some way, but maybe you could at least get human-interpretable features even if they are distributed.

Comment by Tom Lieberum (Frederik) on We Found An Neuron in GPT-2 · 2023-02-12T21:41:33.089Z · LW · GW

Yup! I think that'd be quite interesting. Is there any work on characterizing the embedding space of GPT2?

Comment by Tom Lieberum (Frederik) on We Found An Neuron in GPT-2 · 2023-02-12T15:09:07.058Z · LW · GW

Nice work, thanks for sharing! I really like the fact that the neurons seem to upweight different versions of the same token (_an, _An, an, An, etc.). It's curious because the semantics of these tokens can be quite different (compared to the though, tho, however neuron).

Have you looked at all into what parts of the model feed into (some of) the cleanly associated neurons? It was probably out of scope for this but just curious.

Comment by Tom Lieberum (Frederik) on Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind · 2023-01-15T10:05:22.825Z · LW · GW

(The quote refers to the usage of binary attention patterns in general, so I'm not sure why you're quoting it)

I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0 and 1 entries.
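A quick numerical check of the softmax point, as a minimal sketch: the `softmax` helper is my own illustration (not tracr code), and the logits are just the {0, 1000, 2000} values mentioned above.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits separated by large gaps saturate to a hard 0/1 attention pattern,
# because exp(-1000) and exp(-2000) underflow to 0.0 in float64.
pattern = softmax([0.0, 1000.0, 2000.0])
print(pattern)
```

The gaps only need to be large relative to float precision for the pattern to become exactly binary; with smaller gaps the entries are merely close to 0 and 1.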

iiuc, the statement in the tracr paper is not that you can't have attention patterns which implement this logical operation, but that you can't have a single head implementing this attention pattern (without exponential blowup) 

Comment by Tom Lieberum (Frederik) on Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind · 2023-01-14T20:09:41.642Z · LW · GW

I don't think that's right. Iiuc this is a logical and, so the values would be in {0, 1} (as required, since tracr operates with Boolean attention). For a more extensive discussion of the original problem see appendix C.

Comment by Tom Lieberum (Frederik) on [Interim research report] Taking features out of superposition with sparse autoencoders · 2022-12-16T16:46:54.387Z · LW · GW

Meta-q: Are you primarily asking for better assumptions or that they be made more explicit?

I would be most interested in an explanation for the assumption that is grounded in the distribution you are trying to approximate. It's hard to tell which parts of the assumptions are bad without knowing (which properties of) the distribution it's trying to approximate or why you think that the true distribution has property XYZ.

Re MLPs: I agree that we ideally want something general, but it looks like your post is evidence that something about the assumptions is wrong and doesn't transfer to MLPs, breaking the method. So we probably want to understand better what about the assumptions doesn't hold there. If you have a toy model that better represents the true dist then you can confidently iterate on methods via the toy model.

Undertrained autoencoders

I was actually thinking of the LM when writing this but yeah the autoencoder itself might also be a problem. Great to hear you're thinking about that.

Comment by Tom Lieberum (Frederik) on [Interim research report] Taking features out of superposition with sparse autoencoders · 2022-12-16T16:38:34.032Z · LW · GW

(ETA to the OC: the antipodal pairs wouldn't happen here due to the way you set up the data generation, but if you were to learn the features as in the toy models post, you'd see that. I'm now less sure about this specific argument)

Comment by Tom Lieberum (Frederik) on [Interim research report] Taking features out of superposition with sparse autoencoders · 2022-12-16T13:31:36.478Z · LW · GW

Thanks for posting this. Some comments/questions we had after briefly discussing it in our team:

  • We would have loved to see more motivation for why you are making the assumptions you are making when generating the toy data.
    • Relatedly, it would be great to see an analysis of the distribution of the MLP activations. This could give you some info where your assumptions in the toy model fall short.
  • As Charlie Steiner pointed out, you are using a very favorable ratio in the toy model, i.e. of the number of ground truth features to the encoding dimension. I would expect you will mostly get antipodal pairs in that setup, rather than strongly interfering superposition. This may contribute significantly to the mismatch. (ETA: the antipodal pairs wouldn't happen here due to the way you set up the data generation, but if you were to learn the features as in the toy models post, you'd see that. I'm now less sure about this specific argument)
  • For the MMCS plots, we would be interested in seeing the distribution/histogram of MCS values. Especially for ~middling MCS values, where it's not clear if all features are somewhat represented or some are a lot and some not at all.
  • While we don't think this has a big impact compared to the other potential mismatches between toy model and the MLP, we do wonder whether the model has the parameters/data/training steps it needs to develop superposition of clean features.
    • e.g. in the toy models report, Elhage et al. reported phase transitions of superposition over the course of training,
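To make the MCS-histogram request in the list above concrete, here is a rough sketch. I am assuming MCS means, for each ground-truth feature, the maximum cosine similarity to any learned dictionary vector, and MMCS the mean of those values; the function names, the toy vectors, and the choice to clip negative similarities into the lowest bin are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_cosine_similarities(ground_truth, learned):
    """For each ground-truth feature, the best match among the learned features."""
    return [max(cosine(g, d) for d in learned) for g in ground_truth]


def histogram(values, n_bins=10):
    """Counts per equal-width bin on [0, 1]; values outside are clipped."""
    counts = [0] * n_bins
    for v in values:
        clipped = min(max(v, 0.0), 1.0)
        idx = min(int(clipped * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Toy data: two features are recovered exactly, one only partially.
ground_truth = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
learned = [[1.0, 0.0], [0.0, 1.0]]

mcs = max_cosine_similarities(ground_truth, learned)
mmcs = sum(mcs) / len(mcs)
counts = histogram(mcs, n_bins=4)
print(mmcs, counts)
```

The point of looking at `counts` rather than `mmcs` alone is that the same mean can come from all features being somewhat recovered or from a mix of perfectly recovered and missing ones.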

Comment by Tom Lieberum (Frederik) on Taking the parameters which seem to matter and rotating them until they don't · 2022-09-01T14:37:03.051Z · LW · GW

Yeah I agree with that. But there is also a sense in which some (many?) features will be inherently sparse.

  • A token is either the first one of a multi-token word or it isn't.
  • A word is either a noun, a verb or something else.
  • A word belongs to language LANG and not to any other language/has other meanings in those languages.
  • An image can only contain so many objects which can only contain so many sub-aspects.

I don't know what it would mean to go "out of distribution" in any of these cases.

This means that any network that has an incentive to conserve parameter usage (however we want to define that), might want to use superposition.

Comment by Tom Lieberum (Frederik) on Taking the parameters which seem to matter and rotating them until they don't · 2022-09-01T14:31:18.991Z · LW · GW

Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.

I'm not aware of any work that identifies superposition in exactly this way in NNs of practical use. 
As Spencer notes, you can verify that it does appear in certain toy settings though. Anthropic notes in their SoLU paper that they view their results as evidence for the SPH in LLMs. Imo the key part of the evidence here is that using a SoLU destroys performance but adding another LayerNorm afterwards solves that issue. The SoLU selects strongly against superposition and LayerNorm makes it possible again, which is some evidence that the way the LLM got to its performance was via superposition.

ETA: Ofc there could be some other mediating factor, too.

Comment by Tom Lieberum (Frederik) on Taking the parameters which seem to matter and rotating them until they don't · 2022-09-01T13:43:30.326Z · LW · GW

This example is meant to only illustrate how one could achieve this encoding. It's not how an actual autoencoder would work. An actual NN might not even use superposition for the data I described and it might need some other setup to elicit this behavior.
But to me it sounded like you think that superposition is nothing but the network being confused, whereas I think it can be the correct way to still be able to reconstruct the features to a reasonable degree.

Comment by Tom Lieberum (Frederik) on Taking the parameters which seem to matter and rotating them until they don't · 2022-09-01T13:09:20.374Z · LW · GW

Ah, I might have misunderstood your original point then, sorry! 

I'm not sure what you mean by "basis" then. How strictly are you using this term?

I imagine you are basically going down the "features as elementary unit" route proposed in Circuits (although you might not be pre-disposed to assume features are the elementary unit). Finding the set of features used by the network and figuring out how it's using them in its computations does not 1-to-1 translate to "find the basis the network is thinking in" in my mind.

Comment by Tom Lieberum (Frederik) on Taking the parameters which seem to matter and rotating them until they don't · 2022-09-01T13:02:40.643Z · LW · GW

Possibly the source of our disagreement here is that you are imagining the neuron ought to be strictly monotonically increasing in activation relative to the dog-headedness of the image?

If we abandon that assumption then it is relatively clear how to encode two numbers in 1D. Let's assume we observe two numbers $X, Y$. With probability $p$, $X = 0$ and $Y \sim \mathcal{N}(0, 1)$, and with probability $1 - p$, $Y = 0$ and $X \sim \mathcal{N}(0, 1)$.

We now want to encode these two events in some third variable $Z$, such that we can perfectly reconstruct $X, Y$ with probability $\approx 1$.

I put the solution behind a spoiler for anyone wanting to try it on their own.

Choose some veeeery large $\alpha \gg 1$ (much greater than the variance of the normal distribution of the features). For the first event, set $Z = Y - \alpha$. For the second event, set $Z = X + \alpha$.

The decoding works as follows:

If $Z$ is negative, then with probability $\approx 1$ we are in the first scenario and we can set $X = 0$, $Y = Z + \alpha$. Vice versa if $Z$ is positive.
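Here is a small sketch of that encoding scheme in code. The concrete choices are assumptions for illustration only: two features `x` and `y` of which exactly one is active, a standard normal for the active one, a 50/50 split between the two events, and an offset `ALPHA = 1e6`.

```python
import random

ALPHA = 1e6  # assumed offset, far larger than the scale of the features


def sample_event(rng, p=0.5):
    """With probability p: x = 0, y ~ N(0, 1). Otherwise: y = 0, x ~ N(0, 1)."""
    if rng.random() < p:
        return 0.0, rng.gauss(0.0, 1.0)
    return rng.gauss(0.0, 1.0), 0.0


def encode(x, y):
    """Pack the active feature into one number; the sign of z marks the event."""
    if x == 0.0:
        return y - ALPHA  # first event: shift far into the negatives
    return x + ALPHA      # second event: shift far into the positives


def decode(z):
    """Recover (x, y) from the sign of z and undo the shift."""
    if z < 0:
        return 0.0, z + ALPHA
    return z - ALPHA, 0.0


rng = random.Random(0)
max_error = 0.0
for _ in range(1000):
    x, y = sample_event(rng)
    x_hat, y_hat = decode(encode(x, y))
    max_error = max(max_error, abs(x - x_hat), abs(y - y_hat))
print(max_error)
```

Note that the encoded value is not monotonic in either feature, which is exactly the assumption being dropped. Decoding only fails if a feature's magnitude exceeds `ALPHA`, which is astronomically unlikely for a standard normal.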

Comment by Tom Lieberum (Frederik) on Taking the parameters which seem to matter and rotating them until they don't · 2022-09-01T10:17:03.592Z · LW · GW

I'd say that there is a basis the network is thinking in in this hypothetical, it would just so happens to not match the human abstraction set for thinking about the problem in question.

Well, yes but the number of basis elements that make that basis human interpretable could theoretically be exponential in the number of neurons.

Comment by Tom Lieberum (Frederik) on Taking the parameters which seem to matter and rotating them until they don't · 2022-09-01T10:15:20.694Z · LW · GW

If due to superposition, it proves advantageous to the AI to have a single feature that kind of does dog-head-detection and kind of does car-front-detection, because dog heads and car fronts don't show up in the training data at the same time, so it can still get perfect loss through a properly constructed dual-purpose feature like this, it'd mean that to the AI, dog heads and car fronts are "the same thing".

I don't think that's true. Imagine a toy scenario of two features that run through a 1D non-linear bottleneck before being reconstructed. Assuming that with some weight settings you can get superposition, the model is able to reconstruct the features ≈perfectly as long as they don't appear together. That means the model can still differentiate the two features, they are different in the model's ontology.

As AIs get more capable and general, I'd expect the concepts/features they use to start more closely matching the ones humans use in many domains.

My intuition disagrees here too. Whether we will observe superposition is a function of (number of "useful" features in the data), (sparsity of said features), and something like (bottleneck size). It's possible that bottleneck size will never be enough to compensate for number of features. Also it seems reasonable to me that ≈all of reality is extremely sparse in features, which presumably favors superposition.

Comment by Tom Lieberum (Frederik) on Taking the parameters which seem to matter and rotating them until they don't · 2022-08-29T20:06:22.350Z · LW · GW

I agree that all is not lost wrt sparsity and if SPH turns out to be true it might help us disentangle the superimposed features to better understand what is going on. You could think of constructing an "expanded" view of a neural network. The expanded view would allocate one neuron per feature and thus has sparse activations for any given data point and would be easier to reason about. That seems impractical in reality, since the cost of constructing this view might in theory be exponential, as there are exponentially many "almost orthogonal" vectors for a given vector space dimension, as a function of the dimension.
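As a rough illustration of the "almost orthogonal" point, the sketch below packs twice as many random unit vectors as there are dimensions and checks the worst pairwise overlap. The dimension, the number of vectors and the seed are arbitrary choices of mine, and this only shows that you can exceed the dimension with small overlaps, not the exponential scaling itself.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the sphere by normalizing a Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(vectors):
    """Largest |cosine similarity| over all pairs of unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(cos))
    return worst


rng = random.Random(0)
dim = 200
vectors = [random_unit_vector(dim, rng) for _ in range(2 * dim)]
worst_overlap = max_abs_cosine(vectors)
print(len(vectors), dim, worst_overlap)
```

An exactly orthogonal set would be capped at `dim` vectors; allowing a bit of overlap is what lets the count grow past that.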

I think my original comment was meant more as a caution against the specific approach of "find an interpretable basis in activation space", since that might be futile, rather than a caution against all attempts at finding a sparse representation of the computations that are happening within the network.

Comment by Tom Lieberum (Frederik) on Taking the parameters which seem to matter and rotating them until they don't · 2022-08-29T20:02:00.544Z · LW · GW

I don't think there is anything on that front other than the paragraphs in the SoLU paper. I alluded to a possible experiment for this on Twitter in response to that paper but haven't had the time to try it out myself: You could take a tiny autoencoder to reconstruct some artificially generated data where you vary attributes such as sparsity, ratio of input dimensions vs. bottleneck dimensions, etc. You could then look at the weight matrices of the autoencoder to figure out how it's embedding the features in the bottleneck and which settings lead to superposition, if any.
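A partial sketch of the two ends of that experiment, without the training loop in between: generating sparse synthetic features, and inspecting a weight matrix for overlapping feature embeddings. Everything here is hypothetical (function names, the uniform feature values, the hand-written 1-D weight matrix standing in for a trained encoder).

```python
import random


def sample_sparse_features(n_features, sparsity, rng):
    """Each feature is active with probability (1 - sparsity), uniform in [0, 1)."""
    return [
        rng.random() if rng.random() < 1.0 - sparsity else 0.0
        for _ in range(n_features)
    ]


def feature_overlaps(weights):
    """Gram matrix of feature embeddings.

    weights[i][j] is the weight from input feature j into bottleneck unit i,
    so column j is the embedding of feature j in the bottleneck.
    """
    n_features = len(weights[0])
    cols = [[row[j] for row in weights] for j in range(n_features)]
    return [
        [sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(n_features)]
        for i in range(n_features)
    ]


rng = random.Random(0)
dense = sample_sparse_features(5, sparsity=0.0, rng=rng)
empty = sample_sparse_features(5, sparsity=1.0, rng=rng)

# Stand-in for a trained encoder: two features share one bottleneck dimension
# as an antipodal pair, i.e. both are embedded (norm 1) but interfere (-1).
gram = feature_overlaps([[1.0, -1.0]])
print(gram)
```

In the Gram matrix, diagonal entries near 1 mean a feature is represented at all, and non-zero off-diagonal entries mean two represented features share directions, which is the signature of superposition one would look for while sweeping sparsity and bottleneck size.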

Comment by Tom Lieberum (Frederik) on Taking the parameters which seem to matter and rotating them until they don't · 2022-08-29T18:44:22.950Z · LW · GW

I disagree with your intuition that we should expect networks at irreducible loss to not be in superposition.

The reason I brought this up is that there are, IMO, strong first-principles reasons for why SPH should be correct. Say there are two features, each independently present in a given data point with probability 0.05; then it would be wasteful to allocate a full neuron to each of these features. The probability of both features being present at the same time is a mere 0.0025. If the superposition is implemented well you get basically two features for the price of one with an error rate of 0.25%. So if there is even a slight pressure towards compression, e.g. by having fewer available neurons than features, then superposition should be favored by the network.
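The arithmetic behind this argument, spelled out as a tiny sketch (variable names are mine; the only input is the 0.05 per-feature probability and the independence assumption):

```python
# Two independent features, each present with probability p.
p = 0.05

# Interference only happens when both are active in the same data point.
both_active = p * p

# For comparison: how often at least one of the two is active.
at_least_one = 1.0 - (1.0 - p) ** 2

print(f"both active:  {both_active:.4f} ({both_active:.2%})")
print(f"at least one: {at_least_one:.4f} ({at_least_one:.2%})")
```

So a shared neuron would be in use for roughly one in ten data points, while the collision case that causes errors shows up about once in 400.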

Now does this toy scenario map to reality? I think it does, and in some sense it is even more favorable to SPH since often the presence of features will be anti-correlated. 

Comment by Tom Lieberum (Frederik) on Taking the parameters which seem to matter and rotating them until they don't · 2022-08-27T07:15:19.607Z · LW · GW

Interesting idea! 

What do you think about the Superposition Hypothesis? If that were true, then at a sufficient sparsity of features in the input there is no basis in which the network is thinking, meaning it will be impossible to find a rotation matrix that allows for a bijective mapping between neurons and features.

I would assume that the rotation matrix that enables local changes via the sparse Jacobian coincides with one which maximizes some notion of "neuron-feature-bijectiveness". But as noted above that seems impossible if the SPH holds.

Comment by Tom Lieberum (Frederik) on A Mechanistic Interpretability Analysis of Grokking · 2022-08-19T13:40:57.551Z · LW · GW

K-composition as a concept was introduced by Anthropic in their work on Transformer Circuits in the initial post. In general, the output of an attention head in an earlier layer can influence the query, key, or value computation of an attention head in a later layer. 

K-composition refers to the case in which the key computation is influenced. In a model without nonlinearities or layernorms you can do this simply by looking at how strongly the output matrix of head 1 and the key matrix of head 2 compose (or more precisely, by looking at the Frobenius norm of the product relative to the product of the individual norms). I also tried to write a bit about it here.
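A minimal sketch of that norm ratio, in plain Python so it runs without dependencies. The shapes and names are assumptions: `w_out` (d_model x d_head) is how head 1 writes into the residual stream and `w_key` (d_head x d_model) is how head 2 reads its keys from it. The Transformer Circuits post works with the combined OV and QK matrices, so treat this as a simplified version of the idea rather than their exact definition.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def frobenius_norm(m):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))


def composition_score(w_key, w_out):
    """||W_K W_O||_F / (||W_K||_F * ||W_O||_F), a value in [0, 1]."""
    product = matmul(w_key, w_out)
    return frobenius_norm(product) / (frobenius_norm(w_key) * frobenius_norm(w_out))


# Toy example with d_model = 3 and d_head = 1.
w_out_head1 = [[1.0], [0.0], [0.0]]   # head 1 writes to residual dimension 0
w_key_aligned = [[1.0, 0.0, 0.0]]     # head 2 keys read from dimension 0
w_key_orthogonal = [[0.0, 1.0, 0.0]]  # head 2 keys read from dimension 1

aligned_score = composition_score(w_key_aligned, w_out_head1)
orthogonal_score = composition_score(w_key_orthogonal, w_out_head1)
print(aligned_score, orthogonal_score)
```

The normalization matters because it makes the score insensitive to the overall scale of either matrix, so a high value reflects shared subspaces rather than large weights.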

Comment by Tom Lieberum (Frederik) on Two-year update on my personal AI timelines · 2022-08-08T11:55:09.596Z · LW · GW

Thanks for verifying! I retract my comment.

Comment by Tom Lieberum (Frederik) on Two-year update on my personal AI timelines · 2022-08-03T21:33:52.464Z · LW · GW

I think historically reinforcement has been used more in that particular constellation (see e.g. the Deep RL from Human Preferences paper), but as I noted I find reward learning more apt, as it points to the hard thing being the reward learning, i.e. distilling human feedback into an objective, rather than the optimization of any given reward function (which technically need not involve reinforcement learning).

Comment by Tom Lieberum (Frederik) on Two-year update on my personal AI timelines · 2022-08-03T21:28:01.532Z · LW · GW

Well I thought about that but I wasn't sure whether reinforcement learning from human feedback wouldn't be just a strict subset of reward learning from human feedback. If reinforcement is indeed the strict definition then I concede, but I don't think it makes sense.

Comment by Tom Lieberum (Frederik) on Two-year update on my personal AI timelines · 2022-08-03T13:05:49.441Z · LW · GW

Reward Learning from Human Feedback

Comment by Tom Lieberum (Frederik) on chinchilla's wild implications · 2022-07-31T12:39:09.558Z · LW · GW

Thanks for your reply! I think I basically agree with all of your points. I feel a lot of frustration around the fact that we don't seem to have adequate infohazard policies to address this. It seems like a fundamental trade-off between security and openness/earnestness of discussion does exist though. 

It could be the case that this community is not the correct place to enforce these rules, as there does still exist a substantial gap between "this thing could work" and "we have a working system". This is doubly true in DL where implementation details matter a great deal.

Comment by Tom Lieberum (Frederik) on chinchilla's wild implications · 2022-07-31T08:01:18.588Z · LW · GW

I'd like to propose not talking publicly about ways to "fix" this issue. Insofar as these results spell trouble for scaling up LLMs, this is a good thing!
Infohazard (meta-)discussions are thorny by their very nature and I don't want to discourage discussions around these results in general, e.g. how to interpret them or whether the analysis has merits. 

Comment by Tom Lieberum (Frederik) on Race Along Rashomon Ridge · 2022-07-28T10:03:36.831Z · LW · GW

If the subset  of interpretable models is also "nice" in the differential-geometric sense (say, also a smooth submanifold of ), then the intersection  is also similarly "nice."

Do you have any intuition for why we should expect  to be "nice"? I'm not super familiar with differential geometry but I don't really see why this should be the case..

Comment by Tom Lieberum (Frederik) on Which singularity schools plus the no singularity school was right? · 2022-07-22T14:41:48.196Z · LW · GW

This assumes a fixed scaling law. One possible way of improving oneself could be to design a better architecture with a better scaling exponent.

Comment by Tom Lieberum (Frederik) on A note about differential technological development · 2022-07-15T19:10:55.507Z · LW · GW

Thanks for elaborating! Insofar as your assessment is based on in-person interactions, I can't really comment since I haven't spoken much with people from Anthropic.

I think there are degrees to believing this meme you refer to, in the sense of "we need an AI of capability level X to learn meaningful things". And I would guess that many people at Anthropic do believe this weaker version -- it's their stated purpose after all. And for some values of X this statement is clearly true, e.g. learned filters by shallow CNNs trained on MNIST are not interpretable, whereas the filters of deep Inception-style CNNs trained on ImageNet are (mostly) interpretable.

One could argue that parts of interpretability do need to happen in a serial manner, e.g. finding out the best way to interpret transformers at all, the recent SoLU finding, or just generally building up knowledge on how to best formalize or go about this whole interpretability business. If that is true, and furthermore interpretability turns out to be an important component in promising alignment proposals, then the question is mostly about what level of X gives you the most information to advance the serial interpretability research in terms of how much other serial budget you burn.

I don't know whether people at Anthropic believe the above steps or have thought about it in these ways at all but if they did this could possibly explain the difference in policies between you and them?

Comment by Tom Lieberum (Frederik) on A note about differential technological development · 2022-07-15T13:29:32.355Z · LW · GW

I'd also be interested in hearing which parts of Anthropic's research output you think burns our serial time budget. If I understood the post correctly, then OP thinks that efforts like transformer circuits are mostly about accelerating parallelizable research.

Maybe OP thinks that

  • mechanistic interpretability has little value in terms of serial research
  • RLHF does not give us alignment (because it doesn't generalize beyond the "sharp left turn" which OP thinks is likely to happen)
  • therefore, since most of Anthropic's alignment focused output has not much value in terms of serial research, and it does somewhat enhance present-day LLM capabilities/usability, it is net negative?

But I'm very much unsure whether OP really believes this -- would love to hear him elaborate.

ETA: It could also be the case that OP was exclusively referring to the part of Anthropic that is about training LLMs efficiently as a pre-requisite to study those models?

Comment by Tom Lieberum (Frederik) on How do I use caffeine optimally? · 2022-06-23T04:34:01.853Z · LW · GW

Yep all good points. I think I didn't emphasize enough that you should not take it every day (maybe not even every other day).

The gums are less addictive than cigs because they taste bad and because the feedback/reinforcement is slower. Lozenges sound like a good alternative too, to be extra sure.

Comment by Tom Lieberum (Frederik) on How do I use caffeine optimally? · 2022-06-22T18:46:17.696Z · LW · GW

I wouldn't recommend regular caffeine at all unless you know from experience that you won't develop a physical dependency. In my experience you get more like a short-term gain until your body adapts and then requires coffee to function normally.

If you do want to try caffeine I recommend trying to pair it with L-theanine (either in pills or green tea) which is supposed to smooth the experience and makes for a cleaner high (YMMV).

If you're looking for a stimulant that you don't take regularly and with shorter half life, consider nicotine gums. Again ymmv, I think gwern has tried it with little effect. Beware the addictive potential (although lower than with cigarettes or vapes)

Comment by Tom Lieberum (Frederik) on CNN feature visualization in 50 lines of code · 2022-05-26T13:02:48.826Z · LW · GW

On priors, I wouldn't worry too much about c), since I would expect a 'super stimulus' for head A to not be a super stimulus for head B.

I think one of the problems is the discrete input space, i.e. how do you parameterize the sequence that is being optimized?

One idea I just had was trying to fine-tune an LLM with a reward signal given by, for example, the magnitude of the residual delta coming from a particular head (we probably want something else here, maybe net logit change?). The LLM then already encodes a prior over "sensible" sequences and will try to find one of those which activates the head strongly (however we want to operationalize that).

Comment by Tom Lieberum (Frederik) on CNN feature visualization in 50 lines of code · 2022-05-26T12:43:30.179Z · LW · GW

Very cool to see new people joining the interpretability field!

Some resource suggestions:

If you didn't know already, there is a TF2 port of Lucid, called Luna:

There is also Lucent, which is Lucid for PyTorch: (Some docs written by me for a slightly different version)

For transformer interpretability you might want to check out Anthropic's work on transformer circuits, Redwood Research's interpretability tool, or (shameless plug) Unseal.

Comment by Tom Lieberum (Frederik) on DeepMind is hiring for the Scalable Alignment and Alignment Teams · 2022-05-13T14:11:55.738Z · LW · GW

I can't speak to the option for remote work but as a counterpoint, it seems very straightforward to get a UK visa for you and your spouse/children (at least straightforward relative to the US). The relevant visa to google is the Skilled Worker / Tier 2 visa if you want to know more.

ETA: Of course, there are still legitimate reasons for not wanting to move. Just wanted to point out that the legal barrier is lower than you might think.

Comment by Tom Lieberum (Frederik) on Hoagy's Shortform · 2022-04-25T09:38:51.082Z · LW · GW

There is definitely something out there, just can't recall the name. A keyword you might want to look for is "disentangled representations".

One start would be the beta-VAE paper https://openreview.net/forum?id=Sy2fzU9gl

Comment by Tom Lieberum (Frederik) on Replacing Karma with Good Heart Tokens (Worth $1!) · 2022-04-01T09:16:31.659Z · LW · GW

Considering you get at least one free upvote from posting/commenting itself, you just have to be faster than the downvoters to generate money :P

Comment by Tom Lieberum (Frederik) on Gears-Level Mental Models of Transformer Interpretability · 2022-03-31T13:08:47.069Z · LW · GW

Small nitpick:

The PCA plot is using the smallest version of GPT2, and not the 1.5B parameter model (that would be GPT2-XL). The small model is significantly worse than the large one and so I would be hesitant to draw conclusions from that experiment alone.

Comment by Tom Lieberum (Frederik) on Do a cost-benefit analysis of your technology usage · 2022-03-31T09:27:01.515Z · LW · GW

I want to second your first point. Texting frequently with significant others lets me feel like part of their life and vice versa, which a weekly call does not accomplish, partly because it is weekly and partly because I am pretty averse to calls.

In one relationship I had, this led to significant misery on my part because my partner was pretty strict on their phone usage, batching messages for the mornings and evenings. For my current primary relationship, I'm convinced that the frequent texting is what kept it alive while being long-distance. 

To reconcile the two viewpoints, I think it is still true that superficial relationships via social media likes or retweets are not worth that much if they are all there is to the relationship. But direct text messages are a significant improvement on that. 

Re your blog post:
Maybe that's me being introverted but there are probably significant differences in whether people feel comfortable/like texting or calling. For me, the instantaneousness of calling makes it much more stressful, and I do have a problem with people generalizing either way that one way to interact over distances is superior in general. I do cede the point that calling is of course much higher bandwidth, but it also requires more time commitment and coordination. 

Comment by Tom Lieberum (Frederik) on Hypothesis: gradient descent prefers general circuits · 2022-02-12T10:57:52.838Z · LW · GW

I tried increasing weight decay and increased batch sizes but so far no real success compared to 5x lr. Not going to investigate this further atm.

Comment by Tom Lieberum (Frederik) on Hypothesis: gradient descent prefers general circuits · 2022-02-11T13:45:30.014Z · LW · GW

Oh I thought figure 1 was S5 but it actually is modular division. I'll give that a go..

Here are results for modular division. Not super sure what to make of them. Small increases in learning rate work, but so does just choosing a larger learning rate from the beginning. In fact, increasing lr to 5x from the beginning works super well but switching to 5x once grokking arguably starts just destroys any progress. 10x lr from the start does not work (nor when switching later)

So maybe the initial observation is more a general/global property of the loss landscape for the task and not of the particular region during grokking?

Comment by Tom Lieberum (Frederik) on Hypothesis: gradient descent prefers general circuits · 2022-02-11T12:21:56.541Z · LW · GW

So I ran some experiments for the permutation group S_5 with the task x o y = ?

Interestingly here increasing the learning rate just never works. I'm very confused.

Comment by Tom Lieberum (Frederik) on Hypothesis: gradient descent prefers general circuits · 2022-02-11T10:50:24.945Z · LW · GW

I updated the report with the training curves. Under default settings, 100% training accuracy is reached after 500 steps.

There is actually an overlap between the train/val curves going up. Might be an artifact of the simplicity of the task or that I didn't properly split the dataset (e.g. x+y being in train and y+x being in val). I might run it again for a harder task to verify.
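One way to rule out the x+y / y+x leak mentioned above is to split on unordered pairs, so both orderings of a pair always land on the same side. This is only a sketch under my own assumptions (modular addition as the task, hypothetical function name, an arbitrary 50/50 split), not the split used in the report.

```python
import random


def split_commutative_pairs(modulus, train_fraction=0.5, seed=0):
    """Split (x, y, (x + y) % modulus) examples without leaking (y, x) across splits."""
    # Enumerate each unordered pair once, then shuffle and cut.
    unordered = [(x, y) for x in range(modulus) for y in range(x, modulus)]
    rng = random.Random(seed)
    rng.shuffle(unordered)
    n_train = int(train_fraction * len(unordered))

    def expand(pairs):
        """Turn unordered pairs back into ordered examples, both orderings together."""
        examples = []
        for x, y in pairs:
            examples.append((x, y, (x + y) % modulus))
            if x != y:
                examples.append((y, x, (x + y) % modulus))
        return examples

    return expand(unordered[:n_train]), expand(unordered[n_train:])


train, val = split_commutative_pairs(modulus=7)
train_keys = {frozenset((x, y)) for x, y, _ in train}
val_keys = {frozenset((x, y)) for x, y, _ in val}
print(len(train), len(val), train_keys.isdisjoint(val_keys))
```

For non-commutative tasks such as composing permutations, the two orderings are genuinely different examples, so an ordinary random split is fine there.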

Comment by Tom Lieberum (Frederik) on Hypothesis: gradient descent prefers general circuits · 2022-02-11T10:40:04.696Z · LW · GW

Yep I used my own re-implementation, which somehow has slightly different behavior.

I'll also note that the task in the report is modular addition while figure 1 from the paper (the one with the red and green lines for train/val) is the significantly harder permutation group task.

Comment by Tom Lieberum (Frederik) on Hypothesis: gradient descent prefers general circuits · 2022-02-11T08:37:42.283Z · LW · GW

I'm not sure I understand.

I chose the grokking starting point as 300 steps, based on the yellow plot. I'd say it's reasonable to say that 'grokking is complete' by the 2000 step mark in the default setting, whereas it is complete by the 450 step mark in the 10x setting (assuming appropriate LR decay to avoid overshooting). Also note that the plots in the report are not log-scale.

Comment by Tom Lieberum (Frederik) on Hypothesis: gradient descent prefers general circuits · 2022-02-10T20:03:56.070Z · LW · GW

It would be interesting to see if, once grokking had clearly started, you could just 100x the learning rate and speed up the convergence to zero validation loss by 100x.

I ran a quick-and-dirty experiment and it does in fact look like you can just crank up the learning rate at the point where some part of grokking happens to speed up convergence significantly. See the wandb report:

https://wandb.ai/tomfrederik/interpreting_grokking/reports/Increasing-Learning-Rate-at-Grokking--VmlldzoxNTQ2ODY2?accessToken=y3f00qfxot60n709pu8d049wgci69g53pki6pq6khsemnncca1dnmocu7a3d43y8

I set the LR to 5x the normal value (100x tanked the accuracy, 10x still works though). Of course you would want to anneal it after grokking was finished.
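For concreteness, the schedule I have in mind looks roughly like the sketch below. The 5x multiplier and the 300-step switch point come from the experiments discussed in this thread; the base learning rate, the linear anneal and its length are placeholder assumptions, and the function is not taken from my training code.

```python
def learning_rate(step, base_lr=1e-3, switch_step=300, multiplier=5.0,
                  anneal_start=None, anneal_steps=500):
    """Piecewise schedule: base LR, then a boosted LR, then an optional linear anneal."""
    if step < switch_step:
        return base_lr
    boosted = base_lr * multiplier
    if anneal_start is None or step < anneal_start:
        return boosted
    # Linearly interpolate from the boosted LR back down to the base LR.
    progress = min((step - anneal_start) / anneal_steps, 1.0)
    return boosted + progress * (base_lr - boosted)


# Boost at the (manually chosen) start of grokking, anneal once it looks done.
schedule = [learning_rate(s, anneal_start=450) for s in (0, 299, 300, 450, 700, 950, 2000)]
print(schedule)
```

In practice the switch and anneal points were picked by eye from the validation curve, so an automated trigger would need some criterion for "grokking has started".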

Comment by Tom Lieberum (Frederik) on Understanding the tensor product formulation in Transformer Circuits · 2021-12-28T14:12:35.906Z · LW · GW

Ah yes that makes sense to me. I'll modify the post accordingly and probably write it in the basis formulation.

ETA: Fixed now, computation takes a tiny bit longer but hopefully still readable to everyone.

Comment by Tom Lieberum (Frederik) on Should I delay having children to take advantage of polygenic screening? · 2021-12-19T11:33:46.679Z · LW · GW

Seems like this could be circumvented relatively easily by freezing gametes now.