Posts

Massive Activations and why <bos> is important in Tokenized SAE Unigrams 2024-09-05T02:19:25.592Z
Training a SAE in < 30 minutes on 16GB of VRAM using an S3 cache 2024-08-24T07:39:00.057Z
Faithful vs Interpretable Sparse Autoencoder Evals 2024-07-12T05:37:18.525Z

Comments

Comment by Louka Ewington-Pitsos (louka-ewington-pitsos) on Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers · 2024-09-06T23:15:21.002Z · LW · GW

I couldn't find a link to the code in the article so in case anyone else wants to try to replicate I think this is it: https://github.com/HugoFry/mats_sae_training_for_ViTs

Comment by Louka Ewington-Pitsos (louka-ewington-pitsos) on Efficient Dictionary Learning with Switch Sparse Autoencoders · 2024-08-19T04:06:46.566Z · LW · GW

Just to close the loop on this one: the official Hugging Face transformers library just uses a for loop to implement MoE. I also implemented a version myself using a for loop, and for large latent and batch sizes it's much more efficient than either vanilla matrix multiplication or that weird batch matmul I wrote up there.
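For anyone wanting something concrete, here is a minimal sketch of what for-loop routing can look like. It is not the Hugging Face code and not my exact implementation: the function name `moe_for_loop`, the `expert_idx` argument (one pre-computed expert id per row, as if a router had already picked the top expert) and the toy weights are all assumptions for illustration.

```python
import torch


def moe_for_loop(x, expert_idx, enc, dec):
    """Route each row of x through exactly one expert using a Python loop.

    x:          (bs, input_dim) inputs
    expert_idx: (bs,) long tensor, the expert chosen for each row (assumed given)
    enc:        (n_experts, input_dim, latent_dim) encoder weights
    dec:        (n_experts, latent_dim, input_dim) decoder weights
    """
    n_experts, _, latent_dim = enc.shape
    latent = torch.zeros(x.shape[0], latent_dim, dtype=x.dtype, device=x.device)
    recon = torch.zeros_like(x)
    for e in range(n_experts):
        # Indices of the rows assigned to expert e.
        rows = (expert_idx == e).nonzero(as_tuple=True)[0]
        if rows.numel() == 0:
            continue
        # One ordinary matmul per expert, over only that expert's rows.
        h = torch.relu(x[rows] @ enc[e])
        latent[rows] = h  # post-ReLU activations
        recon[rows] = h @ dec[e]
    return recon, latent


# Toy usage with two experts; rows 0 and 2 go to expert 0, row 1 to expert 1.
x = torch.tensor([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
enc = torch.stack([
    torch.ones(2, 4),
    torch.tensor([[3.0, 3.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]),
])
dec = torch.stack([-torch.ones(4, 2), -10.0 * torch.ones(4, 2)])
expert_idx = torch.tensor([0, 1, 0])

recon, latent = moe_for_loop(x, expert_idx, enc, dec)
```

The point of the loop is that each expert's weights are read once per batch and never copied per sample, so the extra cost is one small matmul per expert rather than a weight tensor that grows with the batch size.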

Comment by Louka Ewington-Pitsos (louka-ewington-pitsos) on Efficient Dictionary Learning with Switch Sparse Autoencoders · 2024-08-13T09:39:57.220Z · LW · GW

wait a minute... could you just...

you don't just literally do this do you?

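import torch

# Float tensors throughout, so the batched matmul also works on GPU.
input = torch.tensor([
    [1., 2.],
    [1., 2.],
    [1., 2.],
]) # (bs, input_dim)

# Encoder weights per expert: (input_dim, latent_dim)
enc_expert_1 = torch.tensor([
    [1., 1., 1., 1.],
    [1., 1., 1., 1.],
])
enc_expert_2 = torch.tensor([
    [3., 3., 0., 0.],
    [0., 0., 2., 0.],
])

# Decoder weights per expert: (latent_dim, input_dim)
dec_expert_1 = torch.tensor([
    [-1., -1.],
    [-1., -1.],
    [-1., -1.],
    [-1., -1.],
])
dec_expert_2 = torch.tensor([
    [-10., -10.],
    [-10., -10.],
    [-10., -10.],
    [-10., -10.],
])

def moe(input, enc, dec, nonlinearity):
    input = input.unsqueeze(1)  # (bs, 1, input_dim)
    latent = torch.bmm(input, enc)  # (bs, 1, latent_dim)

    # apply the nonlinearity to the latent first, then decode
    recon = torch.bmm(nonlinearity(latent), dec)  # (bs, 1, input_dim)

    return recon.squeeze(1), latent.squeeze(1)


# not this! some kind of actual routing algorithm, but you end up with something similar:
# one (enc, dec) pair stacked per batch row
enc = torch.stack([enc_expert_1, enc_expert_2, enc_expert_1])  # (bs, input_dim, latent_dim)
dec = torch.stack([dec_expert_1, dec_expert_2, dec_expert_1])  # (bs, latent_dim, input_dim)

nonlinearity = torch.nn.ReLU()
recons, latent = moe(input, enc, dec, nonlinearity)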

This must in some way be horrifically inefficient, right?
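One way to put a number on that worry: stacking a weight matrix per batch row means the weight tensors scale with the batch size instead of the number of experts. The sketch below just counts elements; the helper names and the sizes (batch of 4096, input dim 768, 1024 latents per expert, 8 experts) are hypothetical, picked only to make the arithmetic concrete.

```python
def stacked_weight_elements(bs, input_dim, latent_dim):
    # enc is (bs, input_dim, latent_dim) and dec is (bs, latent_dim, input_dim)
    return 2 * bs * input_dim * latent_dim


def shared_weight_elements(n_experts, input_dim, latent_dim):
    # one encoder and one decoder matrix per expert, reused across the batch
    return 2 * n_experts * input_dim * latent_dim


# Hypothetical sizes, for illustration only.
bs, input_dim, latent_dim, n_experts = 4096, 768, 1024, 8

stacked = stacked_weight_elements(bs, input_dim, latent_dim)
shared = shared_weight_elements(n_experts, input_dim, latent_dim)

bytes_per_float32 = 4
stacked_gib = stacked * bytes_per_float32 / 2**30  # per-row copies of the weights
shared_mib = shared * bytes_per_float32 / 2**20    # the experts' actual weights
ratio = stacked // shared                          # equals bs // n_experts
```

With these made-up sizes the per-row copies come to 24 GiB in float32 against 48 MiB for the experts themselves, a factor of bs / n_experts = 512, before any activations are stored.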

Comment by Louka Ewington-Pitsos (louka-ewington-pitsos) on Efficient Dictionary Learning with Switch Sparse Autoencoders · 2024-08-13T02:15:52.562Z · LW · GW

Can I ask what you used to implement the MoE routing? Did you use MegaBlocks? I would love to expand on this research but I can't find any straightforward implementation of efficient PyTorch MoE routing online.

Do you simply iterate over each max-probability expert every time you feed in a batch?

Comment by Louka Ewington-Pitsos (louka-ewington-pitsos) on Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT. · 2024-06-30T00:33:27.143Z · LW · GW

This is dope, thank you for your service. Also, can you hit us with your code on this one? Would love to reproduce.