Posts

Efficient Dictionary Learning with Switch Sparse Autoencoders 2024-07-22T18:45:53.502Z

Comments

Comment by Anish Mudide (anish-mudide) on Efficient Dictionary Learning with Switch Sparse Autoencoders · 2024-07-23T20:54:56.748Z · LW · GW

Thanks for your comment! I believe your concern was echoed by Lee and Arthur in their comments and is completely valid. This work is primarily a proof-of-concept that we can successfully scale SAEs by directly applying MoE, but I suspect that we will need to make tweaks to the architecture.

Comment by Anish Mudide (anish-mudide) on Efficient Dictionary Learning with Switch Sparse Autoencoders · 2024-07-23T20:47:09.591Z · LW · GW

Yes, you can train a Switch SAE in any scenario where you can train a standard SAE. @hugofry has a nice blog post on training SAEs for ViT.

Comment by Anish Mudide (anish-mudide) on Efficient Dictionary Learning with Switch Sparse Autoencoders · 2024-07-23T20:44:19.853Z · LW · GW

Thanks for the question -- the load-balancing loss is calculated over an entire batch of inputs, not a single activation. Figure 1 shows how the Switch SAE processes a single residual stream activation.
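
For anyone who wants to see the routing and the batch-level balancing written out concretely, here is a minimal PyTorch sketch. The names (`d_model`, `n_experts`, `d_expert`, the TopK activation) and the exact scaling are illustrative assumptions, not the actual code behind the post, and biases are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchSAE(nn.Module):
    """Illustrative sketch: each activation is routed to ONE expert,
    while the load-balancing term averages over the whole batch."""

    def __init__(self, d_model: int, n_experts: int, d_expert: int, k: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.enc = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.01)
        self.dec = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.01)
        self.k = k

    def forward(self, x):                                  # x: [batch, d_model]
        probs = F.softmax(self.router(x), dim=-1)          # [batch, n_experts]
        expert = probs.argmax(dim=-1)                      # one expert per activation
        recon = torch.zeros_like(x)
        for e in expert.unique():                          # only the chosen expert's latents are computed
            idx = expert == e
            pre = x[idx] @ self.enc[e]                     # [n_e, d_expert]
            top = torch.topk(pre, self.k, dim=-1)
            z = torch.zeros_like(pre).scatter_(-1, top.indices, F.relu(top.values))
            recon[idx] = probs[idx, e].unsqueeze(-1) * (z @ self.dec[e])
        # Load-balancing term is an average over the whole batch, not a single activation:
        frac_tokens = F.one_hot(expert, probs.shape[-1]).float().mean(0)
        frac_probs = probs.mean(0)
        aux_loss = probs.shape[-1] * (frac_tokens * frac_probs).sum()
        return recon, aux_loss
```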

Comment by Anish Mudide (anish-mudide) on Efficient Dictionary Learning with Switch Sparse Autoencoders · 2024-07-23T20:32:30.119Z · LW · GW

Hi Lee and Arthur, thanks for the feedback! I agree that routing to a single expert will force redundant features and will experiment with Arthur's suggestion. I haven't taken a close look at the router/expert geometry yet but plan to do so soon. 

Comment by Anish Mudide (anish-mudide) on Efficient Dictionary Learning with Switch Sparse Autoencoders · 2024-07-23T20:17:19.631Z · LW · GW

Thanks for the comment -- I trained TopK SAEs with various widths (all fitting within a single GPU) and observed that wider SAEs take substantially longer to train, which leads me to believe that the encoder forward pass is a major bottleneck for wall-clock time. The Switch SAE also improves memory efficiency because we do not need to store all of the latents.
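
As a rough illustration of why the encoder matmul dominates as width grows (the numbers below are made up for the example, not measurements from our runs):

```python
# Back-of-the-envelope FLOP count for the encoder matmul per activation.
d_model = 768          # residual stream dimension (illustrative)
n_latents = 2**21      # total dictionary size (illustrative)
n_experts = 64         # Switch SAE experts (illustrative)

dense_flops = 2 * d_model * n_latents                  # every latent's pre-activation is computed
switch_flops = 2 * d_model * (n_latents // n_experts)  # only the selected expert's latents

print(f"dense:  {dense_flops:.2e} FLOPs per activation")
print(f"switch: {switch_flops:.2e} FLOPs per activation (~{n_experts}x fewer)")
```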

I'm currently working on implementing expert-parallelism, which I hope will lead to substantial improvements to wall-clock time.