I have not read this before, thanks. It reminds me a lot of Normal Computing's extended-mind models. I think these are good ideas worth testing, and there are many others in the same vein. My intuition is that any approach that gradually increases global information prior to decoding is worth experimenting with, whether through your method or something similar (it doesn't necessarily have to be diffusion on embeddings).
Aesthetically, I just don't like that transformers collapse information at each token and don't allow backtracking (without significant effort in a custom sampler). In my ideal world we could completely reconstruct prose from embeddings and thus simply autoregress in latent space. I think Yann LeCun has discussed this in the context of JEPA as well.
My original thought came from a frequency-autoregression experiment of mine, where I used a causal transformer on the frequency domain of images (to sort of replicate diffusion). Because of the nature of the inverse FFT, each step adds information globally to all pixels, yet the model still has an autoregressive backbone.
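Roughly, the setup looks something like the sketch below: take the 2D FFT of an image, order the coefficients coarse-to-fine, and train a causal transformer to predict the next coefficient. This is an illustrative, untested sketch rather than my actual experiment; all names, shapes, and hyperparameters are made up.

```python
import torch
import torch.nn as nn

def low_to_high_order(h, w):
    """Indices of the flattened (u, v) FFT grid sorted by radial distance from DC."""
    coords = [(u, v) for u in range(h) for v in range(w)]
    def radius(uv):
        u, v = uv
        du, dv = min(u, h - u), min(v, w - v)  # respect FFT wrap-around symmetry
        return du * du + dv * dv
    return sorted(range(len(coords)), key=lambda i: radius(coords[i]))

class FrequencyAR(nn.Module):
    """Causal transformer over a coarse-to-fine sequence of 2D FFT coefficients."""
    def __init__(self, h=32, w=32, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.order = low_to_high_order(h, w)
        self.in_proj = nn.Linear(2, d_model)        # (real, imag) -> embedding
        self.pos = nn.Parameter(torch.zeros(h * w, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, 2)       # predict next (real, imag)

    def forward(self, images):                      # images: (B, H, W), real-valued
        freq = torch.fft.fft2(images)               # (B, H, W) complex
        seq = torch.view_as_real(freq.flatten(1))   # (B, H*W, 2)
        seq = seq[:, self.order]                    # reorder coarse -> fine
        x = self.in_proj(seq) + self.pos
        T = x.shape[1]
        mask = torch.full((T, T), float("-inf"), device=x.device).triu(1)
        h = self.blocks(x, mask=mask)
        pred = self.out_proj(h)
        # Teacher forcing: coefficients <= t predict coefficient t+1. At sampling
        # time, generated coefficients go back onto the grid and torch.fft.ifft2
        # reconstructs the image, which is what spreads each new coefficient's
        # information across all pixels.
        return nn.functional.mse_loss(pred[:, :-1], seq[:, 1:])
```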
If blocks of text could be properly encoded and decoded VAE-style, this would not only reduce the compute requirements for transformers (since you would only have to predict latents), but it might also offer a sort of "idea" autoregression, where a thought is slowly formulated and then decoded into a sequence of tokens. Alas, it seems there is a large chunk of unresolved research here; I doubt many researchers are willing to allocate time to it.
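The generation loop I have in mind would look something like this. It assumes a pretrained VAE-style text autoencoder (`text_encoder` / `text_decoder`) and a causal transformer over block latents (`latent_lm`), none of which exist here; all names are hypothetical placeholders.

```python
import torch

def generate_by_ideas(text_encoder, latent_lm, text_decoder, prompt_blocks, n_new_blocks=4):
    """Autoregress over block-level latents ("ideas") instead of tokens, then decode."""
    # Encode each prompt block (a chunk of text) into a single latent vector.
    latents = torch.stack([text_encoder(block) for block in prompt_blocks])  # (T, D)
    latents = latents.unsqueeze(0)                                           # (1, T, D)
    for _ in range(n_new_blocks):
        next_latent = latent_lm(latents)[:, -1:]     # predict the next "idea" latent
        latents = torch.cat([latents, next_latent], dim=1)
    # Only the newly generated latents get decoded back into token sequences.
    return [text_decoder(z) for z in latents[0, len(prompt_blocks):]]
```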
There has been a lot of recent talk about diffusion models implicitly autoregressing in the frequency domain (low to high, coarse to fine features). I see no reason we cannot explicitly autoregress over frequencies, using an FFT and causal attention in the frequency domain for a batched loss. I'll probably attempt this at some point.
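One way I imagine making the loss batched (a sketch under my own assumptions, not an established recipe): group FFT coefficients into radial frequency bands and use a band-causal attention mask, so every band is trained in parallel while only seeing coarser bands. The band count and boundaries below are illustrative, and the construction of next-band prediction targets is omitted.

```python
import torch

def band_ids(h, w, n_bands=8):
    """Assign each 2D FFT coefficient to a radial frequency band (0 = coarsest)."""
    u = torch.arange(h).view(h, 1)
    v = torch.arange(w).view(1, w)
    du = torch.minimum(u, h - u)                 # respect FFT wrap-around symmetry
    dv = torch.minimum(v, w - v)
    r = (du ** 2 + dv ** 2).float().sqrt()
    edges = torch.linspace(0.0, float(r.max()) + 1e-6, n_bands + 1)
    return torch.bucketize(r.flatten(), edges[1:-1])   # (H*W,) band index per coefficient

def band_causal_mask(bands):
    """Attention mask: a coefficient in band k may only attend to bands <= k."""
    allowed = bands.unsqueeze(1) >= bands.unsqueeze(0)  # (N, N) bool
    # Additive mask (0 = attend, -inf = blocked); pass as attn_mask so the loss
    # for every band can be computed in one batched forward pass.
    return torch.zeros(allowed.shape).masked_fill(~allowed, float("-inf"))
```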