[Linkpost] The Expressive Capacity of State Space Models: A Formal Language Perspective

post by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-05-28T13:49:38.738Z · LW · GW · 3 comments

This is a link post for https://arxiv.org/abs/2405.17394


Paper authors: Yash Sarrof, Yana Veitsman, Michael Hahn.

Context: architectures with weak forward passes can be differentially transparent; see e.g. this comment [LW(p) · GW(p)] / the whole thread and research agendas like externalized reasoning [LW · GW] or the translucent thought hypothesis [LW · GW].

Summary thread: https://x.com/yashYRS/status/1795340993757352402. Summary of the summary thread: like transformers, which have weak forward passes, SSMs are also in the TC0 computational complexity class, 'but cover distinct fragments within it'. 'SSMs can track hierarchical structures with optimal memory [...] suggesting that SSMs, while being more parallelizable, maintain sufficient power to handle the hierarchical structure of language.'
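
For context on what 'SSM' means here: at its core, a linear SSM layer computes the recurrence h_t = A h_{t-1} + B x_t with readout y_t = C h_t; because the update is linear in h, it can be evaluated with a parallel scan rather than a strictly sequential loop. Below is a minimal numpy sketch of that recurrence (the function name, variable names, and shapes are mine for illustration, not the paper's notation; recent SSMs like Mamba additionally make the transition input-dependent and diagonal):

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Minimal linear SSM over an input sequence (illustrative sketch).

    State update: h_t = A @ h_{t-1} + B @ x_t
    Readout:      y_t = C @ h_t

    A: (d, d), B: (d, d_in), C: (d_out, d), xs: sequence of (d_in,) vectors.
    """
    h = np.zeros(A.shape[0])  # initial state h_0 = 0
    ys = []
    for x in xs:
        h = A @ h + B @ x  # linear in h, hence parallelizable via a scan
        ys.append(C @ h)
    return np.array(ys)
```

The linearity in h is what distinguishes these models from classical nonlinear RNNs and is what makes the forward pass parallelizable.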

Abstract:

Recently, recurrent models based on linear state space models (SSMs) have shown promising performance in language modeling (LM), competitive with transformers. However, there is little understanding of the in-principle abilities of such models, which could provide useful guidance to the search for better LM architectures. We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs. We find that SSMs and transformers have overlapping but distinct strengths. In star-free state tracking, SSMs implement straightforward and exact solutions to problems that transformers struggle to represent exactly. They can also model bounded hierarchical structure with optimal memory even without simulating a stack. On the other hand, we identify a design choice in current SSMs that limits their expressive power. We discuss implications for SSM and LM research, and verify results empirically on a recent SSM, Mamba.
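
To make 'star-free state tracking' concrete, here is a toy illustration (my own, not taken from the paper): the flip-flop task asks a model to remember the most recently written bit across arbitrarily many intervening 'read' steps. An input-dependent diagonal recurrence h_t = a_t * h_{t-1} + b_t solves it exactly with a single state dimension: a 'write v' step sets (a_t, b_t) = (0, v), resetting the state, and a 'read' step sets (a_t, b_t) = (1, 0), carrying it forward:

```python
def flipflop(ops):
    """Exact flip-flop tracking with a 1-dimensional selective recurrence.

    ops: sequence of ("write", bit) or ("read", None) instructions.
    Returns the state after each step; after any "read", the state equals
    the most recently written bit.
    """
    h = 0.0
    states = []
    for op, v in ops:
        if op == "write":
            a, b = 0.0, float(v)  # reset: forget h, load the new bit
        else:  # "read"
            a, b = 1.0, 0.0       # carry: keep h unchanged
        h = a * h + b             # the input-dependent linear update
        states.append(h)
    return states

# Example: write 1, read, read, write 0, read  ->  [1, 1, 1, 0, 0]
print(flipflop([("write", 1), ("read", None), ("read", None),
                ("write", 0), ("read", None)]))
```

This is the kind of 'straightforward and exact' solution the abstract contrasts with transformers, which, per the paper, struggle to represent such state tracking exactly.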

3 comments


comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-05-28T14:21:14.560Z · LW(p) · GW(p)

Whoever (strong?) downvoted this, I'm genuinely curious about the reasoning behind it, especially since my previous linkpost also got downvoted to negative numbers at some point, for reasons unknown to me.

Replies from: mateusz-baginski
comment by Mateusz Bagiński (mateusz-baginski) · 2024-05-28T15:18:50.671Z · LW(p) · GW(p)

It wasn't me, but it's probably about spreading AI-capabilities-relevant knowledge.

Replies from: bogdan-ionut-cirstea
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-05-28T15:22:31.271Z · LW(p) · GW(p)

Huh, that would seem a bit unnuanced to me, especially since I mentioned weak forward passes; though maybe I should add more context, e.g. this comment [LW(p) · GW(p)] / the whole thread. 

Thanks, might try to edit with a bit more context!