Transformer inductive biases & RASP

post by Vivek Hebbar (Vivek) · 2022-02-24T00:42:30.084Z · LW · GW · 4 comments

This is a link post for http://proceedings.mlr.press/v139/weiss21a/weiss21a.pdf


This is a paper by Weiss et al. introducing the "Restricted Access Sequence Processing Language" (RASP), a computational framework for what transformers can do.

DL architecture       | Computational framework
----------------------|------------------------
Feedforward networks  | Circuits
RNNs                  | Finite state machines
Transformers          | RASP (this paper)

Any program written in RASP can be implemented straightforwardly by a transformer.  Indeed, the authors demonstrate that RASP can be "compiled" to a transformer, and that transformers can be trained to "mimic a RASP solution".

From the abstract (emphasis mine):

"We map the basic components of a transformer-encoder—attention and feed-forward computation—into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP). We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer, and how a Transformer can be trained to mimic a RASP solution. In particular, we provide RASP programs for histograms, sorting, and Dyck-languages. We further use our model to relate their difficulty in terms of the number of required layers and attention heads: analyzing a RASP program implies a maximum number of heads and layers necessary to encode a task in a transformer. Finally, we see how insights gained from our abstraction might be used to explain phenomena seen in recent works."

One implication of this is that program length (simplicity) in RASP may be a good approximation to the inductive bias of transformers.  I will probably use this in future work along the lines of "How complex are myopic imitators?".

If you know of any tricks that transformers use that are not covered by RASP, please leave a comment -- it would be highly useful to me.

4 comments

Comments sorted by top scores.

comment by gwern · 2022-02-24T01:28:38.190Z · LW(p) · GW(p)

Some earlier discussion: https://www.lesswrong.com/posts/Lq6jo5j9ty4sezT7r/teaser-hard-coding-transformer-models

comment by Vivek Hebbar (Vivek) · 2022-02-24T01:31:55.742Z · LW(p) · GW(p)

Nice!  Do you know if the author of that post was involved in RASP?

comment by gwern · 2022-02-24T01:34:07.272Z · LW(p) · GW(p)

They apparently reinvented RASP independently.

comment by Vivek Hebbar (Vivek) · 2022-02-24T01:34:56.643Z · LW(p) · GW(p)

Checked; the answer is no: https://www.lesswrong.com/posts/Lq6jo5j9ty4sezT7r/teaser-hard-coding-transformer-models?commentId=ET24eiKK6FSJNef7G