Transformer inductive biases & RASP
post by Vivek Hebbar (Vivek) · 2022-02-24T00:42:30.084Z · LW · GW
This is a link post for http://proceedings.mlr.press/v139/weiss21a/weiss21a.pdf
This is a paper by Weiss et al. introducing the "Restricted Access Sequence Processing Language" (RASP), a computational framework for describing what transformers can do.
| DL architecture | Computational framework |
|---|---|
| Feedforward networks | Circuits |
| RNNs | Finite state machines |
| Transformers | RASP (this paper) |
Any program written in RASP can be implemented straightforwardly by a transformer. Indeed, the authors demonstrate that RASP can be "compiled" to a transformer, and that transformers can be trained to "mimic a RASP solution".
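To make this concrete: RASP programs are built mainly from `select`, which turns a predicate over key and query positions into an attention pattern, and `aggregate`, which averages values under that pattern -- together they correspond to one attention step. The following is my own toy Python emulation of those two primitives (not the paper's compiler, and the argument conventions are my assumption), used here to express the "reverse" task.

```python
# Toy emulation of RASP's select/aggregate primitives on plain Python lists.
# `select` builds a boolean attention pattern; `aggregate` mean-pools values
# under that pattern -- roughly one attention step in the paper's picture.

def select(keys, queries, predicate):
    """Selection matrix: entry [q][k] is True where predicate(k, q) holds."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values, default=None):
    """Average the selected values for each query position."""
    out = []
    for row in selector:
        chosen = [v for v, sel in zip(values, row) if sel]
        out.append(sum(chosen) / len(chosen) if chosen else default)
    return out

# "reverse" in RASP style: each position attends to its mirror position.
tokens = [3, 1, 4, 1, 5]
indices = list(range(len(tokens)))
flip = select(indices, indices, lambda k, q: k == len(tokens) - 1 - q)
print(aggregate(flip, tokens))  # [5.0, 1.0, 4.0, 1.0, 3.0]
```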
From the abstract (emphasis mine):
"We map the basic components of a transformer-encoder—attention and feed-forward computation—into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP). We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer, and how a Transformer can be trained to mimic a RASP solution. In particular, we provide RASP programs for histograms, sorting, and Dyck-languages. We further use our model to relate their difficulty in terms of the number of required layers and attention heads: analyzing a RASP program implies a maximum number of heads and layers necessary to encode a task in a transformer. Finally, we see how insights gained from our abstraction might be used to explain phenomena seen in recent works."
One implication is that RASP program length may be a useful proxy for the inductive bias of transformers: functions with shorter RASP programs are plausibly easier for transformers to learn. I will probably use this in future work along the lines of "How complex are myopic imitators? [AF · GW]".
If you know of any tricks that transformers use that are not covered by RASP, please leave a comment -- it would be highly useful to me.
4 comments
comment by gwern · 2022-02-24T01:28:38.190Z · LW(p) · GW(p)
Some earlier discussion: https://www.lesswrong.com/posts/Lq6jo5j9ty4sezT7r/teaser-hard-coding-transformer-models [LW · GW]
↑ comment by Vivek Hebbar (Vivek) · 2022-02-24T01:31:55.742Z · LW(p) · GW(p)
Nice! Do you know if the author of that post was involved in RASP?
↑ comment by Vivek Hebbar (Vivek) · 2022-02-24T01:34:56.643Z · LW(p) · GW(p)
Checked; the answer is no: https://www.lesswrong.com/posts/Lq6jo5j9ty4sezT7r/teaser-hard-coding-transformer-models?commentId=ET24eiKK6FSJNef7G