Comments
Alright. Interested to see the new post. Your content is great btw.
A bit late to the party. Love the article, but I believe it is somewhat misleading to say that transformers run in constant time.
If we take the number of tokens in the input sentence as the input size n, which I'm sure you'll agree is the obvious choice, then the transformer encoder runs on every token in the sentence. It can do that in parallel, but it still has to perform its computations for each input token, which immediately gives at least O(n) total work.
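To make that concrete, here's a minimal toy sketch (my own illustration, not anything from the article) of the point: even if the per-token work runs fully in parallel, the total number of operations the encoder performs still grows with n.

```python
# Toy model of an encoder: the same fixed-cost computation is applied
# to every input token. With n parallel processors the wall-clock time
# could be O(1), but the total operations performed are still O(n).

def per_token_work(token):
    # Stand-in for the constant amount of work a real encoder does per
    # token (attention, MLP, etc.); the real cost is much larger.
    return token * 2

def encode(tokens):
    # One unit of per-token work for each of the n tokens.
    return [per_token_work(t) for t in tokens]

for n in (1, 10, 100):
    out = encode(list(range(n)))
    print(n, len(out))  # units of work performed scale linearly with n
```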
I do think the point you're trying to make is different, though. Correct me if I'm wrong, but it seems like you're saying that for each token it generates, the transformer only gets a constant amount of compute, and therefore it cannot do division or anything else requiring more than O(1) time without failing on some hard instance. This additionally assumes it generates only a single token.
If that is your argument, I do still find it interesting, as it means a large portion of intelligence can be simulated in O(1). But all I'm trying to say is that that is not actually the time complexity of the transformer. It is the time complexity the transformer can simulate, assuming it only generates one token.
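Here's a toy sketch of the distinction I mean (the names and layer count are hypothetical, not from the article): the number of layer passes per generated token is fixed when the model is built, so the only way to spend more compute on a harder problem is to emit more tokens.

```python
NUM_LAYERS = 12  # fixed at training time; illustrative value

def layer_pass(state):
    # Stand-in for one transformer block's constant-cost work.
    return state + 1

def generate_one_token(state):
    # Exactly NUM_LAYERS units of work per generated token, no matter
    # how hard the prompt is -- the O(1)-per-token budget above.
    for _ in range(NUM_LAYERS):
        state = layer_pass(state)
    return state

def generate(state, num_tokens):
    # Total compute scales with the number of tokens generated,
    # not with the difficulty of the problem being asked.
    for _ in range(num_tokens):
        state = generate_one_token(state)
    return state

print(generate(0, 3))  # 3 tokens -> 3 * NUM_LAYERS units of work
```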