Theoretical Limitations of Autoregressive Models

post by Gabriel Wu (gabriel-wu) · 2023-05-30T02:37:51.020Z · LW · GW · 1 comment

This is a link post for https://gabrieldwu.github.io/limitations-autoregressive

Contents

    The Autoregressive Type Signature
    A Note on Architecture
    Randomized Tasks
    Factoring
    Where this comes up
    Ways to get around this
    Conclusion
  Appendix
    Generating primes deterministically
    Even approximating is hard

Thanks to Alex Cai for the discussion that inspired this post.

The Autoregressive Type Signature

Let's say you want to train an AI chatbot. It needs to be able to take in text (the user's question or instructions) and output text (its response). Intuitively you would think that the model, as a function $f$ from inputs to outputs, should have the following type signature:

$$f: \text{Text} \to \Delta(\text{Text})$$

or more precisely,

$$f: \Sigma^* \to \Delta(\Sigma^*)$$

where $\Sigma$ is the set of all possible ASCII characters (or in practice, "tokens"), $^*$ is the Kleene star, and $\Delta(S)$ denotes the space of all probability distributions over a set $S$.

Unfortunately, it's basically impossible to directly design an architecture of this form. Any neural-network based model needs an output space of constant dimension, which $\Delta(\Sigma^*)$ is not (because there are an infinite number of possible strings).

To get around this, people came up with the ingenious idea of generating text autoregressively. Instead of having the model generate its entire response at once, we just train it to output the very next character of its output, given the user's question and its output so far. This has the type signature

$$f: \Sigma^* \to \Delta(\Sigma \cup \{\texttt{STOP}\})$$

where $\texttt{STOP}$ is a special token indicating the end of the response.

For a given question $q$, we can generate a response character by character. First, sample

$$y_1 \sim f(q).$$

Then, sample

$$y_2 \sim f(q y_1),$$

repeating this process until the model outputs the special stop token $\texttt{STOP}$ that indicates it's done with its output. Under this procedure, $f$ induces a function of the form $\Sigma^* \to \Delta(\Sigma^*)$. So why am I making such a big deal about the autoregressive type signature being different than a general text-to-text function?
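This sampling loop can be sketched in a few lines of Python. The model interface (a function from a prefix string to a dict of next-character probabilities) and the `STOP` name are my own illustrative choices, not anything prescribed by the post:

```python
import random

STOP = "§"  # stand-in for the special stop token

def generate(f, question, max_len=1000):
    """Sample a full response from a next-character model f.

    f maps a string (question + partial response) to a dict
    {character: probability} over the next character, including STOP.
    """
    response = ""
    for _ in range(max_len):
        dist = f(question + response)
        chars, probs = zip(*dist.items())
        c = random.choices(chars, weights=probs)[0]
        if c == STOP:
            break
        response += c
    return response

# A toy model that always answers "hi" and then stops.
def toy_model(prefix):
    suffix_len = len(prefix) - len("question")
    return [{"h": 1.0}, {"i": 1.0}, {STOP: 1.0}][suffix_len]

print(generate(toy_model, "question"))  # -> hi
```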

The point of this post is to show that there are many distributions of responses that can easily be outputted by a general text-to-text algorithm, but can't be produced by an autoregressive model in polynomial time (given some standard cryptographic assumptions like "factoring semiprimes is hard on average").

A Note on Architecture

Modern language models are almost always transformers, but I won't restrict my analysis to a particular choice of architecture for this post. The results here hold for any polytime algorithm with the autoregressive type signature mentioned above. In particular, I'll allow the algorithm to be randomized and to perform looping, as long as it terminates in a polynomial amount of time. Both of these are in contrast with transformers, which are deterministic[1] and have bounded circuit depth.

Randomized Tasks

First, I want to distinguish two types of tasks we may give a model. Most of the time when we prompt a language model, we don't really care about its distribution of responses, as long as the response satisfies our request with high probability. An example is "Write a proof of Fermat's Last Theorem in the style of Shakespeare" -- the model could place 100% probability on a single, high-quality response for all we care. Contrast this with what I'll call randomized tasks, in which we care about a property of the model's response distribution more than its particular response. A simple example is "give me a random 6-digit number." We'd be upset if the model always outputted the same number for this prompt (even if we wouldn't be able to tell from just one sample). While for the first type of task we might measure the model's performance as an expectation of a preference function over its response distribution, performance on randomized tasks is better expressed as a function of the response distribution itself.

Autoregressive models can do many randomized tasks just fine. For the previous example, all it needs to do is output a uniform distribution over $\{1, \ldots, 9\}$ for the first character, then a uniform distribution over $\{0, 1, \ldots, 9\}$ for each of the next five characters, and then the $\texttt{STOP}$ token.[2]
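As a concrete sketch, this next-character rule can be written out directly (the function name is mine, and `"STOP"` stands in for the special stop token):

```python
def six_digit_next_char(prefix):
    """Next-character distribution for "give me a random 6-digit number".

    prefix is the portion of the response emitted so far.
    """
    if len(prefix) == 0:
        # The leading digit of a 6-digit number must be nonzero.
        return {d: 1 / 9 for d in "123456789"}
    if len(prefix) < 6:
        return {d: 1 / 10 for d in "0123456789"}
    return {"STOP": 1.0}
```

Sampling character by character from these distributions yields each of the 900,000 six-digit numbers with equal probability.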

Autoregressive models can also solve more complicated randomized tasks like "Give me a random $n$-digit prime number" (for some large value of $n$). Here's where it really helps to allow the model to perform a randomized algorithm. Given a prefix of already-outputted digits, to generate the next digit the model can randomly sample $n$-digit completions of the prefix until it finds a prime. It can then output a next-character probability distribution that places 100% of its mass on the next digit of that prime. It's straightforward to see that this process induces a perfect uniform distribution over the $n$-digit primes, and it also runs in expected $\text{poly}(n)$ time. (See details about a deterministic solution in the Appendix.)
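Here is a runnable sketch of this rejection-sampling rule for tiny $n$ (the function names are mine, and a real model would need a fast primality test such as Miller-Rabin for large $n$; trial division suffices at this scale):

```python
import random

def is_prime(m):
    """Trial-division primality test (fine for small illustrative n)."""
    if m < 2:
        return False
    return all(m % d for d in range(2, int(m**0.5) + 1))

def next_digit(prefix, n):
    """Randomized next-digit rule for 'give me a random n-digit prime':
    sample uniform n-digit completions of the prefix until one is prime,
    then put probability 1 on that prime's next digit."""
    while True:
        if prefix == "":
            candidate = str(random.randint(10 ** (n - 1), 10 ** n - 1))
        else:
            candidate = prefix + "".join(
                random.choice("0123456789") for _ in range(n - len(prefix)))
        if is_prime(int(candidate)):
            return candidate[len(prefix)]

# Generate one uniformly random 3-digit prime, digit by digit.
random.seed(0)
prime = ""
while len(prime) < 3:
    prime += next_digit(prime, 3)
print(prime)
```

Note that every prefix this process produces is, by construction, the prefix of some $n$-digit prime, so the inner loop always terminates.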

Factoring

Next, consider the task: "Choose two random $n$-digit prime numbers $p, q$, then output the tuple $(p, q, pq)$ and nothing else." Given the ability to do the prime generation task in the previous section, this should be a no-brainer. The model first generates appropriate distributions for the digits of $p$, and does it again for $q$. Then when generating the digits of $pq$, it looks back at the values of $p$ and $q$ it has previously generated, multiplies them in its head, then outputs the digits of this product one at a time (always placing 100% probability on the single correct next digit). The resulting distribution of responses perfectly matches the desired one.
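The "look back and multiply" step can be sketched as a deterministic next-character rule. The response format `"(p, q, pq)"` and the parsing below are my assumed conventions, not the post's:

```python
def product_digits_next(response_so_far):
    """Once p and q are visible in the output stream, the next-character
    distribution for the digits of p*q is deterministic: parse p and q
    back out of the response and read off the product.

    Assumes the response so far looks like "(p, q, <partial product>".
    """
    inside = response_so_far.strip("()")
    p_str, q_str, partial = [t.strip() for t in inside.split(",")]
    product = str(int(p_str) * int(q_str))
    if len(partial) == len(product):
        return ")"  # product fully emitted; close the tuple
    return product[len(partial)]

print(product_digits_next("(11, 13, 14"))  # -> 3   (since 11 * 13 = 143)
```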

But now, consider this seemingly trivial variation of the task: "Choose two random $n$-digit prime numbers $p, q$, then output the tuple $(pq, p, q)$ and nothing else." All we've done here is changed the order of the tuple so the model is forced to generate the digits of $pq$ before either of its factors.

Claim 1: No polytime randomized algorithm can output $(pq, p, q)$ autoregressively with the correct distribution.

The insight here is that an autoregressive algorithm has no place to "store its work" apart from the tokens it directly outputs. When generating the first digit of $p$, the algorithm $f$ only has access to what's written in the output stream -- in this case, $pq$. It has no way of knowing the values of any intermediate variables it used to generate $pq$ in the first place.

In fact, we can use this observation to prove the claim. Assume for the sake of contradiction that there exists such an algorithm $f$. Then, we can use $f$ to factor any semiprime in polynomial time (a semiprime is an integer with exactly two prime factors).[3] Given a semiprime $s$, we simply prompt the model with:

"Choose two random $n$-digit prime numbers $p, q$, then output the tuple $(pq, p, q)$ and nothing else." followed by the forced partial output "$(s, $"

and then autoregressively generate the rest of its response, which will have the form: "$p, q)$". We're then guaranteed that $pq = s$ because conditioning the true distribution of $(pq, p, q)$ on the first element of the tuple forces the next two integers to be the correct factors. But since factoring arbitrary semiprimes is hard, no polytime $f$ is believed to exist.
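In code, the reduction looks like this. `model_generate` is a hypothetical autoregressive sampler for the task; the `fake_model` stand-in "cheats" with trial division purely to exercise the plumbing:

```python
def factor_semiprime(s, model_generate):
    """Reduction from Claim 1: use a (hypothetical) autoregressive sampler
    for 'choose two random n-digit primes p, q and output (pq, p, q)'
    to factor the semiprime s. We force the first tuple element to be s
    by placing it in the prompt ourselves."""
    prompt = ("Choose two random n-digit prime numbers p, q, then output "
              f"the tuple (pq, p, q) and nothing else.\n({s}, ")
    completion = model_generate(prompt)        # expected to look like "p, q)"
    p_str, q_str = completion.rstrip(")").split(",")
    return int(p_str), int(q_str)

# Stand-in "model" that actually factors by trial division,
# just so the reduction above has something to call.
def fake_model(prompt):
    s = int(prompt.rsplit("(", 1)[1].rstrip(", "))
    d = next(d for d in range(2, s) if s % d == 0)
    return f"{d}, {s // d})"

print(factor_semiprime(143, fake_model))  # -> (11, 13)
```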

This demonstrates that some output distributions which are easily generated by a general text-to-text algorithm (of the form $\Sigma^* \to \Delta(\Sigma^*)$) cannot be generated autoregressively. See the Appendix for a stronger result which shows that the correct output distribution cannot even be approximated autoregressively.

Where this comes up

At this point, you might rightfully think: OK, but why would you ever ask an autoregressive language model to perform such a weird task? If for some reason you really need a random semiprime and its factors, why not just ask the model to output the tuple in the original order $(p, q, pq)$?

This example naturally extends to a more general class of randomized tasks which involve the model needing to make and remember randomized decisions that are not visible to the user until later in the conversation. For example, asking the model to play a guessing game in which it must commit to a secret answer at the start of the conversation but only reveal that answer at the end.

Ways to get around this

Do these considerations present a barrier to achieving human-level AI systems? Not at all. There are many simple ways to adapt the autoregressive type signature that get around this problem. For example, the model could be allowed to generate hidden tokens -- scratch work that is fed back into its context but never shown to the user -- giving it a place to store intermediate values (like $p$ and $q$) before committing to its visible output.

Conclusion

While I don't expect this observation about the limitations of autoregressive type signatures to be that important in the design of contemporary ML systems, it does provide an interesting theoretical idea to consider. There are various ways to adapt the autoregressive type signature to address these limitations, but understanding the fundamental differences between general text-to-text functions and autoregressive models remains valuable. Thank you for reading, and I encourage you to share your thoughts or point out mistakes!


Appendix

Generating primes deterministically

Say we want to solve the task "Give me a random $n$-digit prime number," but with a deterministic algorithm like a transformer. We're no longer able to make random choices (like choosing random numbers until we stumble upon a prime) to generate digits. Instead, for any prefix $s$ of digits outputted so far, we have to deterministically output the ten probabilities that correspond to the proportions of $n$-digit primes starting with the prefixes $s0, s1, \ldots, s9$.

I think this can roughly be done with the following procedure. The distribution over the first character should be slightly weighted towards lower digits because primes get rarer the bigger they are (to determine how much to weight smaller digits, the model could integrate the average prime density $1/\ln x$ over the appropriate ranges). The second character's distribution would depend on what it outputted for the first character, but it can also be approximated with an integral. The model could repeat this process up until the last few characters, at which point the integral approximation wouldn't be precise enough. For these last few digits, the model would have to manually think through all possible endings it could create, run primality tests on each of them, and then weight its distribution to be proportional to the number of primes it found in each range. Clearly, this process won't exactly achieve the distribution we want -- as far as I know, it's impossible to compute the exact proportion of $n$-digit primes that start with a given prefix in polynomial time unless you assume some crazy things, like being able to evaluate the prime-counting function $\pi(x)$ exactly in time polynomial in the number of digits of $x$. However, modulo some messy details, it seems like there exists a polynomial-time autoregressive algorithm that can output the desired distribution with some exponentially small error, as measured by total variational distance[4].
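For tiny $n$, the exact distribution this procedure is approximating can be computed by brute-force counting (my own illustration; this is precisely the computation that is infeasible in polynomial time for large $n$):

```python
def is_prime(m):
    """Trial-division primality test (fine at this scale)."""
    if m < 2:
        return False
    return all(m % d for d in range(2, int(m**0.5) + 1))

def exact_next_digit_dist(prefix, n):
    """Exact next-digit distribution for a uniform n-digit prime, found
    by counting every prime consistent with each extended prefix."""
    counts = {}
    for d in "0123456789":
        if prefix == "" and d == "0":
            continue  # an n-digit number can't start with 0
        lo = int((prefix + d).ljust(n, "0"))  # smallest n-digit number with this prefix
        hi = int((prefix + d).ljust(n, "9"))  # largest
        counts[d] = sum(is_prime(m) for m in range(lo, hi + 1))
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items() if c > 0}

print(exact_next_digit_dist("1", 2))  # equal weight on 1, 3, 7, 9 (primes 11, 13, 17, 19)
```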

Even approximating is hard

Let $S$ be the set of all valid tuples $(pq, p, q)$ for $n$-digit primes $p$ and $q$. We've shown with Claim 1 that no polytime $f$ can autoregressively produce a uniform distribution over $S$. But maybe there's a way to get the distribution approximately correct? Here we show that this is also impossible. First, we observe the following.

Claim 2: Say that a polytime randomized algorithm $f$ has an autoregressive response distribution with support $T \subseteq S$ when prompted as above. Then $|T| = o(|S|)$ (it must be vanishingly sparse).

This follows from our belief that no polytime algorithm exists that can factor even a constant fraction of semiprimes. Since $f$ only ever outputs correct factorizations, any semiprime that appears in its support can be factored by the previous reduction with probability $1$ -- so if $T$ contained a constant fraction of $S$, we could factor a constant fraction of all semiprimes.

Next, we show that even if we allow  to sometimes output incorrect answers, the distance between the two distributions must still be large.

Claim 3: For any polytime randomized algorithm $f$, the total variational distance[4] between its autoregressive response distribution and the uniform distribution over $S$ must be $1 - o(1)$.

Proof. This is a bit trickier; feel free to skip it if you're not interested (this is an appendix after all). Let $N = |S|$, i.e. there are $N$ relevant semiprimes $s_1, \ldots, s_N$. WLOG, assume that $f$ only ever outputs tuples of the form $(s_i, a, b)$ where $s_i$ is one of these semiprimes (note that we're not requiring $ab = s_i$). Doing anything else would only increase its TVD from the desired distribution without affecting the proof, since we always condition $f$ on a semiprime.

Say that when we try to get $f$ to factor $s_i$ (by prompting it with "$(s_i, $" as above), it outputs the right factorization with probability $p_i$. By standard cryptographic assumptions, factoring semiprimes is "hard on average", which means picking a random semiprime and trying to factor it with $f$ has a vanishing probability of working. We can write this as:

$$\frac{1}{N} \sum_{i=1}^N p_i = o(1).$$

However, $f$ may try to cheat by outputting certain "easier" semiprimes more frequently than others. Say that $f$ chooses to output $s_i$ as the first element of its tuple with probability $q_i$. Then the "overlap" between $f$'s autoregressive response distribution and the uniform distribution over $S$ (which is $1$ minus their TVD) can be written as:

$$\text{overlap} = \sum_{i=1}^N \min\left(q_i p_i, \frac{1}{N}\right).$$

We want to prove an upper bound on this overlap. How large can this expression be? Think of this in terms of the following setup: You have $N$ types of drinks, and each drink has a certain sugar ratio $p_i$. You want to mix the drinks together in some way, with drink $i$ making up a $q_i$ proportion of the mixed drink ($\sum_i q_i = 1$). The sweetness of the mixed drink is the appropriate linear combination of the sweetnesses of its component drinks, but with the strange restriction that the absolute sugar contribution from any given drink type is capped at $\frac{1}{N}$ (to mirror the expression above). How should you combine the drinks to get the sweetest mixture?

It's clear that you should start by pouring in the sweetest drink until its absolute sugar contribution reaches $\frac{1}{N}$. Then, pour in the next sweetest drink until its contribution reaches $\frac{1}{N}$, and so on. Keep doing this until you've used a total volume of $1$ (say that it requires going through $k$ drink types before this happens -- there's a potential off-by-one error here but it doesn't matter). If we assume $p_1 \ge p_2 \ge \cdots \ge p_N$, this corresponds to $\frac{1}{N p_i}$ volume of drink $i$, for $i \le k$. The sweetness of the mixture is:

$$\sum_{i=1}^k \frac{1}{N p_i} \cdot p_i = \frac{k}{N},$$

and we have the constraint that the total volume is $1$:

$$\sum_{i=1}^k \frac{1}{N p_i} = 1.$$

How large can $k$ be? Well, given the constraint that $\sum_{i=1}^k \frac{1}{N p_i} = 1$, the smallest possibility for $\sum_{i=1}^k p_i$ is achieved when all the $p_i$ are equal (to $\frac{k}{N}$) by concavity. But this means

$$\sum_{i=1}^k p_i \ge k \cdot \frac{k}{N} = \frac{k^2}{N},$$

and since the hard-on-average assumption gives $\sum_{i=1}^N p_i = o(N)$, we must have $\frac{k^2}{N} = o(N)$, i.e. $k = o(N)$.

So we've proven the sweetness of any mixed drink, and thus the overlap between $f$'s autoregressive response distribution and the uniform distribution over $S$, must be vanishingly small: it is at most $\frac{k}{N} = o(1)$. Thus, the TVD is $1 - o(1)$. $\blacksquare$
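As a numeric sanity check of this greedy argument (with synthetic $p_i$ values of my own choosing, not anything derived from real factoring algorithms):

```python
import random

def best_overlap(ps):
    """Greedily maximize sum(min(q_i * p_i, 1/N)) over mixing weights q_i:
    pour the sweetest drinks first, each until its sugar cap 1/N or the
    total volume 1 is reached."""
    N = len(ps)
    volume, overlap = 0.0, 0.0
    for p in sorted(ps, reverse=True):
        if volume >= 1:
            break
        v = min(1 / (N * p), 1 - volume)  # pour until sugar cap or volume cap
        volume += v
        overlap += min(v * p, 1 / N)
    return overlap

# N "semiprimes" whose factoring probabilities p_i average around eps.
random.seed(0)
N, eps = 10_000, 0.01
ps = [min(1.0, random.expovariate(1 / eps)) for _ in range(N)]
print(best_overlap(ps))  # stays far below 1, as the proof predicts
```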


  1. ^

Don't confuse the randomization that comes from the sampling procedure with internal randomness of the model. Transformers are completely deterministic functions from a sequence of tokens to a vector of output logits. The only randomness during the text generation process comes from the process of sampling from the next-token distribution at every step. Contrast this with general randomized algorithms, for which the output logits themselves are allowed to be randomly generated. Of course, we can convert any such randomized algorithm $f$ into a deterministic algorithm $f'$ by averaging its output distribution over all possible sequences of internal coin flips (this is a "conversion" in the sense that their response distributions on any input, as induced by autoregressive generation, are equivalent). However, $f'$ may no longer be polytime.

  2. ^

    In practice though, GPT-4 is really bad at this. I asked this exact prompt a bunch of times at temperature 1 and got the following responses: 253489, 572931, 523749, 935271, 754823, 543219, 523891, 523689, 537218, 247856, 527893, 537891, 523198, 523817, 523481, 254839. I'm sure we could solve this with training techniques better suited for randomized tasks, although it's very interesting to think about how the current RL from human feedback methods would have to be modified to accomplish this.

  3. ^

...as long as both factors have the same number of digits $n$. But this can be fixed with a trivial restatement of the prompt.

  4. ^

Given two distributions $P, Q$, we define their total variational distance to be $\delta(P, Q) = \frac{1}{2} \sum_x |P(x) - Q(x)|$. Note that this distance is always between $0$ and $1$.
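In code, for finite distributions represented as dictionaries:

```python
def tvd(p, q):
    """Total variational distance between two finite distributions
    given as {outcome: probability} dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

print(tvd({"a": 0.5, "b": 0.5}, {"a": 1.0}))  # -> 0.5
```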

1 comment


comment by gjm · 2023-05-30T10:32:12.886Z · LW(p) · GW(p)

The "hidden tokens" approach seems to be pretty much what (some of?) our somewhat-LLM-like brains do: if we have something complicated to think about, or we don't want to share every detail of our thinking with those around us, we produce something resembling language or imagery internally but don't say it out loud.

My understanding is that people vary more than one might think in exactly what this internal thinking-trace looks like -- e.g., some people are more verbal, some less. I'm pretty verbal myself and don't know what the experience of thinking through, say, how to prove a theorem or implement an algorithm is like for less verbal people. Maybe it's not so much like "hidden tokens". But my guess is that it is still somewhat hidden-token-like, under the hood. (Maybe something like an "internalized" version of the output of multimodal models that produce both text and images? But I don't know much about how those work.)