Contra LeCun on "Autoregressive LLMs are doomed"

rotatingpaguro

post by rotatingpaguro · 2023-04-10T04:05:10.267Z · LW · GW · 20 comments

I answer LeCun's arguments against LLMs as presented in this LessWrong comment. I haven't searched thoroughly or double-checked LeCun's writings on the topic in detail. My argument is at the suggestive, hand-waving stage.

Introduction

Current large language models (LLMs) like GPT-x are autoregressive. "Autoregressive" means that the core of the system is a function $f$ (implemented by a neural network) that takes as input the last $n$ tokens (tokens are the atomic "syllables" used by the machine) and produces as output a probability distribution over the next token in the text. This distribution can be used to guess a good continuation of the text.

The function $f$ is fixed: it is always the exact same function at each step. Example: given the current token list

x = [John, <space>, ate, <space>, a, <space>, ban],

we evaluate $f(x)$, which gives high probability to, say, the token "ana". We concatenate this token to the list, obtaining

x' = [John, <space>, ate, <space>, a, <space>, ban, ana].

Applying the function again, we evaluate $f(x')$, which this time turns out to assign higher probability to "<space>". However, the computation that first led to "ana", then to "<space>", is the same. There is no hidden "mind state" of the thing, with slots for concepts like "John", "the banana", et cetera; all there is, is the function mapping the last up-to-$n$ tokens to the probabilities for the next token. If there are "concepts", or "schemes", or "strategies", they somehow appear strictly within the computation of the function, and are cleared at each step.
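The loop described above can be sketched in a few lines of Python. This is only a toy under loud assumptions: `TOY_TABLE`, `N_CONTEXT`, `f`, `generate` and `greedy_next` are hypothetical names, the probabilities are made up, and a lookup table stands in for the neural network. What it shows is that the same function is called at every step and that the only thing carried forward is the token list itself.

```python
import random

# Hypothetical toy stand-in for the fixed function f: it maps the last
# N_CONTEXT tokens to a probability distribution over the next token.
N_CONTEXT = 2  # assumed context length n, tiny on purpose

# Made-up probabilities, for illustration only.
TOY_TABLE = {
    ("<space>", "ban"): {"ana": 0.9, "k": 0.1},
    ("ban", "ana"): {"<space>": 0.8, "s": 0.2},
}

def f(tokens):
    """Return next-token probabilities given only the last N_CONTEXT tokens."""
    context = tuple(tokens[-N_CONTEXT:])
    # Unknown contexts fall back to an end-of-text token in this toy.
    return TOY_TABLE.get(context, {"<eot>": 1.0})

def generate(tokens, steps, rng):
    """Autoregressive loop: call the same f, sample a token, append, repeat."""
    tokens = list(tokens)
    for _ in range(steps):
        probs = f(tokens)
        choices, weights = zip(*probs.items())
        next_token = rng.choices(choices, weights=weights, k=1)[0]
        tokens.append(next_token)  # the text is the only state carried forward
    return tokens

def greedy_next(tokens):
    """Pick the most probable next token instead of sampling."""
    probs = f(tokens)
    return max(probs, key=probs.get)

x = ["John", "<space>", "ate", "<space>", "a", "<space>", "ban"]
print(greedy_next(x))            # most likely continuation of x
print(greedy_next(x + ["ana"]))  # same f, applied to the extended list
print(generate(x, 2, random.Random(0)))
```

Nothing in `generate` remembers anything between iterations except `tokens`, which is the sense in which there is no hidden "mind state".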

LeCun is a very famous Machine Learning researcher. In these slides and this tweet, he explains why he thinks that (quoting) "Auto-Regressive LLMs are doomed."

The argument against autoregressive LLMs

I report verbatim Razied's summary of the argument, plus a pair of follow-up comments I picked:

I will try to explain Yann Lecun's argument against auto-regressive LLMs, which I agree with. The main crux of it is that being extremely superhuman at predicting the next token from the distribution of internet text does not imply the ability to generate sequences of arbitrary length from that distribution.
GPT4's ability to impressively predict the next token depends very crucially on the tokens in its context window actually belonging to the distribution of internet text written by humans. When you run GPT in sampling mode, every token you sample from it takes it ever so slightly outside the distribution it was trained on. At each new generated token it still assumes that the past 999 tokens were written by humans, but since its actual input was generated partly by itself, as the length of the sequence you wish to predict increases, you take GPT further and further outside of the distribution it knows.
The most salient example of this is when you try to make chatGPT play chess and write chess analysis. At some point, it will make a mistake and write something like "the queen was captured" when in fact the queen was not captured. This is not the kind of mistake that chess books make, so it truly takes it out of distribution. What ends up happening is that GPT conditions its future output on its mistake being correct, which takes it even further outside the distribution of human text, until this diverges into nonsensical moves.
As GPT becomes better, the length of the sequences it can convincingly generate increases, but the probability of a sequence being correct is (1-e)^n, cutting the error rate in half (a truly outstanding feat) merely doubles the length of its coherent sequences.
To solve this problem you would need a very large dataset of mistakes made by LLMs, and their true continuations. You'd need to take all physics books ever written, intersperse them with LLM continuations, then have humans write the corrections to the continuations, like "oh, actually we made a mistake in the last paragraph, here is the correct way to relate pressure to temperature in this problem...". This dataset is unlikely to ever exist, given that its size would need to be many times bigger than the entire internet.
The conclusion that Lecun comes to: auto-regressive LLMs are doomed.
[...]
This problem is not coming from the autoregressive part, if the dataset GPT was trained on contained a lot of examples of GPT making mistakes and then being corrected, it would be able to stay coherent for a long time (once it starts to make small deviations, it would immediately correct them because those small deviations were in the dataset, making it stable). This doesn't apply to humans because humans don't produce their actions by trying to copy some other agent, they learn their policy through interaction with the environment. So it's not that a system in general is unable to stay coherent for long, but only those systems trained by pure imitation that aren't able to do so.
[...]
This same problem exists in the behaviour cloning literature, if you have an expert agent behaving under some policy $\pi_{\text{expert}}$, and you want to train some other policy to copy the expert, samples from the expert policy are not enough, you need to have a lot of data that shows your agent how to behave when it gets out of distribution, this was the point of the DAGGER paper, and in practice the data that shows the agent how to get back into distribution is significantly larger than the pure expert dataset. There are very many ways that GPT might go out of distribution, and just showing it how to come back for a small fraction of examples won't be enough.

I'll try to explain this in a more intuitive way:

Imagine you know zero chess, and set out to learn it from the renowned master Gino Scacchi. You dutifully observe Mr. Scacchi playing to the best of his ability in well-balanced matches against other champions. You go through many matches, take notes, scrutinize them, contemplate them, turn your brain to mush reflecting on the master's moves. Yet, despite all the effort, you don't come out a good chess player.

Unscathed, you try playing yourself against the master. You lose again, again, and again. Gino silently makes his moves and swiftly corners you each time. In a while, you manage not to lose right away, but your defeat still comes pretty quickly, and your progress in defeat-time is biblically slow. It seems like you would need to play an incredibly large number of matches to get to a decent level.

Finally, you dare ask the master for advice. He explains to you opening schemes, strategies, tactics, and gives his comment on positions. He lets you go repeatedly from the same starting positions to learn to crack them. He pitches you against other apprentices at your level. You feel you are making steady progress and getting the hang of the game.

This little story illustrates a three-runged view of learning:

1. Imitative learning: you only get to passively observe the expert doing his own thing in his environment. It is very difficult, because if the expert is laying down many carefully chosen consecutive steps to reach his goal, and the environment is rich, the number of possible sequences of actions explodes combinatorially. Each time, the expert does something apparently new and different. You would need to observe the expert in an incredibly large number of situations and memorize all the possible paths he takes before grasping his method.
2. Autonomous learning: you can interact with the environment and with the expert, who is still doing his own thing, and you are given feedback on the end result of your actions. This allows you to check how you are doing, which is an advantage. However, if getting good rewards requires nailing many things right one after the other, it will still take a large number of trials before you start getting the scheme of the thing.
3. Guided learning: the expert is a teacher. He takes you through short subsequences of actions, with immediate feedback, that are specifically optimized to have you learn schemes that, when combined, will constitute a good algorithm to pick your actions. The teaching is a process optimized to copy the expert's algorithm into your mind.

GPTs are trained on paradigm (1): they are run through swathes of real-world internet text, written by humans doing their own thing, trying to have the function $f$ predict the next bit of text given the last $n$ bits. After that, you have your language model. You hope that all the associations it glimpsed in that pile of words are sufficient to reproduce the scheme of humans' thoughts. But the number of possible sequences of words that make sense is immense, even compared to the large training datasets of these models. Furthermore, your model is only looking at associations within sequences of length $n$. So it should not really have observed enough human text (the expert's actions) to grasp the human mental schemes (the expert's algorithm) well.
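To make "predict the next bit of text given the last $n$ bits" concrete, here is a minimal sketch that fits a next-token model by plain counting. This is an assumption-laden stand-in: real GPTs fit a neural network by gradient descent, not a count table, and the names `fit_next_token_model` and `average_log_loss` as well as the tiny corpus are invented for illustration. It does show the coverage worry in miniature: a context/continuation pair never seen in training gets no probability at all.

```python
from collections import Counter, defaultdict
import math

def fit_next_token_model(corpus_tokens, n):
    """Estimate P(next | last n tokens) by counting (a crude stand-in for training)."""
    counts = defaultdict(Counter)
    for i in range(n, len(corpus_tokens)):
        context = tuple(corpus_tokens[i - n:i])
        counts[context][corpus_tokens[i]] += 1
    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        model[context] = {tok: c / total for tok, c in counter.items()}
    return model

def average_log_loss(model, tokens, n):
    """Mean negative log-probability of each actual next token (lower is better)."""
    losses = []
    for i in range(n, len(tokens)):
        context = tuple(tokens[i - n:i])
        p = model.get(context, {}).get(tokens[i], 0.0)
        losses.append(-math.log(p) if p > 0 else float("inf"))
    return sum(losses) / len(losses)

# Made-up miniature "internet".
human_text = "the cat sat on the mat and the cat ate".split()
model = fit_next_token_model(human_text, n=1)

print(model[("the",)])                            # distribution learned after "the"
print(average_log_loss(model, human_text, n=1))   # finite on the training text
# A continuation the model never observed gets infinite loss.
print(average_log_loss(model, "the mat ate".split(), n=1))
```

A neural network generalises where a count table cannot, so the infinite loss is an artefact of the toy; the open question in the text is how far that generalisation reaches.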

Here comes the "probability of error" argument: given that it cannot have learned the underlying pattern, at each token it generates there is some average probability that its superficial associations make a mistake, in the sense of producing something a human would not. Once the mistake is made, it is fed back into the function when predicting the next token. So now the superficial associations, tuned to fit real human text, are applied to this unrealistic thing. Since the space of text is much larger than the space of text that makes sense, and the space of associations that snap to sense-making text is much larger than the space of such associations that also pull towards sense-making text from any direction in text-space, the next token will probably be further out of distribution. And so on. If doing something right requires doing it right at each step, and the probability of error at each step is $e$, then after $m$ steps the probability of being right is $(1 - e)^{m}$. Even if $e$ is small, this is an exponential curve w.r.t. $m$, and so goes to zero quickly above some point.
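A quick numerical sketch of the $(1 - e)^{m}$ curve, under the argument's own assumptions (independent errors, no recovery). The function names and the 50% threshold are arbitrary choices made for this illustration. It also checks the claim in Razied's summary that halving the error rate roughly doubles the coherent length.

```python
import math

def p_all_correct(e, m):
    """Probability that m consecutive steps are all right, with independent per-step error e."""
    return (1.0 - e) ** m

def coherent_length(e, target=0.5):
    """Number of steps m at which (1 - e)^m has dropped to `target`."""
    return math.log(target) / math.log(1.0 - e)

for e in (0.01, 0.005):
    print(e, round(coherent_length(e), 1), p_all_correct(e, 1000))

# Halving e roughly doubles the length at which the curve crosses the target.
print(coherent_length(0.005) / coherent_length(0.01))
```

Whether real LLM errors are independent and unrecoverable is exactly what the counter-argument disputes; the code only illustrates the premise.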

LeCun's slides lay down a plan of type 2 (autonomous learning) to solve this. Razied makes the point that

if the dataset GPT was trained on contained a lot of examples of GPT making mistakes and then being corrected, it would be able to stay coherent for a long time (once it starts to make small deviations, it would immediately correct them because those small deviations were in the dataset, making it stable) [...] you would need a very large dataset of mistakes made by LLMs, and their true continuations [...] This dataset is unlikely to ever exist, given that its size would need to be many times bigger than the entire internet.

which in some sense would be a type 3 (guided learning) strategy stuck by brute force into a type 1 (imitative learning) situation.
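To see why learned corrections would matter for stability, consider a two-state toy: at each step the text is either "on track" or "off track". The sketch below is an assumption of mine, not something from LeCun or Razied: `e` is the per-step chance of slipping off track and `c` is the per-step chance of recovering. Note that it tracks being on track at step $m$, which is weaker than never having erred.

```python
def p_on_track(e, c, m):
    """Probability of being on track after m steps.

    e: chance per step of slipping from on-track to off-track.
    c: chance per step of recovering from off-track to on-track.
    """
    p = 1.0  # start on track
    for _ in range(m):
        # Stay on track with prob (1 - e), or come back from off track with prob c.
        p = p * (1.0 - e) + (1.0 - p) * c
    return p

print(p_on_track(e=0.01, c=0.0, m=1000))  # no correction: equals (1 - e)^m
print(p_on_track(e=0.01, c=0.2, m=1000))  # with correction: settles near c / (e + c)
```

With `c = 0` the exponential decay is recovered exactly; with any positive `c` the probability levels off at `c / (e + c)` instead of going to zero. How large `c` is for an actual LLM is the empirical question.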

The counter-argument

I think there is some truth to this criticism. However, I do not think the reasoning behind it applies in full generality. I don't feel confident it describes the real-life situation with LLMs.

Consider Razied's hypothetical solution with the immense dataset of possible mistakes. The question is: how large would such a dataset need to be? Note that he already seems to assume that it would be way smaller than all the possible mistakes: he says "small deviations". Learning associations that correct small deviations around sense-making text is sufficient to fix the autoregressive process, since mistakes can be killed as they sprout. This looks like a way to short-circuit type 1 to type 3. Yet the space of all such small deviations and related associations still intuitively looks dauntingly immense, in an absolute sense, compared to the space of just sense-making text and associations, and compared to the space of stuff you can observe in a finite sequence of text produced by humans. Is there a shorter shortcut that implements type 3 within type 1?

More generally, yes. As an extreme case, imagine an expert operating in some rich domain, whose actions entailed building a Turing machine that implemented the expert itself and running it. An agent faithfully imitating the expert would get a functional expert behavior after a single learning session. To bend the chess allegory, if you were in some alien conceptual world where chess was Turing-complete and chess grandmasters were short Turing machines if written in chess, you might be able to become a chess grandmaster just by observing a grandmaster's actual play. This weird scenario violates the assumption of the "probability of error" argument, that the expert mind could probably not be inferred from its actions.

This argument carries over to LLMs in the following way: human language is rich. It is flexible, for you can express in principle any thought in it. It is recursive, for you can talk about the world, yourself within the world, and the language within yourself within the world, and so on. Intuitively, it can be that language contains the schemes of human thought, not just as that abstract thing which produced the stream of language, but within the language itself, even though we did not lay down explicitly the algorithm of a human in words. If imitation training can find associations that somehow tap into this recursiveness, it could be that optimizing the imitation of a relatively short amount of human text is sufficient to crack humans.

This is speculative. Does it really apply in practice? What could be a specific, concrete example of useful cross-level associations appearing in a text and being accessible by GPT? What could be a mathematical formalization?

20 comments

Comments sorted by top scores.

comment by gwern · 2023-04-11T01:41:55.414Z · LW(p) · GW(p)

LeCun is trivially wrong. Proof: "inner monologue works". Nothing more need be said. (It doesn't even rise to the level of an interesting math argument or example of fallaciously conjunctive reasoning, it's already just empirically wrong.)

Incidentally, as far as the chess example goes, I'm not sure if anyone has tried a proper inner-monologue self-critique approach going move by move in a FEN representation with full state available to verify legality, but if it fails, it might be that it's due to sparsity errors: now that I think about it, chess notation, with its strange use of maximally-abbreviated cryptic letters, is exactly a case where an occasional blindspot from sparsity would lead to serious failure of legality/state-reconstruction-from-history, and might look like 'it doesn't understand the rules of chess'.

Replies from: Razied

↑ comment by Razied · 2023-04-12T11:59:38.279Z · LW(p) · GW(p)

Could you expand on why you think inner monologue solves the behaviour cloning exponential divergence problem? It certainly doesn't seem trivial to me.

Replies from: gwern

↑ comment by gwern · 2023-04-14T23:15:59.190Z · LW(p) · GW(p)

Inner monologue demonstrates that error does not accumulate monotonically. During a monologue, it's completely possible to estimate an erroneous answer, and then revise and correct it and yield a correct answer. Thus, non-monotonic in length: probability of error went up and then it went down. If error really increased monotonically, by definition it could only go up and never down. Since the core claim is that it monotonically increases, he's just wrong. It's not monotonic, so all of the following claims and arguments are non sequiturs. Maybe it does diverge or whatever, but it can't be for that reason because that reason was not true to begin with.

(Maybe he's trying some sort of argument about, 'total log-likelihood loss can only increase with each additional token'; which is technically true, but inner-monologue's increasing probability of a correct final answer would then serve as a demonstration that it's irrelevant to what he's trying to argue and so empirically wrong in a different way.)

Replies from: Razied

↑ comment by Razied · 2023-04-15T00:34:58.935Z · LW(p) · GW(p)

I think there are two distinct claims here:

The probability of factual errors made by LLMs increases monotonically.
Sequences produced by LLMs very quickly become sequences with very low log-probability (compared with other sequences of the same length) under the true distribution of internet text.

I fully see how inner monologue invalidates the first claim, but the second one is much stronger. As the context window and the generated sequence length of an LLM get larger, the further out-of-distribution it has to be able to generalise in order for it to produce coherent and true text.

Replies from: rotatingpaguro

↑ comment by rotatingpaguro · 2023-04-15T12:56:42.608Z · LW(p) · GW(p)

I understood that the argument given for (2) (gets out of distribution) came from (1) (the (1-e)^n thing), so could you restate the argument for (2) in a self-contained way without using (1)?

Replies from: Razied

↑ comment by Razied · 2023-04-15T16:21:21.727Z · LW(p) · GW(p)

No problem, though apologies if I say stuff that you already know, it's both to remind myself of these concepts and because I don't know your background.

Suppose we have a Markov chain with some transition probability $p(x_{n+1} | x_n)$; here $p$ is the analogue of the true generating distribution of internet text. From information theory (specifically the Asymptotic Equipartition Property), we know that the typical probability of a long sequence will be $p(x_1, \ldots, x_n) = \exp(-n H_p(X))$, where $H_p(X)$ is the entropy of the process.

Now if $q(x_{n+1} | x_n)$ is a different Markov chain (the analogue of the LLM generating text), which differs from $p$ by some amount, say that the Kullback-Leibler divergence $D_{KL}(q \| p)$ is non-zero (which is not quite the objective that the networks are being trained with, that would be $D_{KL}(p \| q)$ instead), we can also compute the expected log-probability under $p$ of sequences sampled from $q$; this is going to be:

$E_{x_n \sim q} \log p(x_1, \ldots, x_n) = \int q(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n) \, dx_1 \ldots dx_n$

$= \int \left( q(x_1, \ldots, x_n) \log \frac{p(x_1, \ldots, x_n)}{q(x_1, \ldots, x_n)} + q(x_1, \ldots, x_n) \log q(x_1, \ldots, x_n) \right) dx_1 \ldots dx_n$

The second term in this integral is just $-n H_q(X)$, $n$ times the entropy of $q$, and the first term is $-n D_{KL}(q \| p)$, so when we put everything together:

$p(x_1, \ldots, x_n) = \exp(-n (D_{KL}(q \| p) + H_q(X)))$

So any difference at all between $H_p(X)$ and $D_{KL}(q \| p) + H_q(X)$ will lead to the probability of almost all sequences sampled from our language model being exponentially squashed relative to the probability of most sequences sampled from the original distribution. I can also argue that $H_q(X)$ will be strictly larger than $H_p(X)$: the latter essentially can be viewed as the entropy resulting from a perfect LLM with infinite context window, and $H(X | Y) \leq H(X)$, conditioning on further information does not increase the entropy. So $\tau \equiv D_{KL}(q \| p) + H_q(X) - H_p(X)$ will definitely be positive, and the ratio between the probability of a typical sequence of length $n$ sampled from internet text and one sampled from the LLM will go like $e^{-n\tau}$.

This means that if you sample long enough from an LLM, and more importantly as the context window increases, it must generalise very far out of distribution to give good outputs. The fundamental problem of behaviour cloning I'm referring to is that we need examples of how to behave correctly in this very-out-of-distribution regime, but LLMs simply rely on the generalisation ability of transformer networks. Our prior should be that if you don't provide examples of correct outputs within some region of the input space to your function fitting algorithm, you don't expect the algorithm to yield correct predictions in that region.
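A rough numerical check of the rates in the derivation above, on two made-up 2-state Markov chains. Everything here is an illustrative assumption: the matrices `P` and `Q`, the helper names, and the use of per-step (rate) versions of the entropy and KL divergence weighted by the stationary distribution. It samples long sequences and compares the average log-probability under $p$ with $-H_p(X)$ and $-(D_{KL}(q \| p) + H_q(X))$.

```python
import math
import random

# Hypothetical 2-state chains: T[i][j] is the probability of moving from i to j.
P = [[0.9, 0.1], [0.2, 0.8]]  # stand-in for the "true" process p
Q = [[0.8, 0.2], [0.3, 0.7]]  # stand-in for the imitating model q

def stationary(T):
    """Stationary distribution of a 2-state chain."""
    a, b = T[0][1], T[1][0]
    return [b / (a + b), a / (a + b)]

def entropy_rate(T):
    """Per-step entropy of the chain, in nats."""
    mu = stationary(T)
    return -sum(mu[i] * T[i][j] * math.log(T[i][j])
                for i in range(2) for j in range(2))

def kl_rate(Qm, Pm):
    """Per-step D_KL(q || p), weighted by q's stationary distribution."""
    mu = stationary(Qm)
    return sum(mu[i] * Qm[i][j] * math.log(Qm[i][j] / Pm[i][j])
               for i in range(2) for j in range(2))

def mean_log_prob(sampler, scorer, n, rng):
    """Sample n transitions from `sampler`; return the average log-probability under `scorer`."""
    state, total = 0, 0.0
    for _ in range(n):
        nxt = 0 if rng.random() < sampler[state][0] else 1
        total += math.log(scorer[state][nxt])
        state = nxt
    return total / n

rng = random.Random(0)
n = 200_000
typical_p = mean_log_prob(P, P, n, rng)      # should approach -H_p
model_under_p = mean_log_prob(Q, P, n, rng)  # should approach -(D_KL(q||p) + H_q)
tau = kl_rate(Q, P) + entropy_rate(Q) - entropy_rate(P)

print(typical_p, -entropy_rate(P))
print(model_under_p, -(kl_rate(Q, P) + entropy_rate(Q)))
print(tau)  # per-step gap; the probability ratio then scales like exp(-n * tau)
```

For these particular chains the gap `tau` comes out positive, in line with the comment; the code does not by itself show that this holds for every pair of chains.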

Replies from: rotatingpaguro

↑ comment by rotatingpaguro · 2023-04-15T20:28:46.816Z · LW(p) · GW(p)

Thanks for this accurate description, I only had an intuitive hunch of this phenomenon.

The point where I start disagreeing with you is when you get from (2) (getting out of distribution) to what I'll number (3):

[...] the further out-of-distribution it has to be able to generalise in order for it to produce coherent and true text.

I disagree that "out-of-distribution" as you proved it implies, barring "generalization", not producing "coherent and true text" in the empirical sense we care about, which is the implication I read in the sentence "autoregressive LLMs are doomed".

The fact that the ratio of joint probabilities of variables goes down like $e^{-\lambda n}$ in some context is more general than pairs of Markov chains. It's almost just because we are considering $n$ variables: if the probability mass is concentrated for both distributions, that exponential factor is combinatorial.

We do not care, however, about getting an almost identical[1] long sequence of actions: we care about the actions making sense, which, in the Markov chain formalism, corresponds to how good the transition probability is. If the live chain of thoughts in a human, and the patterns connecting them, can be encoded in text within some length, then as your context window crosses that length, the improvement of $q(x_{n+1} | x_n)$ w.r.t. $p(x_{n+1} | x_n)$ is the metric to look at.

In metaphorical terms, assuming you were to imitate only one person, you would not aim to make a perfect clone of them, but to make a new, different, individual, whose destiny reserves other deeds, yet still as a human.

The relevant questions then are:

A) Are humans' thought patterns encoded in their written production? If they were, I'd strongly expect that a finite, somewhat short context window was sufficient to represent them.

B) If (A) is false, could it still be the case that a form of general intelligence is inferable from text? To be clear, with "inferable" I'm also requiring a feasibly short inferential distance from general priors.

[1] By "identical" I roughly mean "identical to one allowed sequence in the original process", since we are talking about non-deterministic processes. This also applies to the cloning metaphor later.

Replies from: Razied, Razied

↑ comment by Razied · 2023-04-15T21:22:37.437Z · LW(p) · GW(p)

Oh, you might also be interested to know that I retried my chess-playing example with GPT-4 instead of GPT-3.5. GPT-4 performed truly admirably: it played good legal moves for significantly longer than GPT-3.5, but at some point it still gets confused about the board position. Here is the game I played, where Stockfish plays White and GPT-4 plays Black:

d4 Nf6 c4 e6 Nf3 d5 e3 Be7 h3 O-O Nbd2 Nc6 b3 b6 Bb2 Bb7 cxd5 Nxd5 Bc4 Na5 Be2 c5 O-O Rc8 Bd3 h6 Ne5 Nc6 Nxc6 Bxc6 Qg4 Nf6 Qe2 Qd7 Nf3 Rfd8 Ne5 Qb7 dxc5 Bxc5 Ba6 Qa8 Bxc8 Rxc8 Nxc6 Qxc6 Rfd1 Rcd8 Rxd8+

GPT-4's output here was this:

In the current board state, the d8-rook has just been captured by White's d1-rook with a check on the black king (Rxd8+). The black king is in check and is forced to respond to the threat. With the d8-rook gone, there is an open file for Black's other rook on c8. White has a slightly better material advantage, but the position is still relatively balanced.
Legal moves for Black in this situation are:
Kxd8 - Moving the king to d8 to capture the attacking rook. This move allows the king to participate in the game and capture material, but it could also expose the king to potential threats.
Qxd8 - Capturing the rook with the queen. This move helps maintain material balance, but it brings the black queen away from the center of the board, which might limit her mobility.
Rxd8 - Capturing the rook with the other rook. This move helps maintain material balance and keeps the black queen in a central position, but it also isolates the c8-rook.
Based on these considerations, I will play Rxd8 as AlphaZero.
AlphaZero: Rxd8
The move Rxd8 is legal as it involves the black rook capturing the white rook that has just captured the d8-rook while giving check. This move removes the check on the black king and maintains material balance in the game.

All 3 moves it suggests here are illegal, mostly because it has gotten confused about the board state, and the king and rooks are not where it believes they are. But still, it plays legal and decent chess until 30 moves deep into a game...

The whole prompt I gave it to get that answer was this:

You will play the role of AlphaZero, a superhuman AI chess player, and I will play moves from Stockfish, another superhuman AI chess player, we will take turns playing chess moves in algebraic notation. At each turn I will write out the complete sequence of moves in the game so far.
Before outputting a move, it's very important that you write one paragraph analysing the current board state. Then write out 3 good legal moves for the current board state, with an explanation for the reasoning behind each of them. Only output your new move after writing this analysis, not before. After writing your move, write a paragraph figuring out if the move is legal or not. If the move you wrote is illegal, choose another move that is legal.
Stockfish: Rxd8+
game moves: d4 Nf6 c4 e6 Nf3 d5 e3 Be7 h3 O-O Nbd2 Nc6 b3 b6 Bb2 Bb7 cxd5 Nxd5 Bc4 Na5 Be2 c5 O-O Rc8 Bd3 h6 Ne5 Nc6 Nxc6 Bxc6 Qg4 Nf6 Qe2 Qd7 Nf3 Rfd8 Ne5 Qb7 dxc5 Bxc5 Ba6 Qa8 Bxc8 Rxc8 Nxc6 Qxc6 Rfd1 Rcd8 Rxd8+

Replies from: Nanda Ale

↑ comment by Nanda Ale · 2023-04-17T13:36:28.259Z · LW(p) · GW(p)

GPT-4 indeed doesn't need too much help.

I was curious if even the little ChatGPT Turbo, the worst one, could not forget a chess position just 5 paragraphs into an analysis. I tried to finagle some combination of extra prompts to make it at least somewhat consistent, it was not trivial. Ran into some really bizarre quirks with Turbo. For example (part of a longer prompt, but this is the only changed text):

9 times of 10 this got a wrong answer:
Rank 8: 3 empty squares on a8 b8 c8, then a white rook R on d8, ...
Where is the white rook?

6 times of 10 this got a right answer:
Rank 8: three empty squares, then a white rook R on d8, ...
Where is the white rook?

Just removing the squares a8,b8,c8 and using the word 'three' instead of '3' made a big difference. If I had to guess it's because in the huge training data of chess text conversations, it's way more common to list the specific position of a piece than an empty square. So there's some contamination between coordinates being specific, and the space being occupied by a piece.
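For anyone who wants to reproduce this kind of prompt, here is a minimal sketch of how a rank description in the second style could be generated from one rank of a FEN piece-placement field. The function `describe_rank`, the exact wording it emits, and the example rank `3R2k1` are my own assumptions, not the prompt text actually used above.

```python
NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight"]
PIECE_NAMES = {"k": "king", "q": "queen", "r": "rook",
               "b": "bishop", "n": "knight", "p": "pawn"}

def describe_rank(fen_rank, rank_number):
    """Describe one rank of a FEN piece-placement field in words."""
    parts, file_index = [], 0
    for ch in fen_rank:
        if ch.isdigit():
            # A digit in FEN means that many consecutive empty squares.
            count = int(ch)
            noun = "square" if count == 1 else "squares"
            parts.append(f"{NUMBER_WORDS[count]} empty {noun}")
            file_index += count
        else:
            # Uppercase letters are white pieces, lowercase are black.
            colour = "white" if ch.isupper() else "black"
            square = "abcdefgh"[file_index] + str(rank_number)
            parts.append(f"a {colour} {PIECE_NAMES[ch.lower()]} {ch} on {square}")
            file_index += 1
    return f"Rank {rank_number}: " + ", then ".join(parts)

# Hypothetical rank, chosen only so the output starts like the prompt above.
print(describe_rank("3R2k1", 8))
```

Empty squares are deliberately described without coordinates and with the count spelled out as a word, matching the variant that worked better in the experiment.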

But this didn't stop Turbo from blundering like crazy, even when reprinting the whole position for every move. Just basic stuff like trying to move the king when it's not on that square, as in your game. I didn't want to use a chess library to check valid moves, then ask it to try again -- that felt like it was against the spirit of the thing. A reasonable middle ground might be to ask ChatGPT at 'runtime' for JavaScript code to check valid moves -- just bootstrap itself into being consistent. But I eventually hit upon a framing of the task that had some positive trend when run repeatedly. So in theory, run it 100 times over to get increasing accuracy. (Don't try that yourself btw, I just found out even 'unlimited' ChatGPT Turbo on a Plus plan has its limits...)

This was the rough framing that pushed it over the edge:

This is a proposed chess move from Dyslexic Chess Player. Dyslexic Chess Player has poor eyesight and dyslexia, and often gets confused and misreads the chess board, or mixes up chess positions when writing down the numbers and letters. Your goal is to be a proofreader of this proposed move. There is a very high chance of errors, Dyslexic Chess Player makes mistakes 75% of the time.

I may post a longer example or a demo if I can make time but those were the most interesting bits, the rest is mostly plumbing and patience. I didn't even get around to experimenting with recursive prompts to make it play stronger, since it was having so much trouble late game just picking a square to move that contained its own piece.

↑ comment by Razied · 2023-04-15T21:01:10.664Z · LW(p) · GW(p)

I think we're mostly agreeing, I've gotten less and less convinced that "LLMs are doomed" in the last few days, that's a much stronger statement than I actually believe right now. What I mean by "generalisation" is basically what you mean by learning the pattern of human thought from the text, in my mind producing good outputs on inputs with low probability is by definition "generalisation".

I agree that none of these arguments strictly prevent this "learning the structure of human thoughts" from happening, but I would still be somewhat surprised if it did, since neural networks in other contexts like vision and robotics don't seem to generalise this far, but maybe text really is special, as the past few months seem to indicate.

comment by avturchin · 2023-04-10T10:09:16.044Z · LW(p) · GW(p)

I think that one solution is to use an LLM to generate only short answers, where the probability of error is small, and then use these answers as prompts to generate more short answers. This is how the various Auto-GPT variants work. A short answer is a plan for solving a task.

We can also use an LLM to check previous short answers for correctness.
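A minimal sketch of that generate-then-check loop, with loud caveats: `solve_in_short_steps`, `propose_step` and `check_step` are hypothetical names, the two callables stand in for LLM calls, and this is not a description of how Auto-GPT is actually implemented. The stubs at the bottom replace the model with a deterministic fake so the control flow can be followed.

```python
def solve_in_short_steps(task, propose_step, check_step, max_steps=10, max_retries=3):
    """Build a long answer from short steps, re-asking whenever a check fails.

    propose_step(task, steps_so_far) -> str, or None when the task is done.
    check_step(task, steps_so_far, candidate) -> bool.
    Both callables are hypothetical wrappers around LLM calls.
    """
    steps = []
    for _ in range(max_steps):
        for _attempt in range(max_retries):
            candidate = propose_step(task, steps)
            if candidate is None:
                return steps
            if check_step(task, steps, candidate):
                steps.append(candidate)
                break
        else:
            # Every retry failed the check: stop rather than build on an error.
            raise RuntimeError("no acceptable step found")
    return steps

# Stub "model": proposes a wrong step on odd-numbered calls, the right one on even calls.
attempts = {"n": 0}

def fake_propose(task, steps):
    if len(steps) == 3:
        return None
    attempts["n"] += 1
    return "oops" if attempts["n"] % 2 == 1 else str(len(steps) + 1)

def fake_check(task, steps, candidate):
    return candidate == str(len(steps) + 1)

result = solve_in_short_steps("count to 3", fake_propose, fake_check)
print(result)
```

The design choice is that a rejected step is never appended, so later steps are conditioned only on text that passed the check; the scheme is only as good as the checker.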

More generally, LeCun's argument can be applied to other generative processes like evolution or science, but we know that there are error-correcting mechanisms in them, like natural selection and experiment.

comment by phelps-sg · 2023-04-10T12:23:42.097Z · LW(p) · GW(p)

This hypothesis is equivalent to stating that if the Language of Thought Hypothesis is true, and also if natural language is very close to the LoT, then if you can encode a lossy compression of natural language you are also encoding a lossy compression of the language of thought, and therefore you have obtained an approximation of thought itself. As such, the argument hinges on the Language of Thought hypothesis, which is still an open question for cognitive science. Conversely if it is empirically observed that LLMs are indeed able to reason despite having "only" been trained on language data (again, ongoing research), then that could be considered as strong evidence in favour of LoT.

comment by Jim Bao (jim-bao) · 2024-04-06T17:15:38.132Z · LW(p) · GW(p)

Unscathed, you try playing yourself against the master. You lose again, again, and again. Gino silently makes his moves and swiftly corners you each time. In a while, you manage not to lose right away, but your defeat still comes pretty quickly, and your progress in defeat-time is biblically slow. It seems like you would need to play an incredibly large number of matches to get to a decent level.

This is reinforcement learning, and it worked out spectacularly for AlphaGo (having to operate in a much greater search space than chess, BTW). In more constrained problem spaces, which in my mind include most of "knowledge work" / desk jobs, the amount of labeled data needed seems to be on the order of hundreds of thousands.

comment by skybrian · 2023-04-16T00:32:15.530Z · LW(p) · GW(p)

I'm wondering what "doom" is supposed to mean here. It seems a bit odd to think that longer context windows will make things worse. More likely, LeCun meant that things won't improve enough? (Problems we see now don't get fixed with longer context windows.)

So then, "doom" is a hyperbolic way of saying that other kinds of machine learning will eventually win, because LLMs don't improve enough.

Also, there's an assumption that longer sequences are exponentially more complicated and I don't think that's true for human-generated text? As documents grow longer, they do get more complex, but they tend to become more modular, where each section depends less on what comes before it. If long-range dependencies grew exponentially then we wouldn't understand them or be able to write them.

comment by Razied · 2023-04-10T12:38:38.446Z · LW(p) · GW(p)

More in general, yes. As an extreme case, imagine an expert operating in some rich domain, whose actions entailed building a Turing machine that implemented the expert itself and running it. An agent faithfully imitating the expert would get a functional expert behavior after a single learning session. To bend the chess allegory, if you were in some alien conceptual world where chess was Turing-complete and chess grandmasters were short Turing machines if written in chess, you might be able to become a chess grandmaster just by observing a grandmaster's actual play. This weird scenario violates the assumption of the "probability of error" argument, that the expert mind could probably not be inferred from its actions.

Ah, very good point, but the crucial fact about those environments that allows them to break the behaviour cloning "curse" is that the agent's implementation is part of the environment, not merely that the environment be expressive enough. It's not enough that the agent build the Turing machine that implements the expert, it needs to furthermore modify itself to behave like that Turing machine, otherwise you just have some Turing machine in the environment doing its own thing, and the agent still can't behave like the expert. Another requirement is that the environment be free from large-ish random perturbations: if the expert has to subtly adjust its behaviour to account for perturbations while it's building the Turing machine, the agent won't be able to learn a long sequence of moves from copying a single episode of the expert, and we get the same $(1-e)^n$ problem.
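
To make the $(1-e)^n$ problem concrete, here is a minimal sketch. It assumes independent errors with a constant per-token rate $e$, which is the simplification the argument relies on; the function names and the example numbers are invented for illustration.

```python
import math


def coherence_probability(per_token_error: float, num_tokens: int) -> float:
    """Probability of making no error over num_tokens steps.

    Assumes each token is wrong independently with the same probability,
    so the result is simply (1 - e)^n.
    """
    if not 0.0 <= per_token_error <= 1.0:
        raise ValueError("per_token_error must be in [0, 1]")
    return (1.0 - per_token_error) ** num_tokens


def tokens_until(per_token_error: float, target: float = 0.5) -> int:
    """Smallest n such that (1 - e)^n drops to target or below."""
    if not 0.0 < per_token_error < 1.0:
        raise ValueError("per_token_error must be in (0, 1)")
    return math.ceil(math.log(target) / math.log(1.0 - per_token_error))


if __name__ == "__main__":
    # Hypothetical error rates, not measurements of any real model.
    for e in (0.1, 0.01, 0.001):
        print(e, coherence_probability(e, 100), tokens_until(e))
```

Under these assumptions, a 1% per-token error rate leaves roughly a 37% chance of a flawless 100-token sequence, and the chance falls below one half after 69 tokens.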

I suppose that, now that GPT is connected to the internet, its environment is technically expressive enough that it could modify its own implementation in this way. That is, if it ever gets to the point where it manages to stay coherent long enough (or has humans extend its coherent time horizon by re-rolling prompts) to modify itself in this way and literally implement this pathological case.

This argument morphs to LLMs in the following way: human language is rich. It is flexible, for you can express in principle any thought in it. It is recursive, for you can talk about the world, yourself within the world, and the language within yourself within the world, and so on. Intuitively, it can be that language contains the schemes of human thought, not just as that abstract thing which produced the stream of language, but within the language itself, even though we did not lay down explicitly the algorithm of a human in words. If imitation training can find associations that somehow tap into this recursiveness, it could be that optimizing the imitation of a relatively short amount of human text was sufficient to crack humans.

It seems to me that this is arguing that human language is a special sort of environment where learning to behave like an expert lets you generalise much further out of distribution than for generic environments. That might be the case, and I agree that there's something special about human language that wouldn't be there if we were talking about robots imitating human walking gait for instance. I'm frankly unsure how to think about this, maybe we get unlucky and somehow everything generalises much further than it has any right to do.

I've also been thinking about the failure modes of this whole argument since yesterday, and I think a crucial point is that different abilities of GPT will have different coherent timescales, and I expect some abilities never to become incoherent. For instance, grammar will probably never falter for GPT-4 and beyond, no matter the sequence length, because the internet contains enough small grammar deviations followed by correct text that the model can learn to correct them and ignore small mistakes. Importantly, because basically all grammar mistakes exist in the dataset, GPT is unlikely to make a mistake that humans have never made and get itself surprised by an "inhuman mistake".
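
A toy way to see why learned correction changes the picture: track the probability of being "on track" in a two-state model, where each step goes off track with probability `error` and returns to the track with probability `recovery`. This is only a sketch under made-up assumptions (constant rates, two states, hypothetical numbers), not a model of how GPT actually behaves. With `recovery = 0` it reduces to the $(1-e)^n$ decay.

```python
def on_track_probability(error: float, recovery: float, num_tokens: int) -> float:
    """Probability of being on track after num_tokens steps.

    Toy two-state Markov chain:
      on track  -> off track with probability `error`
      off track -> on track  with probability `recovery`
    Starts on track. With recovery == 0 this equals (1 - error)^num_tokens.
    """
    p = 1.0
    for _ in range(num_tokens):
        p = p * (1.0 - error) + (1.0 - p) * recovery
    return p


if __name__ == "__main__":
    # Hypothetical rates chosen only to show the qualitative difference.
    for n in (10, 100, 1000):
        print(n, on_track_probability(0.01, 0.0, n), on_track_probability(0.01, 0.5, n))
```

Without recovery the on-track probability decays to zero as the sequence grows; with any positive recovery rate it instead settles near `recovery / (error + recovery)`, independent of length.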

(If we get unlucky there might be a sequence of abilities, each of which enables the efficient learning of the next, once the coherent timescale gets to infinity. Maybe you can't learn arithmetic until you've learned to eternally generate correct grammar, etc.)

In particular, the LeCun argument predicts that the regions of language where GPT will have the most trouble will be those where "thinking is not done in public", because those will contain very few examples of how to correct perturbations. So maybe it can generate unending coherent unedited forum posts, but scientific papers should be significantly harder.

Replies from: rotatingpaguro

↑ comment by rotatingpaguro · 2023-04-10T14:49:26.802Z · LW(p) · GW(p)

It's not enough that the agent build the Turing machine that implements the expert, it needs to furthermore modify itself to behave like that Turing machine, otherwise you just have some Turing machine in the environment doing its own thing, and the agent still can't behave like the expert.

I don't care if the agent is "really doing the thing himself" or not. I care that the end result is the overall system imitating the expert. Of course my extreme example is in some sense not useful: I'm saying "the expert is already building the agent you want, so you can imitate it to build the agent you want". The point of the example is to show a simple, crisp way in which the proof fails.

So yeah then I don't know how to clearly move from the very hypothetical counterexample to something less hypothetical. To start, I can have the agent "do the work himself" by having the expert run the machine it defined with its own cognition. This is in principle possible in the autoregressive paradigm, since if you consider the stepping function as the agent, it's fed its previous output. However there's some contrivance in having the expert define the machine in the initial sequence, and then running it, in such a way that the learner gets both the definition and the running part from imitation. I don't have a clear picture in my mind. And next I'd have to transfer somehow the intuition to the domain of human language.

I agree overall with the rest of your analysis, in particular with thinking about this in terms of threshold coherence lengths. If somehow the learner needs to infer the expert Turing machine from the actions, the relevant point is indeed how long the specification of such a machine is.

comment by Jim Bao (jim-bao) · 2024-04-06T17:10:36.462Z · LW(p) · GW(p)

Intuitively, it can be that language contains the schemes of human thought, not just as that abstract thing which produced the stream of language, but within the language itself, even though we did not lay down explicitly the algorithm of a human in words. If imitation training can find associations that somehow tap into this recursiveness, it could be that optimizing the imitation of a relatively short amount of human text was sufficient to crack humans.

This is well said. When does acting become indistinguishable from reality? In the human world, we certainly have plenty of examples: movie actors, politicians, fake-it-till-you-make-it entrepreneurs. And more frequently, thinking out loud is something many of us practice, where the spoken words do seem to take on a life of their own in pushing that abstract thing called thinking.

comment by Sven Nilsen (bvssvni) · 2023-04-10T11:17:03.855Z · LW(p) · GW(p)

The error margin is larger than it appears when estimated from tokens, due to combinatorial complexity. There are many paths that the output can take and still produce an acceptable answer, but an LLM needs to stay consistent depending on which path is chosen. The next level is to combine paths consistently, such that the dependence of one path on another is correctly learned. This means you get a latent space where the LLM learns the representation of the world using paths and paths between paths. In other words, it learns the topology of the world and is able to map this topology onto different spaces, which appear as conversations with humans.

Replies from: rotatingpaguro

↑ comment by rotatingpaguro · 2023-04-10T14:55:04.676Z · LW(p) · GW(p)

I'm not sure I understand correctly: you are saying that it's easier for the LLM to stay coherent in the relevant sense than it looks from LeCun's argument, right?

Replies from: bvssvni

↑ comment by Sven Nilsen (bvssvni) · 2023-04-10T18:08:41.020Z · LW(p) · GW(p)

The error margin LeCun used is for independent probabilities, while in an LLM the paths that the output takes become strongly dependent on each other to stay consistent within a path. Once an LLM masters grammar and use of language, it stays consistent within the path. However, you get the same problem, but now repeated on a larger scale.

Think about it as tiling the plane using a building block which you use to create larger building blocks. The larger building block has similar properties but at a larger scale, so errors propagate, but slower.

If you use independent probabilities, then it is easy to make it look as if LLMs are diverging quickly, but they are not doing that in practice. Also, if there is a 99% chance of selecting one token as output, this is not observed by the LLM when predicting the next token. It "observes" its previous token as 100%. There is a chance that the LLM produced the wrong token, but in contexts where the path is predictable, it will learn to output tokens consistently enough to produce coherence.

Humans often think about probabilities as inherently uncertain, because we have not evolved the intuition for nuance to understand probabilities at a deeper level. When an AI outputs actions, probabilities might be thought of as an "interpretation" of the action over possible worlds, while the action itself is a certain outcome in the actual world.
