My model of what is going on with LLMs
post by Cole Wyeth (Amyr) · 2025-02-13T03:43:29.447Z · LW · GW · 35 comments
Epistemic status: You probably already know if you want to read this kind of post, but in case you have not decided: my impression is that people are acting very confused about what we can conclude about scaling LLMs from the evidence, and I believe my mental model cuts through a lot of this confusion - I have tried to rebut what I believe to be misconceptions in a scattershot way, but will attempt to collect the whole picture here. I am a theoretical computer scientist and this is a theory. Soon I want to do some more serious empirical research around it - but be aware that most of my ideas about LLMs have not had the kind of careful, detailed contact with reality that I would like at the time of writing this post. If you're a good engineer (or just think I am dropping the ball somewhere) and are interested in helping dig into this please reach out. This post is not about timelines, though I think it has obvious implications for timelines.
We have seen LLMs scale to impressively general performance. This does not mean they will soon reach human level, because intelligence is not just a knob that needs to get turned further; it comprises qualitatively distinct functions. At this point it is not plausible that we can precisely predict how far we are from unlocking all remaining functions, since doing so will probably require more insights. The natural guess is that the answer is on the scale of decades.
It's important to take a step back and understand the history of how progress in A.I. takes place, following the main line of connectionist algorithms that (in hindsight, back-chaining from the frontier) are load-bearing. This story is relatively old and well-known, but I still need to retell it because I want to make a couple of points clear. First, deep learning has made impressive steps several times over the course of decades. Second, "blind scaling" has contributed substantially but has not been the whole story, conceptual insights piled on top of (and occasionally mostly occluding/obsoleting) each other have been necessary to shift the sorts of things we knew how to train artificial neural nets to do.
Rosenblatt invented the perceptron in 1958. Initially it didn't really take off because of compute, and also ironically because the book "Perceptrons" (published ~10 years later) showed the theoretical limitations of the idea in its nascent form (there weren't enough layers; it turns out adding more layers works, but you have to invent backpropagation).
Apparently[1] enthusiasm didn't really ramp up again until 2012, when AlexNet proved shockingly effective at image classification. AlexNet was a convolutional neural network (CNN), a somewhat more complicated idea than the perceptron but clearly a philosophical descendant. Both relied on supervised learning; a nice clear "right/wrong" signal for every prediction and no feedback loops.
Since then there has been pretty steady progress, in the sense that deep learning occasionally shocks everyone by doing what previously seemed far out of reach for AI. Back in the 20-teens, that was mostly playing hard games like Go and Starcraft II. These relied on reinforcement learning, which is notoriously hard to get working in practice, and conceptually distinct from perceptrons and CNNs - though it still used deep learning for function approximation. My impression is that getting deep learning to work at all on new and harder problems usually required inventing new algorithms based on a combination of theory and intuition - it was not just raw scaling; that was not usually the bottleneck. Once we[2] left the realm of supervised learning, every victory was hard fought.
In the 2020s, it has been LLMs that demonstrated the greatest generality and impressiveness. They are powered mostly by supervised learning with transformers, a new architecture that we apparently kind of just stumbled on (the intuitions for attention don't seem compelling to me). Suddenly AI systems can talk a lot like humans do, solve math problems, AND play a decent game of chess sometimes - all with the same system! There is a lot of noise about AGI being ten - no five - no three - no two years away (one of the better-received and apparently higher-quality examples is Aschenbrenner's "Situational Awareness", which ironically is also a term for something that LLMs may or may not have). Some people even seem to think it's already here.
It is not, because there is one crucial test (yes this is a crux) that LLMs have not passed. They have never done anything important.
They haven't proven any theorems that anyone cares about. They haven't written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, they have made precisely zero novel connections or insights in any area of science[3].
If you model intelligence as a knob that you continuously turn up until it hits and surpasses human intelligence, then this makes no sense.
Clearly LLMs are smarter than humans in some sense: they know more.
But not in some other sense(s): they have done nothing that matters.
I do not know exactly which mental functions are missing from LLMs. I do have a suspicion that these include learning efficiently (that is, in context) and continually (between interactions) and that those two abilities are linked to fundamentally being able to restructure knowledge in a detailed inside-the-box way[4]. Relatedly, they can't plan coherently over long timescales.
The reason for both of these defects is that the training paradigm for LLMs is (myopic) next token prediction, which makes deliberation across tokens essentially impossible - and only a fixed number of compute cycles can be spent on each prediction (Edit: this is wrong/oversimplified in the sense that the residual streams for earlier positions in the sequence are available at all later positions, though I believe the training method does not effectively incentivize multiple steps of deliberation about the high-level query, see comment from @hmys [LW · GW] for further discussion). This is not a trivial problem. The impressive performance we have obtained is because supervised (in this case technically "self-supervised") learning is much easier than e.g. reinforcement learning and other paradigms that naturally learn planning policies. We do not actually know how to overcome this barrier.
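To make the objective concrete, here is a minimal sketch of self-supervised next-token training (assuming PyTorch, with a toy stand-in where a real transformer would go): each position gets one fixed-depth forward pass, and the only training signal is the log-loss on the immediately following token - nothing in the loss explicitly rewards multi-step deliberation about the overall query.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a causal language model: token ids -> logits over the vocabulary.
# (A real transformer would go here; only the shapes matter for the objective.)
vocab_size, seq_len, batch = 100, 16, 4
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)

tokens = torch.randint(0, vocab_size, (batch, seq_len))
logits = model(tokens)                          # one fixed-depth pass per position
preds, targets = logits[:, :-1], tokens[:, 1:]  # predict token t+1 from position t
loss = F.cross_entropy(preds.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # the entire training signal: next-token log-loss
```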
At this point it's necessary to spend some time addressing the objections that I anticipate, starting with the most trivial.
My position is NOT that LLMs are "stochastic parrots." I suspect they are doing something akin to Solomonoff induction with a strong inductive bias in context - basically, they interpolate, pattern match, and also (to some extent) successfully discover underlying rules in the service of generalization. My mental picture of the situation is that the knowledge we have fed into LLMs forms a fairly dense but measure ~0 "net." Nearly every query misses the net but is usually close-ish to some strand, and LLMs use the cognitive algorithms reflected in that strand to generalize. I am fascinated that this is possible.
I am aware that perfect sequence prediction would pretty much require solving all problems. For instance, the most likely answer to a complicated never-seen-before question is probably the right answer, so if LLMs were perfectly calibrated through their training process, they would basically be omniscient oracles, which could easily be bootstrapped into powerful agents - but actually, I would guess that other things break down far before that point. The kind of cognitive algorithm that could approach perfection on sequence prediction would have to solve lots of problems of deliberation, which would in practice require agency. However, deep learning is not perfect at optimizing its objective. In a sense, this entire line of inquiry is misguided - in fact, it almost works in the opposite direction it is usually offered in: because perfect prediction would require highly general agency, and because we do not know how to teach AI systems highly general agency, we shouldn't expect to get anywhere close to perfect prediction. Certainly without access to a gear-level model of why deep learning works so well one could imagine transformers just getting better and better at prediction without limit, and this has continued for a surprisingly long time, but it seems to be stalling (where is the GPT-5 level series of models?) and this is exactly what we should expect from history and our (imperfect, heuristic) domain knowledge.
Now, there is a lot of optimism about giving LLMs more compute at inference time to overcome their limitations. Sometimes the term "unhobbling" is thrown around. The name is misleading. It's not like we have hobbled LLMs by robbing them of long-term memory, and if we just do the common sense thing and give them a scratchpad they'll suddenly be AGI. That is not how these things work - it's the kind of naive thinking that led symbolic AI astray. The implicit assumption is that cognition is a sequence of steps of symbol manipulation, and each thought can be neatly translated into another symbol which can then be carried forward at the next step without further context. Now, in principle something like this must be theoretically possible (a Turing machine is just a symbol manipulation machine), but a level of abstraction or so is getting dropped here - usefully translating a rich cognitive process into discrete symbols is a hard bottleneck in practice. We don't actually know how to do it and there is no reason to expect it will "just work." In fact I strongly suspect that it can't.
Another objection, which I take somewhat-but-not-very seriously, is that even if LLMs remain limited they will rapidly accelerate AI research to the extent that AGI is still near. It's not clear to me that this is a reasonable expectation. Certainly LLMs should be useful tools for coding, but perhaps not in a qualitatively different way than the internet is a useful tool for coding, and the internet didn't rapidly set off a singularity in coding speed. In fact, I feel that this argument comes from a type of motivated reasoning - one feels convinced that the singularity must be near, sees some limitations with LLMs that aren't obviously tractable, and the idea of LLMs accelerating AI progress "comes to the rescue." In other words, is that your true objection?
Okay, so now I feel like I have to counter not an argument but a sort of vibe. You interact with a chatbot and you feel like "we must be close, I am talking to this intelligent thing which reminds me more of humans than computers." I get that. I think that vibe makes people invent reasons that it has to work - ah but if we just let it use a chain of thought, if we just use some RLHF, if we just etc.
Sure, I cannot prove to you that none of those things will work. Also, no one can prove that deep learning will work, generally - it's pretty much just gradient descent on an objective that is not convex (!) - and yet it does work. In other words, we don't know exactly what is going to happen next.
However, I am asking you not to mistake that vibe for a strong argument, because it is not. On the outside view (and I believe also on the inside view) the most plausible near future looks something like the past. Some more conceptual insights will be necessary. They will accrue over the course of decades with occasional shocks, but each time that we solve something previously believed out of reach, it will turn out that human level generality is even further out of reach after all. Intelligence is not atomic. It has component functions, and we haven't got all of them yet. We don't even know what they are - at least, not the ones we need in practice for agents that can actually be built inside our universe[5].
Also, I believe the writing is on the wall at this point. It was reasonable to think that maybe transformers would just work and soon when we were racing through GPT-2, GPT-3, to GPT-4. We just aren't in that situation anymore, and we must propagate that update fully through our models and observe that the remaining reasons to expect AGI soon (e.g. "maybe all of human intelligence is just chain of thoughts") are not strong.
Of course, whenever we actually invent human level AGI the course of history will be disrupted drastically. I am only saying that point may well be pretty far off still, and I do not think it is reasonable to expect it inside a few years.
- ^
I'd be interested to learn more about the trajectory of progress in connectionist methods during the intervening years?
- ^
Below "we" always means "humanity" or "AI researchers." Usually I have nothing to do with the people directly involved.
- ^
Of course, narrow systems like AlphaFold have continued to solve narrow problems - this just doesn't have much to do with AGI. Small teams of really smart people with computers can solve hard problems - it is nice and not surprising.
- ^
This is a phrase I have borrowed in its application as a positive ability from Carl Sturtivant. It's meant to capture about the same vibe as "gears level."
- ^
That is, barring AIXI.
35 comments
Comments sorted by top scores.
comment by hmys (the-cactus) · 2025-02-14T14:57:57.196Z · LW(p) · GW(p)
Great post. I agree with the "general picture"; however, the proposed argument for why LLMs have some of these limitations seems to me clearly wrong.
The reason for both of these defects is that the training paradigm for LLMs is (myopic) next token prediction, which makes deliberation across tokens essentially impossible - and only a fixed number of compute cycles can be spent on each prediction. This is not a trivial problem. The impressive performance we have obtained is because supervised (in this case technically "self-supervised") learning is much easier than e.g. reinforcement learning and other paradigms that naturally learn planning policies.
Transformers form internal representations at each token position, and gradients flow backwards in time because of attention.
This means the internal representation a model forms at token A is incentivized to be useful for predicting the token after A, but also tokens 100 steps later than A. So while LLMs are technically myopic wrt the exact token they write (sampling discretizes and destroys gradients), they are NOT incentivized to be myopic wrt the internal representations they form, which is clearly the important part in my view (the vast, vast majority of the information in a transformer's processing lies there, and this information is enough to determine which token it ends up writing), even though they are trained on a myopic next-token objective.
For example, a simple LLM transformer might look like this (left to right is token position; upwards is movement through the transformer's layers at each position. Assume A0 was a starting token, and B0-E0 were sampled autoregressively):
A2 -> B2 -> C2 -> D2 -> E2
^ ^ ^ ^ ^
A1 -> B1 -> C1 -> D1 -> E1
^ ^ ^ ^ ^
A0 -> B0 -> C0 -> D0 -> E0
In this picture, there is no gradient path from A1 to E2 that goes through B0, the immediate next token that A1 contributes to writing. But A1 has direct contributions to B2, C2, D2 and E2 because of attention, and A1 being useful for helping B2, C2, etc. make their predictions will create lower loss. So gradient descent will heavily incentivize A1 containing a representation that's useful for helping make accurate predictions arbitrarily far into the future (well, at least a million tokens into the future, or however big the context window is).
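Here is a minimal sketch of that point (assuming PyTorch; the single attention layer and toy dimensions are purely illustrative): with a causal mask, a loss computed only at the last position still produces a nonzero gradient on the representation at the first position, so early representations are trained to serve predictions made many steps later.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, seq_len = 8, 5  # toy dimensions

# One causal self-attention layer, standing in for the A1..E1 -> A2..E2 step above.
attn = nn.MultiheadAttention(embed_dim=d, num_heads=1, batch_first=True)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

x = torch.randn(1, seq_len, d, requires_grad=True)  # layer-1 representations A1..E1
out, _ = attn(x, x, x, attn_mask=causal_mask)       # layer-2 representations A2..E2

# Pretend the loss comes only from the prediction at the last position (E2).
loss = out[0, -1].sum()
loss.backward()

print(x.grad[0, 0])  # nonzero: A1 is pushed to be useful for a prediction 4 steps later
# By contrast, a loss at the first position would send no gradient to later positions
# (that direction is cut off by the causal mask).
```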
Overall, I think it's a big mistake to treat LLMs' training objective being myopic as saying much about how myopic LLMs will be after they've been trained, or how myopic their internals are.
↑ comment by Cole Wyeth (Amyr) · 2025-02-14T16:52:27.688Z · LW(p) · GW(p)
You're totally right - I knew all of the things that should have let me reach this conclusion, but I was still thinking about the residual stream in the upwards direction on your diagram as doing all of the work from scratch, just sort of glancing back at previous tokens through attention, when it can also look at all the previous residual streams.
This does invalidate a fairly load-bearing part of my model, in that I now see that LLMs have a meaningful ability to "consider" a sequence in greater and greater depth as its length grows - so they should be able to fit more thinking in when the context is long (without hacks like chain of thoughts etc.).
Other parts of my mental model still hold up though. While this proves that LLMs should be better at figuring out how to predict sequences than I thought previously (possibly even inventing sophisticated mental representations of the sequences), I still don't expect methods of deliberation based on sampling long chains of reasoning from an LLM to work - that isn't directly tied to sequence prediction accuracy; it would require a type of goal-directed thinking which I suspect we do not know how to train effectively. That is, the additional thinking time is still spent on token prediction, potentially of later tokens, but not on choosing to produce tokens that are useful for further reasoning about the given task (i.e., the actual user query), except insofar as that is useful for next-token prediction. RLHF changes the story, but as discussed I do not expect it to be a silver bullet.
comment by AnthonyC · 2025-02-13T12:30:45.132Z · LW(p) · GW(p)
Great post. I think the central claim is plausible, and would very much like to find out I'm in a world where AGI is decades away instead of years. We might be ready by then.
If I am reading this correctly, there are two specific tests you mention:
1) GPT-5 level models come out on schedule (as @Julian Bradshaw [LW · GW] noted, we are still well within the expected timeframe based on trends to this point)
2) LLMs or agents built on LLMs do something "important" in some field of science, math, or writing
I would add on test 2 that neither have almost all humans. We don't have a clear explanation for why some humans have much more of this capability than others, and yet all the human brains are running on similar hardware and software. This suggests the number of additional insights needed to boost us from "can't do novel important things" to "can do" may be as small as zero, though I don't think it is actually zero. In any case, I am hesitant to embrace a test for AGI that a large majority of humans fail.
In practical terms, suppose this summer OpenAI releases GPT-5-o4, and by winter it's the lead author on a theoretical physics or pure math paper (or at least the main contributor - legal considerations about personhood and IP might stop people from calling AI the author). How would that affect your thinking?
↑ comment by Cole Wyeth (Amyr) · 2025-02-13T15:30:52.873Z · LW(p) · GW(p)
I think the central claim is plausible, and would very much like to find out I'm in a world where AGI is decades away instead of years. We might be ready by then.
Me too!
If I am reading this correctly, there are two specific tests you mention:
1) GPT-5 level models come out on schedule (as @Julian Bradshaw [LW · GW] noted, we are still well within the expected timeframe based on trends to this point)
See my response to his comment - I think it's not so clear that projecting those trends invalidates my model, but it really depends on whether GPT-5 is actually a qualitative upgrade comparable to the previous steps, which we do not know yet.
2) LLMs or agents built on LLMs do something "important" in some field of science, math, or writing
I would add on test 2 that neither have almost all humans. We don't have a clear explanation for why some humans have much more of this capability than others, and yet all the human brains are running on similar hardware and software. This suggests the number of additional insights needed to boost us from "can't do novel important things" to "can do" may be as small as zero, though I don't think it is actually zero. In any case, I am hesitant to embrace a test for AGI that a large majority of humans fail.
This seems about right, but there are two points to keep in mind.
a) It is more surprising that LLMs can't do anything important because their knowledge far surpasses any humans, which indicates that there is some kind of cognitive function qualitatively missing.
b) I think that about the bottom 30% (very rough estimate) of humans in developed nations are essentially un-agentic. The kind of major discoveries and creations I pointed to mostly come from the top 1%. However, I think that in the middle of that range there are still plenty of people capable of knowledge work. I don't see LLMs managing the sort of project that would take a mediocre mid-level employee a week or month. So there's a gap here, even between LLMs and ordinary humans. I am not as certain about this as I am about the stronger test, but it lines up with my experience with DeepResearch - I asked it for a literature review of my field and it had pretty serious problems that would have made it unusable, despite requiring ~no knowledge creation (I can email you an annotated copy if you're interested).
In practical terms, suppose this summer OpenAI releases GPT-5-o4, and by winter it's the lead author on a theoretical physics or pure math paper (or at least the main contributor - legal considerations about personhood and IP might stop people from calling AI the author). How would that affect your thinking?
Assuming the results of the paper are true (everyone would check) and at least somewhat novel/interesting (~sufficient for the journal to be credible) this would completely change my mind. As I said, it is a crux.
↑ comment by AnthonyC · 2025-02-13T16:24:42.746Z · LW(p) · GW(p)
Fair enough, thanks.
My own understanding is that other than maybe writing code, no one has actually given LLMs the kind of training a talented human gets towards becoming the kind of person capable of performing novel and useful intellectual work. An LLM has a lot of knowledge, but knowledge isn't what makes useful and novel intellectual work achievable. A non-reasoning model gives you the equivalent of a top-of-mind answer. A reasoning model with a large context window and chain of thought can do better, and solve more complex problems, but still mostly those within the limits of a newly hired college or grad student.
I genuinely don't know whether an LLM with proper training can do novel intellectual work at current capabilities levels. To find out in a way I'd find convincing would take someone giving it the hundreds of thousands of dollars and subjective years' worth of guidance and feedback and iteration that humans get. And really, you'd have to do this at least hundreds of times, for different fields and with different pedagogical methods, to even slightly satisfactorily demonstrate a "no," because 1) most humans empirically fail at this, and 2) those that succeed don't all do so in the same field or by the same path.
comment by Julian Bradshaw · 2025-02-13T09:22:45.250Z · LW(p) · GW(p)
soon when we were racing through GPT-2, GPT-3, to GPT-4. We just aren't in that situation anymore
I don't think this is right.
GPT-1: 11 June 2018
GPT-2: 14 February 2019 (248 days later)
GPT-3: 28 May 2020 (469 days later)
GPT-4: 14 March 2023 (1,020 days later)
Basically, the wait until the next model doubled every time. By that pattern, GPT-5 ought to come around September 20, 2028, but Altman said today it'll be out within months. (And frankly, I think o1 qualifies as a sufficiently improved successor model, and that was released December 5, 2024 - or really September 12, 2024, if you count o1-preview; either way, a shorter wait than the GPT-3 to GPT-4 gap.)
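(A quick sketch for anyone who wants to check the arithmetic - the exact GPT-5 projection depends on how you fit the trend, but naively doubling the last gap again lands in the autumn of 2028.)

```python
from datetime import date, timedelta

releases = {
    "GPT-1": date(2018, 6, 11),
    "GPT-2": date(2019, 2, 14),
    "GPT-3": date(2020, 5, 28),
    "GPT-4": date(2023, 3, 14),
}

names = list(releases)
gaps = [(releases[b] - releases[a]).days for a, b in zip(names, names[1:])]
print(gaps)  # [248, 469, 1020] -- each wait roughly doubles

# Naive extrapolation: one more doubling of the wait after GPT-4.
print(releases["GPT-4"] + timedelta(days=2 * gaps[-1]))  # lands in autumn 2028
```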
↑ comment by Thane Ruthenis · 2025-02-13T18:46:29.863Z · LW(p) · GW(p)
GPT-5 ought to come around September 20, 2028, but Altman said today it'll be out within months
I don't think what he said meant what you think it meant. Exact words:
In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3
The "GPT-5" he's talking about is not the next generation of GPT-4, not an even bigger pretrained LLM. It is some wrapper over GPT-4.5/Orion, their reasoning models, and their agent models. My interpretation is that "GPT-5" the product and GPT-5 the hypothetical 100x-bigger GPT model, are two completely different things.
I assume they're doing these naming shenanigans specifically to confuse people and create the illusion of continued rapid progress. This "GPT-5" is probably going to look pretty impressive, especially to people in the free tier, who've only been familiar with e. g. GPT-4o so far.
Anyway, I think the actual reason the proper GPT-5 – that is, an LLM 100+ times as big as GPT-4 – isn't out yet is because the datacenters powerful enough to train it are only just now coming online [LW · GW]. It'll probably be produced in 2026.
↑ comment by Julian Bradshaw · 2025-02-14T00:19:23.507Z · LW(p) · GW(p)
It's unclear exactly what the product GPT-5 will be, but according to OpenAI's Chief Product Officer today it's not merely a router between GPT-4.5/o3.
swyx: appreciate the update!! in gpt5, are gpt* and o* still separate models under the hood and you are making a model router? or are they going to be unified in some more substantive way?
Kevin Weil: Unified 👍
↑ comment by Thane Ruthenis · 2025-02-14T00:31:51.789Z · LW(p) · GW(p)
Fair enough, I suppose calling it an outright wrapper was an oversimplification. It still basically sounds like just the sum of the current offerings.
↑ comment by Cole Wyeth (Amyr) · 2025-02-13T15:17:13.255Z · LW(p) · GW(p)
Wow, crazy timing for the GPT-5 announcement! I'll come back to that, but first the dates that you helpfully collected:
It's not clear to me that this timeline points in the direction you are arguing. Exponentially increasing time between "step" improvements in models would mean that progress rapidly slows to the scale of decades. In practice this would probably look like a new paradigm with more low-hanging fruit overtaking or extending transformers.
I think your point is valid in the sense that things were already slowing down by GPT-3 -> GPT-4, which makes my original statement at least potentially misleading. However, research and compute investment have also been ramping up drastically - I don't know by exactly how much, but I would guess nearly an order of magnitude? So the wait times here may not really be comparable.
Anyway, this whole speculative discussion will soon (?) be washed out when we actually see GPT-5. The announcement is perhaps a weak update against my position, but really the thing to watch is whether it is a qualitative improvement on the scale of previous GPT-N -> GPT-(N+1). If it is, then you are right that progress has not slowed down much. My standard is whether it starts doing anything important.
↑ comment by Julian Bradshaw · 2025-02-13T20:58:13.716Z · LW(p) · GW(p)
You're right that there's nuance here. The scaling laws involved mean exponential investment -> linear improvement in capability, so yeah it naturally slows down unless you go crazy on investment... and we are, in fact, going crazy on investment. GPT-3 is pre-ChatGPT, pre-current paradigm, and GPT-4 is nearly so. So ultimately I'm not sure it makes that much sense to compare the GPT1-4 timelines to now. I just wanted to note that we're not off-trend there.
comment by Vladimir_Nesov · 2025-02-14T20:16:29.426Z · LW(p) · GW(p)
It was reasonable to think that maybe transformers would just work and soon when we were racing through GPT-2, GPT-3, to GPT-4. We just aren't in that situation anymore
There remains about 2,000x in scaling of raw compute from GPT-4 (2e25 FLOPs) to $150bn training systems of 2028 (5e28 FLOPs), more in effective compute from improved architecture over 6 years[1]. That's exactly the kind of situation we were in between GPT-2, GPT-3, and GPT-4, not knowing what the subsequent levels of scaling would bring. So far the scaling experiment demonstrated significantly increasing capabilities, and we are not even 100x up from GPT-4 yet to get the first negative result.
[1] More than this on the same schedule would require much better capabilities, but this much seems plausible in any case; so this describes the scale of the experiment we'll get to see shortly in case capabilities actually stop improving - the strength of the negative result.
↑ comment by Cole Wyeth (Amyr) · 2025-02-14T21:03:01.985Z · LW(p) · GW(p)
Yeah, that sentence may have been too strong.
comment by Robert Cousineau (robert-cousineau) · 2025-02-13T06:11:04.789Z · LW(p) · GW(p)
I found this to be a valuable post!
I disagree with your conclusion though - the thoughts that come to my mind as to why are:
- You seem overly anchored on COT as the only scaffolding system in the near-mid future (2-5 years). While I'm uncertain what specific architectures will emerge, the space of possible augmentations (memory systems, tool use, multi-agent interactions, etc.) seems vastly larger than current COT implementations.
- Your crux that "LLMs have never done anything important" feels only mildly compelling. Anecdotally, many people do feel LLMs significantly improve their ability to do important and productive work, both work which requires creativity/cross-field information integration and work that does not.
Further, I am not aware of any large-scale ($10 million+) instances of people trying something like a better version of "Ask an LLM to list out, in context, fields it feels would be ripe for information integration leading to a breakthrough, and then do further reasoning on what those breakthroughs are / actually perform them."
Something like that seems like it would be an MVP of "actually try to get an LLM to come up with something significantly economically valuable." I expect that the lack of this type of experiment is because major AI labs feel that it would be choosing to exploit while there are still many gains to be made from exploring further architectural and scaffolding-esque improvements.
- Where you say "Certainly LLMs should be useful tools for coding, but perhaps not in a qualitatively different way than the internet is a useful tool for coding, and the internet didn't rapidly set off a singularity in coding speed.", I find this to be untrue both in terms of the impact of the internet (while it did not cause a short takeoff, it did dramatically increase the number of new programmers and the effective transfer of information between them. I expect without it we would see computers having <20% of their current economic impact), and in terms of the current and expected future impact of LLMs (LLMs simply are widely used by smart/capable programmers. I trust them to evaluate if it is noticeably better than StackOverflow/the rest of the internet).
↑ comment by Cole Wyeth (Amyr) · 2025-02-13T15:43:08.898Z · LW(p) · GW(p)
You seem overly anchored on COT as the only scaffolding system in the near-mid future (2-5 years). While I'm uncertain what specific architectures will emerge, the space of possible augmentations (memory systems, tool use, multi-agent interactions, etc.) seems vastly larger than current COT implementations.
COT (and particularly the extension tree of thoughts) seems like the strongest of those to me, probably because I can see an analogy to Solomonoff induction -> AIXI. I am curious whether you have some particular more sophisticated memory system in mind?
My point is that these are all things that might work, but there is no strong reason to think they will - particularly to the extent of being all that we need. AI progress is usually on the scale of decades and often comes from unexpected places (though for the main line, ~always involving a neural net in some capacity).
Something like that seems like it would be an MVP of "actually try to get an LLM to come up with something significantly economically valuable." I expect that the lack of this type of experiment is because major AI labs feel that it would be choosing to exploit while there are still many gains to be made from exploring further architectural and scaffolding-esque improvements.
I find this kind of hard to swallow - a huge number of people are using and researching LLMs, I suspect that if something like this "just works" we would know by now. I mean, it would certainly win a lot of acclaim for the first group to pull it off, so the incentives seem sufficient - and it doesn't seem that hard to pursue this in parallel to basic research on LLMs. Plus, the two investments are synergistic; for example, one would probably learn about the limitations of current models by pursuing this line. Maybe Anthropic is too small and focused to try it, but GDM could easily spin off a team.
Where you say "Certainly LLMs should be useful tools for coding, but perhaps not in a qualitatively different way than the internet is a useful tool for coding, and the internet didn't rapidly set off a singularity in coding speed.", I find this to be untrue both in terms of the impact of the internet (while it did not cause a short takeoff, it did dramatically increase the number of new programmers and the effective transfer of information between them. I expect without it we would see computers having <20% of their current economic impact), and in terms of the current and expected future impact of LLMs (LLMs simply are widely used by smart/capable programmers. I trust them to evaluate if it is noticeably better than StackOverflow/the rest of the internet).
I expect LLMs to offer significant advantages above the internet. I am simply pointing out that not every positive feedback loop is a singularity. I expect great coding assistants (essentially excellent autocomplete) but not drop-in replacements for software engineers any time soon. This is one factor that will increase the pace of AI research somewhat, but also Moore's law is running out, which will definitely slow the pace. Not sure which one wins out directionally.
comment by Vladimir_Nesov · 2025-02-14T20:35:38.251Z · LW(p) · GW(p)
We have seen LLMs scale to impressively general performance. This does not mean they will soon reach human level, because intelligence is not just a knob that needs to get turned further; it comprises qualitatively distinct functions.
Deep learning AI research isn't concerned with inventing AIs, it's concerned with inventing AI training processes. For human minds, the relevant training process is natural selection, which doesn't have nearly as many qualitatively distinct functions as human minds do. Scaling lets the same training process produce more qualitatively distinct functions in a resulting model.
For the straightforward thing to have a chance of working, training processes need to scale at all (which they historically often didn't), they need to be able to represent the computations that would exhibit the capabilities (long reasoning traces are probably sufficient to bootstrap general intelligence, given the right model weights), and the feasible range of scaling needs to actually encounter new capabilities.
It's unknown which capabilities will be encountered as we scale from 20 MW systems that trained GPT-4 to 5 GW training systems. This level of scaling will happen soon, by 2028-2030, regardless of its success in capabilities. If it fails to show significant progress, 2030-2040 will be slower without new ideas, though research funding is certainly skyrocketing and there are still whole orchards of low hanging fruit. So scaling of training systems concentrates the probability of significant capability progress into the next few years, whatever that probability is, relative to subsequent several years.
↑ comment by Cole Wyeth (Amyr) · 2025-02-14T20:45:58.838Z · LW(p) · GW(p)
Yes, I agree with almost all of that, particularly:
Deep learning AI research isn't concerned with inventing AIs, it's concerned with inventing AI training processes.
Technically, deep learning research is concerned with inventing AIs, but lately it does so through inventing AI training processes.
The only part I either don't understand or don't agree with is:
long reasoning traces are probably sufficient to bootstrap general intelligence, given the right model weights
Though a simple training process can certainly find diverse functions, I don't think the current paradigm will get all of the ones needed for AGI.
↑ comment by Vladimir_Nesov · 2025-02-14T21:08:43.528Z · LW(p) · GW(p)
Scaling of pretraining makes System 1 thinking stronger, o1/R1-like training might end up stitching together a functional System 2 applicable in general that wouldn't need to emerge fully formed as a result of iteratively applying System 1. If the resulting model is sufficiently competent to tinker with AI training processes [LW(p) · GW(p)], that might be all it takes for it to quickly fix remaining gaps in capability. In particular, if it's able to generate datasets and run some RL post-training in order to make further progress on particular problems, this might be a good enough crutch while online learning is absent and original ideas can only form as a result of fiddly problem specific RL post-training that needs to be set up first.
↑ comment by Cole Wyeth (Amyr) · 2025-02-14T23:00:21.005Z · LW(p) · GW(p)
It might go that way, but I don't see strong reasons to expect it.
↑ comment by Vladimir_Nesov · 2025-02-14T23:48:35.043Z · LW(p) · GW(p)
It's an argument about long reasoning traces having sufficient representational capacity to bootstrap general intelligence, not forecasting that the bootstrapping will actually occur. It's about a necessary condition for straightforward scaling to have a chance of getting there, at an unknown level of scale.
comment by Nate Showell · 2025-02-14T02:04:49.198Z · LW(p) · GW(p)
LLMs are more accurately described as artificial culture instead of artificial intelligence. They've been able to achieve the things they've achieved by replicating the secret of our success, and by engaging in much more extensive cultural accumulation (at least in terms of text-based cultural artifacts) than any human ever could. But cultural knowledge isn't the same thing as intelligence, hence LLMs' continued difficulties with sequential reasoning and planning.
comment by p.b. · 2025-02-13T19:41:55.377Z · LW(p) · GW(p)
Apparently[1] enthusiasm didn't really ramp up again until 2012, when AlexNet proved shockingly effective at image classification.
I think after the backpropagation paper was published in the eighties, enthusiasm did ramp up a lot, which led to a lot of important work in the nineties like (mature) CNNs, LSTMs, etc.
↑ comment by Cole Wyeth (Amyr) · 2025-02-13T19:51:41.369Z · LW(p) · GW(p)
I see - I mean, clearly AlexNet didn't just invent all the algorithms it relied on; I believe the main novel contribution was to train on GPUs and get it working well enough to blow everything else out of the water?
The fact that it took decades of research to go from the Perceptron to great image classification indicates to me that there might be further decades of research between holding an intelligent-ish conversation and being a human-level agent. This seems like the natural expectation given the story so far, no?
↑ comment by p.b. · 2025-02-13T20:48:33.951Z · LW(p) · GW(p)
I think AlexNet wasn't even the first to win computer vision competitions based on GPU-acceleration but that was definitely the step that jump-started Deep Learning around 2011/2012.
To me it rather seems like agency and intelligence are not very intertwined. Intelligence is the ability to create precise models - this does not imply that you use these models well or in a goal-directed fashion at all.
That we have now started the path down RLing the models to make them pursue the goal of solving math and coding problems in a more directed and effective manner implies to me that we should see inroads to other areas of agentic behavior as well.
Whether that will be slow going or done next year cannot really be decided based on the long history of slowly increasing the intelligence of models because it is not about increasing the intelligence of models.
↑ comment by Cole Wyeth (Amyr) · 2025-02-13T20:50:31.198Z · LW(p) · GW(p)
Our intuitions here should be informed by the historical difficulty of RL.
↑ comment by p.b. · 2025-02-13T21:10:57.926Z · LW(p) · GW(p)
But the historical difficulty of RL is based on models starting from scratch. Unclear whether moulding a model that already knows how to do all the steps into doing all the steps is anywhere as difficult as using RL to also learn how to do all the steps.
comment by Nick_Tarleton · 2025-02-13T20:19:36.935Z · LW(p) · GW(p)
The impressive performance we have obtained is because supervised (in this case technically "self-supervised") learning is much easier than e.g. reinforcement learning and other paradigms that naturally learn planning policies. We do not actually know how to overcome this barrier.
What about current reasoning models trained using RL? (Do you think something like, we don't know, and won't easily figure out, how to make that work well outside a narrow class of tasks that doesn't include 'anything important'?)
↑ comment by Cole Wyeth (Amyr) · 2025-02-13T20:23:51.435Z · LW(p) · GW(p)
Yes, that is what I think.
Edit: The class of tasks doesn't include autonomously doing important things such as making discoveries. It does include becoming a better coding assistant.
comment by Thane Ruthenis · 2025-02-13T19:27:55.209Z · LW(p) · GW(p)
I think that's a pretty plausible way the world could be, yes.
I still expect the Singularity somewhere in the 2030s, even under that model.
↑ comment by Cole Wyeth (Amyr) · 2025-02-13T19:52:51.074Z · LW(p) · GW(p)
I think things will be "interesting" by 2045 in one way or another - so it sounds like our disagreement is small on a log scale :)
↑ comment by Noosphere89 (sharmake-farah) · 2025-02-14T19:23:02.720Z · LW(p) · GW(p)
This is basically Kurzweilian timelines.
IMO, my latest dates are from 2040-2050, and if it doesn't happen by then, then I'll consider AI to likely never reach what people on LW thought.
In my lifetime.
↑ comment by Cole Wyeth (Amyr) · 2025-02-14T19:38:46.322Z · LW(p) · GW(p)
Yes, transhumanists used to say 2045 and it was considered a bit aggressive, times have changed!
IMO, my latest dates are from 2040-2050, and if it doesn't happen by then, then I'll consider AI to likely never reach what people on LW thought.
What? I have a good 20-25% on AGI a few decades after we understand the brain, and the former could easily be 100-250 years out. Probably other stuff accelerates a lot by then but who knows!
↑ comment by Noosphere89 (sharmake-farah) · 2025-02-14T19:54:43.650Z · LW(p) · GW(p)
What? I have a good 20-25% on AGI a few decades after we understand the brain, and the former could easily be 100-250 years out. Probably other stuff accelerates a lot by then but who knows!
Fair enough, reworded that statement.
comment by Henrik Kjeldsen (henrik-kjeldsen) · 2025-02-14T16:14:42.550Z · LW(p) · GW(p)
Perhaps cognition is a collection of problems with exponential complexity.
In that case we would see some success, but not be able to scale to much that is really useful.
Also, the brain would have to not be a Turing machine, and the strong physical Church-Turing thesis would be false.
Seems simpler maybe?
↑ comment by Cole Wyeth (Amyr) · 2025-02-14T19:40:58.858Z · LW(p) · GW(p)
I'm not prepared to throw out my metaphysics to explain that sometimes research takes a few decades.