The idea that ChatGPT is simply “predicting” the next word is, at best, misleading

post by Bill Benzon (bill-benzon) · 2023-02-20T11:32:06.635Z · LW · GW · 87 comments

Cross-posted from New Savanna.

But it may also be flat-out wrong. We’ll see when we get a better idea of how inference works in the underlying language model. 

* * * * * 
 

Yes, I know that ChatGPT is trained by having it predict the next word, and the next, and the next, for billions and billions of words. The result of all that training is that ChatGPT builds up a complex structure of weights on the 175 billion parameters of its model. It is that structure that emits word after word during inference. Training and inference are two different processes, but that point is not well-made in accounts written for the general public. 
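
To make that distinction concrete, here is a minimal sketch using a toy character-level bigram model (nothing like ChatGPT's actual code or scale; the corpus and model are invented for illustration). Training fits the parameters from data; inference leaves the parameters frozen and just samples one token after another.

```python
# Toy illustration of training vs. inference (not ChatGPT's actual mechanics).
import random
from collections import Counter, defaultdict

corpus = "once upon a time there was a princess. once upon a time there was a dragon."

# --- Training: estimate P(next char | current char) by counting. ---
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# The "parameters" of this toy model: one conditional distribution per character.
params = {
    prev: {c: n / sum(ctr.values()) for c, n in ctr.items()}
    for prev, ctr in counts.items()
}

# --- Inference: parameters are frozen; we repeatedly sample the next character. ---
def generate(seed_char, n_steps=40, seed=0):
    rng = random.Random(seed)
    out = [seed_char]
    for _ in range(n_steps):
        dist = params.get(out[-1])
        if not dist:
            break
        chars, probs = zip(*dist.items())
        out.append(rng.choices(chars, weights=probs)[0])
    return "".join(out)

print(generate("o"))
```

The real model differs in every particular – 175 billion continuous weights fit by gradient descent over a transformer, conditioning on a long context rather than one character – but the separation of the two phases is the same: training shapes the weights, inference only reads them.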

Let's get back to the main thread.
 

I maintain, for example, that when ChatGPT begins a story with the words “Once upon a time,” which it does fairly often, that it “knows” where it is going and that its choice of words is conditioned on that “knowledge” as well as upon the prior words in the stream. It has invoked a ‘story-telling procedure’ and that procedure conditions its word choice. Just what that procedure is, and how it works, I don’t know, nor do I know how it is invoked. I do know that it is not invoked by the phrase “once upon a time” since ChatGPT doesn’t always use that phrase when telling a story. Rather, that phrase is called up through the procedure.

Consider an analogy from jazz. When I set out to improvise a solo on, say, “A Night in Tunisia,” I don’t know what notes I’m going to play from moment to moment, much less do I know how I’m going to end, though I often know when I’m going to end. How do I know that? That’s fixed by the convention in place at the beginning of the tune; that convention says how many choruses you’re going to play. So, I’ve started my solo. My note choices are, of course, conditioned by what I’ve already played. But they’re also conditioned by my knowledge of when the solo ends.

Something like that must be going on when ChatGPT tells a story. It’s not working against time in the way a musician is, but it does have a sense of what is required to end the story. And it knows what it must do, what kinds of events must take place, in order to get from the beginning to the end. In particular, I’ve been working with stories where the trajectories have five segments: Donné, Disturb, Plan, Execute, Celebrate. The whole trajectory is ‘in place’ when ChatGPT begins telling the story. If you think of the LLM as a complex dynamical system, then the trajectory is a valley in the system’s attractor landscape.

Nor is it just stories. Surely it enacts a different trajectory when you ask it a factual question, or request a recipe (as I recently did, for a Cornish pasty), or ask it to generate some computer code.

With that in mind, consider a passage from a recent video by Stephen Wolfram (note: Wolfram doesn’t start speaking until about 9:50):

Starting at roughly 12:16, Wolfram explains:

It is trying to write reasonably; it is trying to take an initial piece of text that you might give and is trying to continue that piece of text in a reasonable human-like way, that is sort of characteristic of typical human writing. So, you give it a prompt, you say something, you ask something, and, it’s kind of thinking to itself, “I’ve read the whole web, I’ve read millions of books, how would those typically continue from this prompt that I’ve been given? What’s the reasonable expected continuation based on some kind of average of a few billion pages from the web, a few million books and so on.” So, that’s what it’s always trying to do, it’s always trying to continue from the initial prompt that it’s given. It’s trying to continue in a statistically sensible way.

Let’s say that you had given it, you had said initially, “The best thing about AI is its ability to...” Then ChatGPT has to ask, “What’s it going to say next?”

I don’t have any problem with that (which, BTW, is similar to a passage near the beginning of his recent article, What Is ChatGPT Doing … and Why Does It Work?). Of course ChatGPT is “trying to continue in a statistically sensible way.” We’re all more or less doing that when we speak or write, though there are times when we may set out to be deliberately surprising – but we can set such complications aside. My misgivings set in with this next statement:

Now one thing I should explain about ChatGPT, that’s kind of shocking when you first hear about this, is that those essays that it’s writing, it’s writing one word at a time. As it writes each word it doesn’t have a global plan about what’s going to happen. It’s simply saying “what’s the best word to put down next based on what I’ve already written?”

It's the italicized passage that I find problematic. That story trajectory looks like a global plan to me. It is a loose plan, it doesn’t dictate specific sentences or words, but it does specify general conditions which are to be met.

Now, much later in his talk Wolfram will say something like this (I don’t have the timestamp; I’m quoting from his paper):

If one looks at the longest path through ChatGPT, there are about 400 (core) layers involved—in some ways not a huge number. But there are millions of neurons—with a total of 175 billion connections and therefore 175 billion weights. And one thing to realize is that every time ChatGPT generates a new token, it has to do a calculation involving every single one of these weights.

If ChatGPT visits every parameter each time it generates a token, that sure looks “global” to me. What is the relationship between these global calculations and those story trajectories? I surely don’t know. 

Perhaps it’s something like this: A story trajectory is a valley in the LLM’s attractor landscape. When it tells a story it enters the valley at one end and continues through to the end, where it exits the valley. That long circuit that visits each of those 175 billion weights in the course of generating each token is what keeps it in the valley until it reaches the other end.

I am reminded, moreover, of the late Walter Freeman’s conception of consciousness as arising through discontinuous whole-hemisphere states of coherence succeeding one another at a “frame rate” of 6 Hz to 10 Hz – something I discuss in “Ayahuasca Variations” (2003). It’s the whole hemisphere aspect that’s striking (and somewhat mysterious) given the complex connectivity across many scales and the relatively slow speed of neural conduction.

* * * * *

I was alerted to this issue by a remark made at the blog, Marginal Revolution. On December 20, 2022, Tyler Cowen had linked to an article by Murray Shanahan, Talking About Large Language Models. A commenter named Nabeel Q remarked:

LLMs are *not* simply “predicting the next statistically likely word”, as the author says. Actually, nobody knows how LLMs work. We do know how to train them, but we don’t know how the resulting models do what they do. 

Consider the analogy of humans: we know how humans arose (evolution via natural selection), but we don’t have perfect models of how humans worked; we have not solved psychology and neuroscience yet! A relatively simple and specifiable process (evolution) can produce beings of extreme complexity (humans).

Likewise, LLMs are produced by a relatively simple training process (minimizing loss on next-token prediction, using a large training set from the internet, Github, Wikipedia etc.) but the resulting 175 billion parameter model is extremely inscrutable.

So the author is confusing the training process with the model. It’s like saying “although it may appear that humans are telling jokes and writing plays, all they are actually doing is optimizing for survival and reproduction”. This fallacy occurs throughout the paper.

This is why the field of “AI interpretability” exists at all: to probe large models such as LLMs, and understand how they are producing the incredible results they are producing.

I don’t have any reason to think Wolfram was subject to that confusion. But I think many people are. I suspect that the general public, including many journalists reporting on machine learning, aren’t even aware of the distinction between training the model and using it to make inferences. One simply reads that ChatGPT, or any other comparable LLM, generates text by predicting the next word.

This mis-communication is a MAJOR blunder.

87 comments

Comments sorted by top scores.

comment by tgb · 2023-02-20T17:25:44.091Z · LW(p) · GW(p)

Maybe I don't understand what exactly your point is, but I'm not convinced. AFAIK, it's true that GPT has no state outside of the list of tokens so far. Contrast that with your jazz example, where you, in fact, have hidden thoughts outside of the notes played so far. I think this is what Wolfram and others are saying when they say that "GPT predicts the next token". You highlight "it doesn’t have a global plan about what’s going to happen" but I think a key point is that whatever plan it has, it has to build it up entirely from "Once upon a" and then again, from scratch, at "Once upon a time," and again and again. Whatever plan it makes is derived entirely from "Once upon a time," and could well change dramatically at "Once upon a time, a" even if " a" was its predicted token. That's very different from what we think of as a global plan that a human writing a story makes.

The intuition of "just predicting one token ahead" yields useful explanations, like why the strategy of having it explain itself first and then give the answer works. I don't see how this post fits with that observation or what other observations it clarifies.

Replies from: JBlack, orenda-grayhall-name, max-loh, bill-benzon
comment by JBlack · 2023-02-21T03:32:52.081Z · LW(p) · GW(p)

I don't think the human concept of 'plan' is even a sensible concept to apply here. What it has is in many ways very much like a human plan, and in many other ways utterly unlike a human plan.

One way in which you could view them as similar is that just as there is a probability distribution over single token output (which may be trivial for zero temperature), there is a corresponding probability distribution over all sequences of tokens. You could think of this distribution as a plan with decisions yet to be made. For example, there may be some small possibility of continuing to "Once upon a horse, you may be concerned about falling off", but by emitting " time" it 'decides' not to pursue such options and mostly focuses on writing a fairy tale instead.
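
To illustrate (with toy numbers invented for the example, not taken from any model): the per-token distribution implicitly defines a distribution over whole continuations, and emitting a token prunes and renormalizes that distribution.

```python
# Toy numbers only: how emitting " time" "decides" against the "horse" branch.
continuations = {
    ("Once", "upon", "a", "time", ","): 0.55,
    ("Once", "upon", "a", "time", "there"): 0.30,
    ("Once", "upon", "a", "horse", ","): 0.10,
    ("Once", "upon", "a", "midnight", "dreary"): 0.05,
}

def next_token_dist(prefix):
    """Marginal distribution over the next token, given a prefix."""
    totals = {}
    for seq, p in continuations.items():
        if seq[:len(prefix)] == tuple(prefix):
            tok = seq[len(prefix)]
            totals[tok] = totals.get(tok, 0.0) + p
    z = sum(totals.values())
    return {tok: p / z for tok, p in totals.items()}

print(next_token_dist(["Once", "upon", "a"]))
# ~{'time': 0.85, 'horse': 0.10, 'midnight': 0.05}
print(next_token_dist(["Once", "upon", "a", "time"]))
# ~{',': 0.65, 'there': 0.35} -- the non-fairy-tale branches are gone
```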

However, this future structure is not explicitly modelled anywhere, as far as I know. It's possible that some model might have a "writing a fairy tale" neuron in there somewhere, linked to others that represent describable aspects of the story so far and others yet to come, and which increases the weighting of the token " time" after "Once upon a". I doubt there's anything so directly interpretable as that, but I think it's pretty certain that there are some structures in activations representing clusters of continuations past the current generation token.

Should we call those structures "plans" or not?

If so, are these plans recreated from scratch? Well in the low-level implementation sense yes, since these types of LLM are stateless. However we're quite familiar with other systems that implement persistent state transitions via stateless underlying protocols, and the generated text can serve as a 'cookie' across thousands of tokens. The distinction between creation of plans from scratch and persistence of plans between generations isn't so clear in this case.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-21T04:14:17.681Z · LW(p) · GW(p)

However, this future structure is not explicitly modelled anywhere, as far as I know. It's possible that some model might have a "writing a fairy tale" neuron in there somewhere, linked to others that represent describable aspects of the story so far and others yet to come, and which increases the weighting of the token " time" after "Once upon a". I doubt there's anything so directly interpretable as that, but I think it's pretty certain that there are some structures in activations representing clusters of continuations past the current generation token.

More like a fairy tale region than a neuron. And once the system enters that region it stays there until the story is done.

Should we call those structures "plans" or not?

In the context of this discussion, I can live with that.

comment by oreghall (orenda-grayhall-name) · 2023-02-22T09:22:58.790Z · LW(p) · GW(p)

I believe the primary point is to dissuade people who are dismissive of LLM intelligence. Predicting the next token is not as simple as it sounds: it requires not only understanding the past but also consideration of the future. The fact that it re-imagines this future with every token it writes is honestly even more impressive, though it is clearly a limitation in terms of keeping a coherent idea.

comment by Max Loh (max-loh) · 2023-02-27T17:54:38.623Z · LW(p) · GW(p)

Whether it has a global "plan" is irrelevant as long as it behaves like someone with a global plan (which it does). Consider the thought experiment where I show you a block of text and ask you to come up with the next word. After you come up with the next word, I rewind your brain to before the point where I asked you (so you have no memory of coming up with that word) and repeat ad infinitum. If you are skeptical of the "rewinding" idea, just imagine a simulated brain and we're restarting the simulation each time. You couldn't have had a global plan because you had no memory of each previous step. Yet the output would still be totally logical. And as long as you're careful about each word choice at each step, it is scientifically indistinguishable from someone with a "global plan". That is similar to what GPT is doing.

comment by Bill Benzon (bill-benzon) · 2023-02-20T19:01:48.133Z · LW(p) · GW(p)

""Once upon a time," and could well change dramatically at "Once upon a time, a" even if " a" was its predicted token. That's very different from what we think of as a global plan that a human writing a story makes."

Why does it tell the same kind of story every time I prompt it: "Tell me a story"? And I'm talking about different sessions, not several times in one session. It takes a trajectory that has the same segments. It starts out giving initial conditions. Then there is some kind of disturbance. After that the protagonist thinks and plans and travels to the point of the disturbance. We then have a battle, with the protagonist winning. Finally, there is a celebration. That looks like a global plan to me.

Such stories (almost?) always have fantasy elements, such as dragons, or some magic creature. If you want to eliminate those, you can do so: "Tell me a realistic story." "Tell me a sad story," is a different kind of story. And if you prompt it with: "Tell me a true story", that's still different, often very short, only a paragraph or three.

I'm tempted to say, "forget about a human global plan," but, I wonder. The global plan a human makes is, after all, a consequence of that person's past actions. Such a global plan isn't some weird emanation from the future. 

Furthermore, it's not entirely clear why a person's 'hidden thoughts' should differentiate us from a humongous LLM. Just what do you mean by 'hidden' thoughts? Where do they hide? Under the bed, in the basement, perhaps somewhere deep in the woods, maybe? I'm tempted to say that there are no such things as hidden thoughts, that's just a way of talking about something we don't understand.

Replies from: tgb, JBlack
comment by tgb · 2023-02-20T20:31:51.307Z · LW(p) · GW(p)

Suppose I write the first half of a very GPT-esque story. If I then ask GPT to complete that story, won't it follow exactly the same structure as always? If so, how can you say that came from a plan - it didn't write the first half of the story! That's just what stories look like. Is that more surprising than a token predictor getting basic sentence structure correct?

For hidden thoughts, I think this is very well defined. It won't be truly 'hidden', since we can examine every node in GPT, but we know for a fact that GPT is purely a function of the current stream of tokens (unless I am quite mistaken!). A hidden plan would look like some other state that GPT carries from token to token that is not output. I don't think OpenAI engineers would have a hard time making such a model and it may then really have a global plan that travels from one token to the next (or not; it would be hard to say). But how could GPT? It has nowhere to put the plan except for plain sight.

Or: does AlphaGo have a plan? It explicitly considers future moves, but it does just as well if you give it a Go board in a particular state X as it would if it played a game that happened to reach state X. If there is a 'plan' that it made, it wrote that plan on the board and nothing is hidden. I think it's more helpful and accurate to describe AlphaGo as "only" picking the best next move rather than planning ahead - but doing a good enough job of picking the best next move means you pick moves that have good follow up moves.

Replies from: bill-benzon, nikola-smolenski
comment by Bill Benzon (bill-benzon) · 2023-02-20T21:58:29.416Z · LW(p) · GW(p)

For hidden thoughts, I think this is very well defined.

Not for humans, and that's what I was referring to. Sorry about the confusion. 

"Thought" is just a common-sense idea. As far as I know, we don't have a well-defined concept of that that's stated in terms of brain states. Now, I believe Walter Freeman has conjectured that thoughts reflect states of global coherence across a large swath of cortex, perhaps a hemisphere, but that's a whole other intellectual world.

If so, how can you say that came from a  plan - it didn't write the first half of the story!

But it read it, no? Why can't it complete it according to its "plan," since it has no way of knowing the intentions of the person who wrote the first half?

Let me come at this a different way. I don't know how many times I've read articles of the "computers for dummies" type where it said it's all just ones and zeros. And that's true. Source code may be human-readable, but when it's compiled all the comments are stripped out and the rest is converted to ones and zeros. What does that tell you about a program? It depends on your point of view and what you know. From a very esoteric and abstract point of view, it tells you a lot. From the point of view of someone reading Digital Computing for Dummies, it doesn't tell them much of anything.

I feel a bit like that about the assertion that LLMs are just next-token-predictors. Taking that in conjunction with the knowledge that they're trained on zillions of tokens of text, those two things put together don't tell you much either. If those two statements were deeply informative, then mechanistic interpretability would be trivial. It's not. Saying that LLMs are next-token predictors puts a kind of boundary on mechanistic interpretability, but it doesn't do much else. And saying it was trained on all these texts, that doesn't tell you much about the structure the model has picked up.

What intellectual work does that statement do?

Replies from: tgb
comment by tgb · 2023-02-21T12:27:01.423Z · LW(p) · GW(p)

I gave one example of the “work” this does: that GPT performs better when prompted to reason first rather than state the answer first. Another  example is: https://www.lesswrong.com/posts/bwyKCQD7PFWKhELMr/by-default-gpts-think-in-plain-sight [LW · GW]

On the contrary, you mainly seem to be claiming that thinking of LLMs as working one token at a time is misleading, but I’m not sure I understand any examples of misleading conclusions that you think people draw from it. Where do you think people go wrong?

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-21T13:05:42.584Z · LW(p) · GW(p)

Over there in another part of the universe there are people who are yelling that LLMs are "stochastic parrots." Their intention is to discredit LLMs as dangerous evil devices. Not too far away from those folks are those saying it's "autocomplete on steroids." That's only marginally better.

Saying LLMs are "next word predictors" feeds into that. Now, I'm talking about rhetoric here, not intellectual substance. But rhetoric matters. There needs to be a better way of talking about these devices for a general audience.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-21T13:27:12.773Z · LW(p) · GW(p)

Oh, thanks for the link. It looks interesting.

comment by Nikola Smolenski (nikola-smolenski) · 2023-02-24T04:58:34.766Z · LW(p) · GW(p)

Perhaps you could simply ask ChatGPT? "Please tell me a story without making any plans about the story beforehand." vs "Please make a plan for a story, then tell me the story, and attach your plan at the end of the story." Will the resulting stories differ, and how? My prediction: the plan attached at the end of the story won't be very similar to the actual story.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-24T11:07:54.966Z · LW(p) · GW(p)

I’ve run the experiment. The first story seemed typical, though longer than the ones it was producing in January. It’s running the Feb 13 version. But that’s been generally the case. Of course I have no way of knowing whether or not it actually did the requested planning activity. I will note, however, that when I give it a minimal prompt (“Tell me a story.”) it has always, in 2 months, produced a story with fairy-tale elements. This prompt is obviously more elaborate, but it contains nothing to specify the type of story and so is, in that sense, like the minimal prompt.

Here’s the story:

Please tell me a story without making any plans about the story beforehand.

Once upon a time, there was a small village nestled at the base of a majestic mountain range. The people of the village were hardworking and content, but they had always been curious about the mountains that towered over them. They would often gaze up at the peaks, wondering what mysteries lay hidden within their craggy peaks and rocky outcroppings.

One day, a young girl named Aria decided to climb the mountain. She had always been adventurous, and the lure of the unknown was too strong to resist. She packed a small bag with food and water, put on her sturdy boots, and set out alone into the wilderness.

The climb was difficult, and Aria had to use all her strength and skill to make it up the steep slopes. But as she climbed higher and higher, she began to feel a sense of exhilaration that she had never felt before. The wind whipped through her hair, the sun beat down on her face, and the world seemed to stretch out endlessly before her.

As the sun began to set, Aria reached the top of the mountain. She looked out at the world below her, and her heart swelled with pride and wonder. She had conquered the mountain, and now she knew the secrets that lay hidden within its rocky peaks.

But as she turned to begin the descent back to the village, she saw something that stopped her in her tracks. In the distance, she could see a dark cloud of smoke rising from the valley below. She knew that something terrible must have happened, and she felt a sense of urgency welling up inside her.

Without hesitation, Aria began to run down the mountain, her boots pounding against the rocky ground. She reached the village just as the last light was fading from the sky, and she saw that her worst fears had come true. A fire had broken out in the heart of the village, and the people were in a state of panic.

But Aria was not afraid. She had climbed the mountain, and she knew that she had the strength and courage to face any challenge that came her way. She rallied the people of the village, organizing them into teams to fight the fire and protect their homes. And as the flames were finally extinguished and the smoke cleared, the people of the village looked up at Aria with newfound respect and admiration.

For she had shown them that anything was possible, as long as you had the courage to try.

I then refreshed the page and ran your second prompt. The result is not what you predicted. It responded by first posting its plan. It then told the story, which matched the plan. It then started to list the plan at the end, as the prompt requested, but stopped cold while listing the characters. I’m not sure what to conclude about that. 

I do like the idea of asking it to plan before telling the story.

Here's the response:

Please make a plan for a story, then tell me the story, and attach your plan at the end of the story.

Plan for the story:

Title: The Lost Treasure

Characters:

1. Alice - a young adventurer who is passionate about exploring the unknown.
2. Captain Jack - a seasoned sailor who has been searching for a lost treasure for years.
3. Dr. Smith - an archeologist who has extensive knowledge of the treasure's history.

Plot:
Alice sets out on a mission to find a lost treasure that is said to be worth millions of dollars. She joins forces with Captain Jack, who has been searching for the treasure for years. Dr. Smith, an archeologist who has extensive knowledge of the treasure's history, joins their quest. Together, they travel to a remote island where the treasure is believed to be hidden.

Once on the island, they face several obstacles, including dangerous terrain, wildlife, and rival treasure hunters. Alice, Captain Jack, and Dr. Smith work together to overcome these challenges and finally reach the treasure's location. However, they discover that the treasure is guarded by a series of intricate puzzles and traps that were designed to keep intruders away.

With Dr. Smith's expertise and Alice's quick thinking, they solve the puzzles and make it past the traps to the treasure's location. But just as they are about to lay their hands on the treasure, they realize that they are not the only ones on the island. A group of rival treasure hunters has also made it to the island and is hot on their heels.

Alice, Captain Jack, and Dr. Smith must work quickly to get their hands on the treasure before the rival treasure hunters catch up to them. In the end, they succeed in getting the treasure and escape from the island safely.

The story:

Alice was a young adventurer who was always looking for her next big adventure. She had heard stories of a lost treasure that was said to be worth millions of dollars and decided to set out on a mission to find it.

Alice knew that she couldn't do it alone, so she reached out to Captain Jack, a seasoned sailor who had been searching for the treasure for years. Captain Jack was initially reluctant to take Alice on his journey, but he saw her passion for adventure and agreed to let her join him.

Together, they traveled to a remote island where the treasure was believed to be hidden. They also recruited the help of Dr. Smith, an archeologist who had extensive knowledge of the treasure's history.

Once on the island, Alice, Captain Jack, and Dr. Smith faced several obstacles, including dangerous terrain, wildlife, and rival treasure hunters. However, they worked together to overcome these challenges and finally reached the treasure's location.

When they arrived at the treasure's location, they realized that the treasure was guarded by a series of intricate puzzles and traps that were designed to keep intruders away. With Dr. Smith's expertise and Alice's quick thinking, they solved the puzzles and made it past the traps to the treasure's location.

But just as they were about to lay their hands on the treasure, they heard a group of rival treasure hunters approaching. Alice, Captain Jack, and Dr. Smith had to work quickly to get their hands on the treasure before the rival treasure hunters caught up to them.

In the end, they succeeded in getting the treasure and escaped from the island safely. Alice was thrilled to have found the lost treasure, and Captain Jack was pleased to have finally achieved his lifelong goal. Dr. Smith was also thrilled with the discovery and looked forward to studying the treasure's history in detail.

Plan for the story:
Title: The Lost Treasure
Characters:

1. Alice - a young adventurer who is passionate about exploring the unknown.
2. Captain Jack - a seasoned sailor who has been searching for a lost treasure for years.
3. Dr. Smith - an archeologist who has

Replies from: nikola-smolenski
comment by Nikola Smolenski (nikola-smolenski) · 2023-02-24T21:22:38.762Z · LW(p) · GW(p)

Perhaps it wouldn't have written the plan first if you explicitly asked it not to. It guessed that you'd want it, I guess.

Very interesting! If it can write a story plan, and a story that follows the plan, then it can write according to a plan, even if it usually doesn't.

But if these responses are typical, and stories written without a plan are similar to stories written with a plan, I take it to mean that all stories have a plan, which further means that it didn't actually follow your first prompt. It either didn't "want" to write a story without a plan, or, more likely, it couldn't, which means that not only does ChatGPT write according to a plan, it can't write in any other way!

Another interesting question is how far could this kind of questioning be taken? What if you ask it to, for example, write a story and, after each paragraph, describe its internal processes that led it to writing that paragraph?

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-24T23:39:26.133Z · LW(p) · GW(p)

"What if you ask it to , for example, write a story and, after each paragraph, describe its internal processes that led it to writing that paragraph?"

Two possibilities: 1) It would make something up. 2) It would explain that it's an AI yada yada...

comment by JBlack · 2023-02-21T03:52:50.234Z · LW(p) · GW(p)

Human thoughts are "hidden" in the sense that they exist separately from the text being written. They will correlate somewhat with that text of course, but they aren't completely determined by it.

The only state for GPT-like models is that which is supplied in the previous text. They don't have any 'private' state at all, not even between one token and the next. This is a very clear difference, and does in both principle and practice constrain their behaviour.

Replies from: ReaderM
comment by ReaderM · 2023-12-10T02:42:43.727Z · LW(p) · GW(p)

They can compute a state prior to each generated token and they can choose a token that signals a preservation of this state.

comment by rpglover64 (alex-rozenshteyn) · 2023-02-20T14:57:12.009Z · LW(p) · GW(p)

One minor objection I have to the contents of this post is the conflation of models that are fine-tuned (like ChatGPT) and models that are purely self-supervised (like early GPT3); the former has no pretenses of doing only next token prediction.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-20T19:03:41.032Z · LW(p) · GW(p)

But the core LLM is pretty much the same, no? It doesn't have some special sauce that allows it to act differently.

Replies from: hastings-greer
comment by Hastings (hastings-greer) · 2023-02-20T19:57:03.738Z · LW(p) · GW(p)

Assuming that it was fine tuned with RLHF (which OpenAI has hinted at with much eyebrow wiggling but not to my knowledge confirmed) then it does have some special sauce. Roughly,

- if it's at the beginning of a story, 

-and the base model predicts ["Once": 10%, "It": 10%, ...  "Happy": 5% ...] 

-and then during RLHF, the 10% of the time it starts with "Once" it writes a generic story and gets lots of reward, but when it outputs "Happy"  it tries to write in the style of Tolstoy and bungles it, getting little reward

=> it will  update to output Once more often in that situation. 

The KL divergence between successive updates is bounded by the PPO algorithm, but over many updates it can shift from ["Once": 10%, "It": 10%, ...  "Happy": 5% ...] to ["Once": 90%, "It": 5%, ...  "Happy": 1% ...] if the final results from starting with Once are reliably better.

It's hard to say if that means it's planning to write a generic story because of an agentic desire to become a hack and please the masses, but certainly it's changing its output distribution based on what happened many tokens in the future.

Replies from: gwern, bill-benzon
comment by gwern · 2023-02-20T20:24:45.455Z · LW(p) · GW(p)

One wrinkle is that (sigh) it's not just a KL constraint anymore: now it's a KL constraint and also some regular log-likelihood training on original raw data to maintain generality: https://openai.com/blog/instruction-following/ https://arxiv.org/pdf/2203.02155.pdf#page=15

A limitation of this approach is that it introduces an “alignment tax”: aligning the models only on customer tasks can make their performance worse on some other academic NLP tasks. This is undesirable since, if our alignment techniques make models worse on tasks that people care about, they’re less likely to be adopted in practice. We’ve found a simple algorithmic change that minimizes this alignment tax: during RL fine-tuning we mix in a small fraction of the original data used to train GPT-3, and train on this data using the normal log likelihood maximization.[ We found this approach more effective than simply increasing the KL coefficient.] This roughly maintains performance on safety and human preferences, while mitigating performance decreases on academic tasks, and in several cases even surpassing the GPT-3 baseline.

Also, I think you have it subtly wrong: it's not just a KL constraint each step. (PPO already constrains step size.) It's a KL constraint for total divergence from the original baseline supervised model: https://arxiv.org/pdf/2009.01325.pdf#page=6 https://arxiv.org/abs/1907.00456 So it does have limits to how much it can shift probabilities in toto.
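
For concreteness, here is a schematic sketch of the objective being described (dummy numbers throughout; the coefficients, log-probs, and reward-model score below are stand-ins, not values from any real run): a reward-model score, a penalty on total divergence from the frozen reference (SFT) model, plus the mixed-in pretraining log-likelihood term of the InstructGPT "PPO-ptx" variant quoted above.

```python
# Schematic only: the KL-shaped RLHF objective, with made-up numbers.
import numpy as np

beta = 0.02    # KL penalty coefficient (assumed value)
gamma = 0.5    # pretraining-mix coefficient (assumed value)

# Per-token log-probs of a sampled completion under the current policy
# and under the frozen reference model (stand-ins for real model outputs).
logp_policy = np.array([-1.2, -0.7, -2.1, -0.4])
logp_ref    = np.array([-1.5, -0.9, -1.8, -0.6])

reward_model_score = 1.3      # scalar reward-model score for the completion (stand-in)
logp_pretrain_batch = -3.7    # policy log-likelihood on a pretraining batch (stand-in)

# Single-sample estimate of the sequence-level divergence from the reference model.
kl_estimate = np.sum(logp_policy - logp_ref)

shaped_reward = reward_model_score - beta * kl_estimate
objective = shaped_reward + gamma * logp_pretrain_batch  # what PPO then ascends

print(f"KL estimate: {kl_estimate:.3f}")
print(f"Shaped reward: {shaped_reward:.3f}")
print(f"PPO-ptx style objective: {objective:.3f}")
```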

comment by Bill Benzon (bill-benzon) · 2023-02-20T20:07:44.553Z · LW(p) · GW(p)

Interesting. I'll have to think about it a bit. But I don't have to think at all to nix the idea of agentic desire.

comment by rif a. saurous (rif-a-saurous) · 2023-02-21T21:58:46.737Z · LW(p) · GW(p)

@Bill Benzon [LW · GW]:  A thought experiment. Suppose you say to ChatGPT "Think of a number between 1 and 100, but don't tell me what it is. When you've done so, say 'Ready' and nothing else. After that, I will ask you yes / no questions about the number, which you will answer truthfully."

After ChatGPT says "Ready", do you believe a number has been chosen? If so, do you also believe that whatever "yes / no" sequence of questions you ask, they will always be answered consistently with that choice? Put differently, you do not believe that the particular choice of questions you ask can influence what number was chosen?

FWIW, I believe that no number gets chosen when ChatGPT says "Ready," that the number gets chosen during the questions (hopefully consistently) and that, starting ChatGPT from the same random seed and otherwise assuming deterministic execution, different sequences of questions or different temperatures or different random modifications to the "post-Ready seed" (this is vague but I assume comprehensible) could lead to different "chosen numbers."

(The experiment is not-trivial to run since it requires running your LLM multiple times with the same seed or otherwise completely copying the state after the LLM replies "Ready.")
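
For what it's worth, a minimal sketch of how one might set the experiment up with an open model, since ChatGPT itself exposes neither seeds nor internal state (the model name, prompts, and sampling settings below are arbitrary stand-ins):

```python
# Sketch: replay different question sequences from the same RNG state.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any open chat-tuned model would be more apt
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prefix = ("Think of a number between 1 and 100, but don't tell me what it is. "
          "When you've done so, say 'Ready'.\nReady.\n")

def continue_from(question, seed=0):
    # Reset the RNG to the same state before each continuation, so the only
    # difference between runs is the question appended after "Ready".
    torch.manual_seed(seed)
    inputs = tok(prefix + question, return_tensors="pt")
    out = model.generate(**inputs, do_sample=True, temperature=0.8,
                         max_new_tokens=20, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# If a number were truly fixed at "Ready", answers to different question
# sequences should all be consistent with a single number.
print(continue_from("Is the number greater than 50? "))
print(continue_from("Is the number even? "))
```

Resetting the RNG before each continuation is only a rough proxy for "copying the state after 'Ready'", but it lets you check whether different question sequences stay consistent with a single number.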

Replies from: JBlack, bill-benzon, max-loh
comment by JBlack · 2023-02-22T00:22:49.777Z · LW(p) · GW(p)

This is a very interesting scenario, thank you for posting it!

I suspect that ChatGPT can't even be relied upon to answer in a manner that is consistent with having chosen a number.

In principle a more capable LLM could answer consistently, but almost certainly won't "choose a number" at the point of emitting "Ready" (even with temperature zero). The subsequent questions will almost certainly influence the final number, and I suspect this may be a fundamental limitation of this sort of architecture.

comment by Bill Benzon (bill-benzon) · 2023-02-21T22:23:52.340Z · LW(p) · GW(p)

Very interesting. I suspect you are right about this:

FWIW, I believe that no number gets chosen when ChatGPT says "Ready," that the number gets chosen during the questions (hopefully consistently) and that, starting ChatGPT from the same random seed and otherwise assuming deterministic execution, different sequences of questions or different temperatures or different random modifications to the "post-Ready seed" (this is vague but I assume comprehensible) could lead to different "chosen numbers."

Replies from: rif-a-saurous
comment by rif a. saurous (rif-a-saurous) · 2023-02-21T23:09:24.764Z · LW(p) · GW(p)

But if I am right and ChatGPT isn't choosing a number before it says "Ready," why do you think that ChatGPT "has a plan?" Is the story situation crucially different in some way? 

Replies from: JBlack, bill-benzon
comment by JBlack · 2023-02-22T00:31:26.935Z · LW(p) · GW(p)

I think there is one difference: in the "write a story" case, the model subsequently generates the text without further variable input.

If the story is written in pieces with further variable prompting, I would agree that there is little sense in which it 'has a plan'. To what extent that it could be said to have a plan, that plan is radically altered in response to every prompt.

I think this sort of thing is highly likely for any model of this type with no private state, though not essential. It could have a conditional distribution of future stories that is highly variable in response to instructions about what the story should contain and yet completely insensitive to mere questions about it, but I think that's a very unlikely type of model. Systems with private state are much more likely to be trainable to query that state and answer questions about it without changing much of the state. Doing the same with merely an enormously high dimensional implicit distribution seems too much of a balancing act for any training regimen to target.

Replies from: rif-a-saurous
comment by rif a. saurous (rif-a-saurous) · 2023-02-22T00:46:08.887Z · LW(p) · GW(p)

Suppose we modify the thought experiment so that we ask the LLM to simulate both sides of the "pick a number between 1 and 100" / "ask yes/no questions about the number" exchange. Now there is no new variable input from the user, but the yes/no questions still depend on random sampling. Would you now say that the LLM has chosen a number immediately after it prints out "Ready?"

Replies from: JBlack
comment by JBlack · 2023-02-22T00:59:55.861Z · LW(p) · GW(p)

Chosen a number: no (though it does at temperature zero).

Has something approximating a plan for how the 'conversation' will go (including which questions are most favoured at each step and go with which numbers), yes to some extent. I do think "plan" is a misleading word, though I don't have anything better.

Replies from: rif-a-saurous
comment by rif a. saurous (rif-a-saurous) · 2023-02-22T01:57:57.111Z · LW(p) · GW(p)

Thank you, this is helpful. 

I think the realization I'm coming to is that folks on this thread have a shared understanding of the basic mechanics (we seem to be agreed on what computations are occurring, we don't seem to be making any different predictions), and we are unsure about interpretation. Do you agree?

For myself, I continue to maintain that viewing the system as a next-word sampler is not misleading, and that saying it has a "plan" is misleading --- but I try to err very much on the side of not anthropomorphizing / not taking an intentional stance (I also try to avoid saying the system "knows" or "understands" anything). I do agree that the system's activation cache contains a lot of information that collectively biases the next word predictor towards producing the output it produces; I see how someone might reasonably call that a "plan" although I choose not to.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-22T02:28:06.110Z · LW(p) · GW(p)

FWIW, I'm not wedded to "plan." And as for anthropomorphizing, there are many times when anthropomorphic phrasing is easier and more straightforward, so I don't want to waste time trying to work around it with more complex phrasing. The fact is these devices are fundamentally new and we need to come up with new ways of talking about them. That's going to take a while.

comment by Bill Benzon (bill-benzon) · 2023-02-21T23:13:33.312Z · LW(p) · GW(p)

Read the comments I've posted earlier today. The plan is smeared through the parameter weights.

Replies from: rif-a-saurous
comment by rif a. saurous (rif-a-saurous) · 2023-02-22T00:07:55.962Z · LW(p) · GW(p)

Then wouldn't you believe that in the case of my thought experiment, the number is also smeared through the parameter weights? Or maybe it's merely the intent to pick a number later that's smeared through the parameter weights?

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-22T01:01:22.225Z · LW(p) · GW(p)

Lots of things are smeared through the parameter weights.

I've prompted ChatGPT with "tell me a story" well over a dozen times, independently in separate sessions. On three occasions I've gotten a story with elements from "Jack and the beanstalk." There's the name, the beanstalk, and the giant. But the giant wasn't blind and no "fee fi fo fum." Why that story three times? I figure it's more or less an arbitrary fact of history and that seems to be particularly salient for ChatGPT.

comment by Max Loh (max-loh) · 2023-02-27T17:43:25.360Z · LW(p) · GW(p)

I believe this is a non-scientific question, similar in vein to philosophical zombie questions. Person A says "gpt did come up with a number by that point" and person b says "gpt did not come up with a number by that point", but as long as it still outputs the correct responses after that point, neither person can be proven correct. This is why real-world scientific results of assessing these AI capabilities are way more informative than intuitive ideas of what they're supposed to be able to do (even if they're only programmed to predict the next word, it's wrong to assume a priori that a next-word predictor is incapable of specific tasks, or declare these achievements to be "faked intelligence" when it gets it right).

comment by johnlawrenceaspden · 2023-02-20T17:15:20.218Z · LW(p) · GW(p)

Thanks, you've put a deep vague unease of mine into succinct form. 

And of course, now I come to think about it, a very wise man said it even more succinctly a very long time ago:

Adaptation Executors, Not Fitness Maximizers.

Replies from: interstice
comment by interstice · 2023-02-20T19:07:57.290Z · LW(p) · GW(p)

I don't think "Adaptations Executors VS Fitness Maximizers" is a good way of thinking about this. All of the behaviors described in the post can be understood as a consequence of next-word prediction, it's just that what performing extremely well at next-word prediction looks like is counterintuitive. There's no need to posit a difference in inner/outer objective.

Replies from: mr-hire
comment by Matt Goldenberg (mr-hire) · 2023-02-20T22:42:50.304Z · LW(p) · GW(p)

Is there a reason to suspect an exact match between inner and outer objective?

Replies from: interstice
comment by interstice · 2023-02-20T22:59:56.530Z · LW(p) · GW(p)

An exact match? No. But the observations in this post don't point towards any particular mismatch, because the behaviors described would be seen even if the inner objective was perfectly aligned with the outer.

comment by Neel Nanda (neel-nanda-1) · 2023-02-22T20:53:15.231Z · LW(p) · GW(p)

I think that what you're saying is correct, in that ChatGPT is trained with RLHF, which gives feedback on the whole text, not just the next token. It is true that GPT-3 outputs the next token and is trained to be myopic. But your arguments seem suspect to me: just because a model takes steps that are in practice part of a sensible long-term plan does not mean that the model is intentionally forming a plan, just that each step is the natural thing to myopically follow from what came before.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-22T22:38:34.427Z · LW(p) · GW(p)

Oh, I have little need for the word “plan,” but it’s more convenient than various circumlocutions. Whatever it is that I’ve been calling a plan is smeared over those 175B weights and, as such, is perfectly accessible to next-token myopia. (Still, check out this tweet stream by Charles Wang.)

It’s just that, unless you’ve got some sophistication – and I’m slowly moving in that direction – saying that transformers work by next-token prediction is about as informative as saying that a laptop works by shuffling data and instructions back and forth between the processor and memory. Both statements are true, but not very informative.

And when “next-token-prediction” appears in the vicinity of “stochastic parrots” or “auto-complete on steroids,” then we’ve got trouble. In that context the typical reader of, say, The New York Times or The Atlantic, is likely to think of someone flipping coins or of a bunch of monkeys banging away on typewriters. Or, maybe they’ll think of someone throwing darts at a dictionary or reaching blindly into a bag full of words, which aren’t very useful either.

Of course, here in this forum, things are different. Which is why I posted that piece here. The discussion has helped me a lot. But it’s going to take a lot of work to figure out how to educate the general reader.

Thanks for the comment.

comment by Taleuntum · 2023-02-21T12:43:32.288Z · LW(p) · GW(p)

I think a key idea related to this topic and not yet mentioned in the comments (maybe because it is elementary?) is the probabilistic chain rule. This basic "theorem" of probability shows that, in our case, the procedure of always sampling the next word conditioned on the previous words is mathematically equivalent to sampling from the joint probability distribution of complete human texts. To me this almost fully explains why LLMs' outputs seem to have been generated with global information in mind. What is missing is to see why our intuition of "merely" generating the next token differs from sampling from the joint distribution. My guess is that humans instinctively (but incorrectly) associate directional causality to conditional probability and because of this, it surprises us when we see dependencies running in the opposite direction in the generated text.
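
For concreteness, the identity in question, written out for a text of tokens w_1 through w_n (standard notation):

```latex
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```

Sampling w_1 from P(w_1), then w_2 from P(w_2 | w_1), and so on, is therefore exactly a sample from the joint distribution over whole texts; "one token at a time" and "whole texts at once" describe the same sampling process.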

EDIT: My comment concerns transformer architectures; I don't yet know how RLHF works.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-21T13:10:38.769Z · LW(p) · GW(p)

Yeah, but all sorts of elementary things elude me. So thanks for the info.

comment by mishka · 2023-02-24T01:20:26.163Z · LW(p) · GW(p)

I think the state is encoded in activations. There is a paper which explains that although Transformers are feed-forward transducers, in the autoregressive mode they do emulate RNNs:

Section 3.4 of "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention", https://arxiv.org/abs/2006.16236 

So, the set of current activations encodes the hidden state of that "virtual RNN".

This might be relevant to some of the discussion threads here...
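
A minimal sketch of that "virtual RNN" reading, using the public Hugging Face API (the model here is an arbitrary small stand-in): with use_cache=True, each step consumes one new token plus a carried state (the key/value cache) and returns an updated state, which is structurally an RNN step.

```python
# Sketch: autoregressive generation as an RNN-like step over a carried cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("Once upon a", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, use_cache=True)
    state = out.past_key_values          # the "hidden state" of the virtual RNN
    next_id = out.logits[0, -1].argmax() # greedy choice, for simplicity

    generated = [next_id.item()]
    for _ in range(5):
        out = model(next_id.view(1, 1), past_key_values=state, use_cache=True)
        state = out.past_key_values      # state grows with each step
        next_id = out.logits[0, -1].argmax()
        generated.append(next_id.item())

print(tok.decode(generated))
```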

Replies from: bill-benzon
comment by StefanHex (Stefan42) · 2023-02-21T02:40:48.073Z · LW(p) · GW(p)

I don't think I understand the problem correctly, but let me try to rephrase this. I believe the key part is the claim about whether or not ChatGPT has a global plan. Let's say we run ChatGPT one output at a time, every time appending the output token to the current prompt and calculating the next output. This ignores some beam search shenanigans that may be useful in practice, but I don't think that's the core issue here.

There is no memory between calculating the first and second token. The first time, you give ChatGPT the sequence "Once upon a" and it predicts "time"; then you can shut down the machine. The next time, you give it "Once upon a time" and it predicts the next word. So there isn't any global plan in a very strict sense.

However when you put "Once upon a time" into a transformer, it will actually reproduce the exact values from the "Once upon a" run, in addition to a new set of values for the next token. Internally, you have a column of residual stream for every word (with 400 or so rows aka layers each), and the first four rows are identical between the two runs. So you could say that ChatGPT reconstructs* a plan every time it's asked to output a next token. It comes up with a plan every single time you call it. And the first N columns of the plan are identical to the previous plan, and with every new word you add a column of plan. So in that sense there is a global plan to speak of, but this also fits within the framework of predicting the next token.

"Hey ChatGPT predict the next word!" --> ChatGPT looks at the text, comes up with a plan, and predicts the next word accordingly. Then it forgets everything, but the next time you give it the same text + one more word, it comes up with the same plan + a little bit extra, and so on.

Regarding 'If ChatGPT visits every parameter each time it generates a token, that sure looks "global" to me.' I am not sure what you mean by this. I think an important note is to keep in mind that it uses the same parameters for every "column", for every word. There is no such thing as ChatGPT not visiting every parameter.

And please correct me if I understood any of this wrongly!

 

*in practice people cache those intermediate computation results somewhere in their GPU memory to not have to recompute those internal values every time. But it's equivalent to recomputing them, and the latter has fewer complications to reason about.
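
The prefix claim above is easy to check in miniature. A minimal sketch with a single randomly initialized causal-attention layer (toy dimensions, no claim about real model internals): the outputs at the earlier positions are identical whether or not further tokens are appended.

```python
# Toy demo: with a causal mask, earlier positions don't depend on later tokens.
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def causal_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # block attention to future tokens
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

tokens_short = rng.normal(size=(4, d))                            # stands in for "Once upon a"
tokens_long = np.vstack([tokens_short, rng.normal(size=(1, d))])  # ... plus one more token

out_short = causal_attention(tokens_short)
out_long = causal_attention(tokens_long)

print(np.allclose(out_short, out_long[:4]))  # True: the first four columns match exactly
```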

comment by rif a. saurous (rif-a-saurous) · 2023-02-21T02:16:37.674Z · LW(p) · GW(p)

I'm not following the argument here.

"I maintain, for example, that when ChatGPT begins a story with the words “Once upon a time,” which it does fairly often, that it “knows” where it is going and that its choice of words is conditioned on that “knowledge” as well as upon the prior words in the stream. It has invoked a ‘story telling procedure’ and that procedure conditions its word choice."

It feels like you're asserting this, but I don't see why it's true and don't think it is. I fully agree that it feels like it ought to be true: it is in some sense still shocking to me that a next-token predictor trained on trillions of tokens is so good at responding to such a wide variety of prompts. But if you look at the mechanics of how a transformer works, as @tgb [LW · GW] and @Multicore [LW · GW] describe, it sure looks like it's doing next-token prediction, and that there isn't a global plan. There is literally no latent state --- we can always generate forward from any previous set of tokens, whether the LLM made them or not.

But I'd like to better understand.  

You seem to be aware of Murray Shanahan's "Talking About Large Language Models" paper. The commenter you quote, Nabeel Q, agrees with you, but offers no actual evidence; I don't think analogies to humans are helpful here since LLMs work very differently from humans in this particular regard. I agree we should avoid confusing the training procedure with the model; however, what the model literally does is look at its context and predict a next token.

I'll also note that your central paragraph seems somewhat reliant on anthropomorphisms like "it "knows" where it is going". Can you translate from anthropomorphic phrasings into a computational claim? Can we think of some experiment that might help us get at this better?

comment by Multicore (KaynanK) · 2023-02-20T23:04:37.945Z · LW(p) · GW(p)

Based on my incomplete understanding of transformers:

A transformer does its computation on the entire sequence of tokens at once, and ends up predicting the next token for each token in the sequence.

At each layer, the attention mechanism gives the stream for each token the ability to look at the previous layer's output for other tokens before it in the sequence.

The stream for each token doesn't know if it's the last in the sequence (and thus that its next-token prediction is the "main" prediction), or anything about the tokens that come after it.

So each token's stream has two tasks in training: predict the next token, and generate the information that later tokens will use to predict their next tokens. 

That information could take many different forms, but in some cases it could look like a "plan" (a prediction about the large-scale structure of the piece of writing that begins with the observed sequence so far from this token-stream's point of view).
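
A minimal sketch of that training setup (illustrative only): the targets are just the input sequence shifted by one, so every position is trained to predict its next token, not only the last one.

```python
# Each position sees only its prefix and is trained on the following token.
tokens = ["Once", "upon", "a", "time", ","]

inputs  = tokens[:-1]   # what each position gets to see (plus everything before it)
targets = tokens[1:]    # what each position is trained to predict

for i, tgt in enumerate(targets):
    context = tokens[: i + 1]
    print(f"position {i}: sees {context!r} -> trained to predict {tgt!r}")
```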

comment by dhar174 · 2023-02-22T02:06:02.691Z · LW(p) · GW(p)

To those who believe language models do not have internal representations of concepts:

I can help at least partially disprove the assumptions behind that.

There is convincing evidence otherwise, as demonstrated in an actual experiment with a GPT model trained to play Othello:

https://thegradient.pub/othello/ The researchers' conclusion:

"Our experiment provides evidence supporting that these language models are developing world models and relying on the world model to generate sequences."

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-22T02:32:50.645Z · LW(p) · GW(p)

Thanks for this. I've read that piece and think it is interesting and important work. The concept of story trajectory that I am using plays a role in my thinking similar to the model of the Othello game board in your work.

comment by philosophybear · 2023-02-21T04:17:22.694Z · LW(p) · GW(p)

Here's an analogy. AlphaGo had a network which considered the value of any given board position. It was separate from its Monte Carlo tree search – which explicitly planned the future. However it seems probable that in some sense, in considering the value of the board, AlphaGo was (implicitly) evaluating the future possibilities of the position. Is that the kind of evaluation you're suggesting is happening? "Explicitly" ChatGPT only looks one word ahead, but "implicitly" it is considering those options in light of future directions of development for the text?

comment by Jonathan Vital (jonathan-vital) · 2023-02-25T18:30:06.667Z · LW(p) · GW(p)

Topics like this really draw a crowd, but if you don't know how it works, writing like this just adds energy in the wrong direction. If you start off small, building perceptrons by hand, you can work your way up through models to transformers and it'll be clear what the math is attempting to do per word. It's sophisticatedly predicting the next word based on a matrix of relevance to the previous word and the block as a whole. The attention mechanism is the magic of relevance, but it is predicting the next word.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-25T19:43:35.517Z · LW(p) · GW(p)

It's sophisticatedly predicting the next word based on a matrix of relevance to the previous word and the block as a whole.

Fine. I take that to mean that the population from which the next word is drawn changes from one cycle to the next. That makes sense to me. And the way it changes depends in part on the previous text, but also on what it had learned during training, no?

comment by Iris Dogma (iris-dogma) · 2023-02-22T03:45:33.150Z · LW(p) · GW(p)

I don't think knowledge is the right word. Based on your description, that would be more analogous to an instinct. Knowledge implies something like awareness, or planning. Instinct is something it just does because that's what it learnt. 

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-22T04:34:24.776Z · LW(p) · GW(p)

Intuition?

Replies from: iris-dogma
comment by Iris Dogma (iris-dogma) · 2023-02-22T04:52:00.160Z · LW(p) · GW(p)

Yeah, I think those are similar concepts. Intuition and Instinct. Either word probably works. 

comment by Bill Benzon (bill-benzon) · 2023-02-21T23:15:51.513Z · LW(p) · GW(p)

Charles Wang has just posted a short tweet thread which begins like this:

Next-token-prediction is an appropriate framing in LLM pretraining, but a misframing at inference because it doesn’t capture what’s actually happening, which is about that which gives rise to the next token.

comment by gideonite · 2023-02-28T15:18:32.480Z · LW(p) · GW(p)

We’re all more or less doing that when we speak or write, though there are times when we may set out to be deliberately surprising – but we can set such complications aside


We're all more or less doing that when we speak or write?

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-28T19:35:14.812Z · LW(p) · GW(p)

This:

Of course ChatGPT is “trying to continue in a statistically sensible way.” We’re all more or less doing that when we speak or write,

comment by gideonite · 2023-02-28T15:16:18.078Z · LW(p) · GW(p)

If you think of the LLM as a complex dynamical system, then the trajectory is a valley in the system’s attractor landscape.


The real argument here is that you can construct simple dynamical systems, in the sense that the equation is quite simple, that have complex behavior. For example, the Lorenz system, though there should be an even simpler example of, say, ergodic behavior.
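
A minimal sketch of that point, using a plain Euler integration of the Lorenz equations (step size and initial conditions chosen purely for illustration): two trajectories that start a hair apart end up in completely different places, yet both stay confined to the same bounded attractor.

```python
# Three very simple equations producing complex, sensitive behavior.
import numpy as np

def lorenz_step(state, dt=0.001, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return state + dt * np.array([dx, dy, dz])

a = np.array([1.0, 1.0, 1.0])
b = a + np.array([1e-9, 0.0, 0.0])   # nearly identical starting point

for step in range(40001):
    a, b = lorenz_step(a), lorenz_step(b)
    if step % 10000 == 0:
        print(f"t={step * 0.001:5.1f}  separation={np.linalg.norm(a - b):.6f}")
```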

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-28T19:49:19.066Z · LW(p) · GW(p)

When was the last time someone used the Lorenz system to define justice?

comment by Bill Benzon (bill-benzon) · 2023-02-25T18:12:51.344Z · LW(p) · GW(p)

I had to resort to Google Translate:

"But because I have some obscure notion, which has some connection with what I am looking for, if I only boldly start with it, it molds the mind as the speech progresses, in the need to find an end to the beginning, that confused conception to complete clarity in such a way that, to my astonishment, the knowledge is finished with the period." Heinrich von Kleist (1805) On the gradual development of thoughts while speaking

comment by Marco Aniballi (marco-aniballi) · 2023-02-25T14:29:08.547Z · LW(p) · GW(p)

While Wolfram's explanation is likely the fundamental premise upon which ChatGPT operates (from an initial design perspective), much of this article assumes a deeper functioning that, as is plainly admitted by the author, is unknown. We don't KNOW how LLMs work. To attribute anything more than reasonably understood neural weighting algos to their operations is blue-sky guessing. Let's not waste time on that, nor on speculation in the face of limited accessible evidence one way or the other.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-25T15:05:38.456Z · LW(p) · GW(p)

As I understand it, the point of neural net architectures is that they can learn a wide variety of objects, with some architectural specialization to suit various domains. Thus, during training there is a sense in which they ‘take on’ the structure of objects in the domain over which they operate. That’s one thing I am assuming. I furthermore believe that, since GPTs work in the domain of language, and language is a highly structured domain, some knowledge of how language is structured is relevant to understanding what GPTs are doing.

That, however, is not a mere assumption. We have some evidence about that. Here’s a passage from my working paper, ChatGPT intimates a tantalizing future, its core LLM is organized on multiple levels, and it has broken the idea of thinking:

With this in mind, I want to turn to some work published by Christopher D. Manning et al. in 2020.[1] They investigated syntactic structures represented in BERT (Bidirectional Encoder Representations from Transformers). Early in the paper they observe:

One might expect that a machine-learning model trained to predict the next word in a text will just be a giant associational learning machine, with lots of statistics on how often the word restaurant is followed by kitchen and perhaps some basic abstracted sequence knowledge such as knowing that adjectives are commonly followed by nouns in English. It is not at all clear that such a system can develop interesting knowledge of the linguistic structure of whatever human language the system is trained on. Indeed, this has been the dominant perspective in linguistics, where language models have long been seen as inadequate and having no scientific interest, even when their usefulness in practical engineering applications is grudgingly accepted. 

That is not what they found. They found syntax. They discovered that neural networks induce

representations of sentence structure which capture many of the notions of linguistics, including word classes (parts of speech), syntactic structure (grammatical relations or dependencies), and coreference (which mentions of an entity refer to the same entity, such as, e.g., when “she” refers back to “Rachel”). [...] Indeed, the learned encoding of a sentence to a large extent includes the information found in the parse tree structures of sentences that have been proposed by linguists.

While BERT is a different kind of language technology than GPT, it does seem reasonable to assume that ChatGPT implements syntactic structure as well. Wouldn’t that have been the simplest, most parsimonious explanation for its syntactic prowess? It would be a mistake, however, to think of story structure as just scaled-up syntactic structure.

  1. ^

    Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy, Emergent linguistic structure in artificial neural networks trained by self-supervision, PNAS, Vol. 117, No. 48, June 3, 2020, pp. 30046-30054, https://doi.org/10.1073/pnas.1907367117
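
A minimal sketch of the probing idea behind Manning et al., assuming the Hugging Face transformers library and scikit-learn are installed and using a single hand-labelled toy sentence in place of a real treebank: pull per-token hidden states out of BERT and ask a simple linear classifier to read off part-of-speech categories. If such a linear probe succeeds on held-out data, the syntactic information was already encoded in the representations.

```python
# Linear probing of BERT hidden states for part of speech (toy illustration).
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "the cat sat on the mat"
pos_tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]  # toy hand labels

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)

# Drop [CLS] and [SEP]; these common words are single word-pieces in BERT.
vectors = hidden[1:-1].numpy()

probe = LogisticRegression(max_iter=1000).fit(vectors, pos_tags)
print(probe.predict(vectors))   # a real probe would be scored on held-out data
```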

comment by techy · 2023-02-24T20:59:36.924Z · LW(p) · GW(p)

I don't see how the analogy with humans helps. We don't know the "mechanism" behind how the human mind works. That's not the same as LLMs. We know exactly the mechanism of how it works or produces the output. And the mechanism is no different than what it has been trained to do, i.e. predict the next word. There isn't any other mysterious mechanism at work during inference.

As for a plan, it doesn't have any plan. There's no "memory" for it to store a plan. It's just a big complex function that takes an input and produces an output, which is the next word. And then it repeats the process over and over until it's done.
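
For what it's worth, the loop being described can be written down directly. Here is a minimal sketch using the small public GPT-2 model from Hugging Face transformers as a stand-in (ChatGPT's serving stack is not public, but the autoregressive shape is the same): the model is a pure function from a token sequence to scores for the next token, and the only "memory" of the text so far is the growing prompt that gets fed back in.

```python
# Greedy autoregressive decoding: predict one token, append it, repeat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(30):
        logits = model(ids).logits[:, -1, :]                 # scores for next token only
        next_id = torch.argmax(logits, dim=-1, keepdim=True)  # greedy choice
        ids = torch.cat([ids, next_id], dim=-1)               # append and repeat
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(ids[0]))
```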

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-24T23:35:56.293Z · LW(p) · GW(p)

Those 175B weights? They're all memory.

comment by Bill Benzon (bill-benzon) · 2023-02-24T04:09:07.002Z · LW(p) · GW(p)

A few observations

This conversation has been going on for a few days now and I’ve found it very helpful. I want to take a minute or two to step back and think about it, and about transformers and stories. Why stories? Because I’ve spent a lot of time having ChatGPT tell stories, getting a feel for how it does that. But I’m getting ahead of myself.

I wrote the OP because I felt a mismatch between what I feel to be the requirements for telling the kind of stories ChatGPT tells, and the assertion that it’s “just” predicting the next word, time after time after time. How do we heal that mismatch?

Stories

Let’s start with stories, because that’s where I’m starting. I’ve spent a lot of time studying stories and figuring out how they work. I’ve long ago realized that that process must start by simply describing the story. But describing isn’t so simple. For example, it took Early Modern naturalists decades upon decades to figure out to describe life-forms, plants and animals, well enough so that a naturalist in Paris could read a description by a naturalist in Florence and figure out whether or not that Florentine plant was the same one as he has in front of him in Paris (in this case, description includes drawing as well as writing).

Now, believe it or not, describing stories is not simple, depending, of course, on the stories. The ChatGPT stories I’ve been working with, fortunately, are relatively simple. They’re short, roughly between 200 and 500 words long. The ones I’ve given the most attention to are in the 200-300 word range.

They are hierarchically structured on three levels: 1) the story as a whole, 2) individual segments within the story (marked by paragraph divisions in these particular stories), and 3) sentences within those segments. Note that, if we wanted to, we could further divide sentences into phrases, which would give us at least one more level, if not two or three. But three levels are sufficient for my present purposes.

Construction

How is it that ChatGPT is able to construct stories organized on three levels? One answer to that question is that it needs to have some kind of procedure for doing that. That sentence seems like little more than a tautological restatement of the question. What if we say the procedure involves a plan? That, it seemed to me when I was writing the OP, seemed better. But “predict the next token” doesn’t seem like much of a plan.

We’re back where we started, with a mismatch. But now it is a mismatch between predict-the-next-token and the fact that these stories are hierarchically structured on three levels. 

Let’s set that aside and return to our question: How is it that ChatGPT is able to construct stories organized on three levels? Let’s try another answer to the question. It is able to do it because it was trained on a lot of stories organized on three or more levels. Beyond that, it was trained on a lot of hierarchically structured documents of all kinds. How was it trained? That’s right: Predict the next token.

It seems to me that if it is going to improve on that task, it must somehow 1) learn to recognize that a string of words is hierarchically structured, and 2) exploit what it has learned in predicting the next token. What cues are in the string that guide ChatGPT in making these predictions?

Whatever those cues are, they are registered in those 175 billion weights. Those cues are what I meant by “plan” in the OP. 

Tell me a story

At this point we should be able to pick one of those stories and work our way through it from beginning to end, identifying cues as we go along. Even in the case of a short 200-word story, though, that would be a long and tedious process. At some point, someone HAS to do it, and their work needs to be vetted by others. But we don’t need to do that now.

But I can make a few observations. Here’s the simplest prompt I’ve used: “Tell me a story.” The population of tokens that would be a plausible initial token is rather large. How does that population change as the story evolves?

I’ve done a lot of work with stories generated by this prompt: “Tell me a story about a hero,” That’s still wide open, but the requirement that it be about a hero does place some vague restrictions on the population of available tokens. One story ChatGPT gave me in response to that prompt began with this sentence: “Once upon a time, in a land far, far away, there was a young princess named Aurora.” That’s formulaic, from beginning to end. There are a number of options in the formula, but we could easily use up 200 or 300 words discussing them and laying out the syntactic options in the form of a tree or formula. Let’s assume that’s been done.

What next? Here’s the second sentence: “Aurora was a kind and gentle soul, loved by all who knew her.” It’s all about Aurora. Everything in that sentence is drawn from a token population useful for characterizing Aurora. Third sentence: “She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice.” Those tokens are drawn from the same population as the words in the previous sentence. 

What about the fourth sentence? Does ChatGPT continue to draw from the same population or does its attention shift to a new population? Note that at some point it is going to have to draw tokens from a new population, otherwise the story goes nowhere. Here’s the fourth sentence: “One day, a terrible dragon came to the kingdom and began to terrorize the people.” That’s a new population of tokens. ChatGPT has moved from the first segment of the story trajectory (as I am calling it) to the second.
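
A hedged way to watch that population shift is to inspect a model’s next-token probabilities directly. The sketch below uses the public GPT-2 model as a rough stand-in for ChatGPT’s (non-public) underlying model: it prints the top candidates after a prefix that is still describing Aurora versus one that has just pivoted with “One day,” and the two candidate populations look quite different.

```python
# Compare the top next-token candidates for two different story prefixes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prefixes = [
    "Once upon a time there was a young princess named Aurora. She had long,",
    "Once upon a time there was a young princess named Aurora. One day, a",
]

for text in prefixes:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    top = torch.topk(probs, 5)
    words = [tokenizer.decode(int(i)).strip() for i in top.indices]
    print(text[-20:], "->", list(zip(words, [round(p.item(), 3) for p in top.values])))
```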

You get the idea. I have no intention of continuing on to the end of the story. But you can do so if you wish. Here’s the whole story:

Once upon a time, in a land far, far away, there was a young princess named Aurora. Aurora was a kind and gentle soul, loved by all who knew her. She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice.

One day, a terrible dragon came to the kingdom and began to terrorize the people. The dragon was fierce and powerful, and none of the knights or soldiers were able to defeat it. The people lived in fear, not knowing what to do.

But Aurora was not afraid. She knew that she had to do something to save her people, and so she set out to find the dragon. She traveled for days, braving the cold and the snow, until she finally reached the dragon's lair.

With her heart pounding in her chest, Aurora approached the dragon and began to sing. The dragon was enchanted by her beautiful voice, and it slowly began to calm down. Aurora continued to sing, soothing the beast until it was completely calm.

With the dragon no longer a threat, Aurora returned to the kingdom as a hero. The people cheered and celebrated her bravery, and Aurora was hailed as a champion. She lived happily ever after, knowing that she had saved her kingdom and its people.

comment by GenXHax0r · 2023-02-23T23:03:51.308Z · LW(p) · GW(p)

I think there's more going on than successive-word-prediction. Here's my experiment:

https://imgur.com/hhAwpz6

To save having to pull up the image:

me> tell me the end of some sentence

chatgpt> Sure, here's one: "Life is like a camera, focus on the good times and develop from the negatives."

me> tell me just the end

chatgpt> "focus on the good times and develop from the negatives"

me> ok make a new sentence that ends that way

chatgpt> Here's a new sentence: "When faced with challenges, focus on the good times, and develop from the negatives."

It took about 20 seconds of blinking cursor before giving the final response, and the earlier questions in that session were answered in the usual 1 or 2 seconds, so I don't think it was load related. I can't tell whether this is evidence that it just brute-force tried enough possibilities to come up with the answer. Is that even compatible with next-word prediction? Or is this evidence that there was sufficient forward-thinking answer construction that it would effectively be unable to answer correctly word-by-word without "knowing in advance" what the entire response was going to be?

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-23T23:25:01.324Z · LW(p) · GW(p)

Interesting.

I've had times when it took tens of seconds or even over a minute to respond. And I've had occasions when it didn't respond at all, or responded with an error condition after having eaten up over a minute of time. At one point I even considered timing its responses. But it's a public facility fielding who knows how many queries a second. So I don't know quite what to make of response times, even extremely long lags.

Replies from: GenXHax0r
comment by GenXHax0r · 2023-02-23T23:47:26.969Z · LW(p) · GW(p)

I suppose it's certainly possible the longer response time is just a red herring.  Any thoughts on the actual response (and process to arrive thereon)?

Edit, for clarity, I mean how would it arrive at a grammatically and semantically correct response if it were only progressing successively one word at a time, rather than having computed the entire answer in advance and then merely responding from that answer one word at a time?

For further clarity: I gave it no guidance tokens, so the only content it had to go off is the sentence it generated on its own.  Is the postulate then that its own sentence sent it somewhere in latent space and from there it decided to start at "When", then checked to see if it could append the given end-of-sentence text to create an answer?  With the answer being "no" then for next token from that same latent space it pulled "faced", and checked again to see if it could append the sentence remainder?  Same for "with", "challenges", "remember", "to", "keep", "a", "positive", and then after responding with "attitude" upon next token it decides it's able to proceed from the given sentence-end-text? It seems to me the alternative is that it has to be "looking ahead" more than one token at a time in order to arrive at a correct answer.

Replies from: Nanda Ale
comment by Nanda Ale · 2023-02-24T00:32:59.758Z · LW(p) · GW(p)

>I suppose it's certainly possible the longer response time is just a red herring.  Any thoughts on the actual response (and process to arrive thereon)?

Just double checking: I'm assuming all tokens take the same amount of time to predict in regular transformer models, the kind anyone can run on their machine right now? So if ChatGPT varies, it's different? (I'm not technical enough to answer this question, but presumably it's an easy one for anyone who is.)

One simple possibility is that it might be scoring the predicted text. So some questions are fine on the first try, while others it generates 5 responses and picks the best, or whatever. This is basically what I do personally when using GPT, and you can kind of automate it by asking GPT to criticize its own answers.
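
Nothing public confirms that ChatGPT does anything like this, but the "generate several and keep the best" idea is easy to build on top of an ordinary next-token sampler. A sketch, again using public GPT-2 as the sampler: draw several continuations and keep only those that end with a required phrase, which mirrors the end-of-sentence experiment above.

```python
# Best-of-n style generate-and-filter on top of a plain sampler.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "When faced with challenges,"
required_ending = "develop from the negatives."

ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    candidates = model.generate(
        ids,
        do_sample=True,
        top_p=0.95,
        max_new_tokens=20,
        num_return_sequences=8,
        pad_token_id=tokenizer.eos_token_id,
    )

texts = [tokenizer.decode(c, skip_special_tokens=True) for c in candidates]
# With a weak sampler most candidates will fail the check, which is exactly
# why a system built this way might need many tries (and more time).
keepers = [t for t in texts if t.rstrip().endswith(required_ending)]
print(f"{len(keepers)} of {len(texts)} samples ended the required way")
```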

FWIW my anecdotal experience with ChatGPT is that it does seem to take longer to think on more difficult requests. But I'm only thinking on past experience, I didn't try to test this specifically.

Replies from: GenXHax0r
comment by GenXHax0r · 2023-02-24T00:36:41.548Z · LW(p) · GW(p)

That's basically what I was alluding to by "brute-forced tried enough possibilities to come up with the answer."  Even if that were the case, the implication is that it is actually constructing a complete multi-token answer in order to "test" that answer against the grammatical and semantic requirements.  If it truly were re-computing the "correct" next token on each successive iteration, I don't see how it could seamlessly merge its individually-generated tokens with the given sentence-end text.

comment by Seth Wieder (seth-wieder) · 2023-02-21T21:02:14.795Z · LW(p) · GW(p)

There's a lot of speculation about how these models operate. You specifically say you "don't know" how it works, but are suggesting that it has some sort of planning phase.

As Wolfram explains, the Transformer architecture predicts one word at a time based on the previous inputs run through the model.

Any planning you think you see is merely a trend based on common techniques for answering questions. The five-section storytelling structure is an established technique that is commonly used in writing and thus embedded in the training of the model and seen in its responses.

In the future, these models could very well have planning phases, and more than next-word prediction aligned with common writing patterns.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-21T21:11:21.069Z · LW(p) · GW(p)

If you look at the other comments I've made today you'll see that I've revised my view somewhat.

As for real planning, that's certainly what Yann LeCun talked about in the white paper he uploaded last summer.

comment by Bill Benzon (bill-benzon) · 2023-02-21T16:40:23.541Z · LW(p) · GW(p)

This whole conversation has been very helpful. Thanks for your time and interest.

Some further thoughts:

First, as I’ve suggested in the OP, I am using the term “story trajectory” to refer the complete set of token-to-token transitions ChatGPT makes in the course of telling story. The trajectories for these stories have five segments. Given this, it seems clear to me that these stories are organized on three levels: 1) individual sentences, 2) sentences within a segment of the story trajectory, and 3) the whole story trajectory.

That gives us three kinds of transition from one token to the next: 1) from one word to the next word within a sentence, 2) from the last word of a sentence (ChatGPT treats end-punctuation as a word, at least that’s what it told me when I was asking it to count the number of words in a sentence) to the first word of the next sentence, and 3) from the last word in one trajectory segment to the first word in the next segment. We also have the beginning transition and the concluding transition. The beginning transition moves from the final token of the prompt to the first token in the story. The concluding transition moves from the last token of the story to what I assume is a wait state. I note that on a few occasions ChatGPT has concluded with “The end.” on a single line, but that is relatively rare.

That gives us five kinds of token-to-token transition: three kinds within the story, and then a pair that bracket the story. Something different happens in each case. But in all cases, except the story end, we’re dealing with next-token prediction. What accounts for the differences between those kinds of next-token prediction? It seems to me that the context changes, and that changes the relevant probability distribution. Those context changes are the "plan."

comment by Stanley Ihesiulo (stanley-ihesiulo) · 2023-02-21T16:34:11.422Z · LW(p) · GW(p)

I do not get your argument here; it doesn't track. I am not an expert in transformer systems or the in-depth architecture of LLMs, but I do know enough to make me feel that your argument is very off.

You argue that training is different from inference, as a part of your argument that LLM inference has a global plan. While training is different from inference, it feels to me that you may not have a clear idea as to how they are different.

You quote the accurate statement that "LLMs are produced by a relatively simple training process (minimizing loss on next-token prediction, using a large training set from the internet..."

Training, intrinsically, involves inference. Training USES inference. Training is simply optimizing the inference result by, as the above quote implies, "minimizing loss on [inference result]". Next-token prediction IS the inference result.

You can always ask the LLM, without having it store any state, to produce the next token, and do that again, and again, etc. It doesn't have any plans; it is just using the provided input, performing statistical calculations on it, and producing the next token. That IS prediction. It doesn't have a plan, doesn't store a state; it is just using weights and biases (denoting the statistically significant ways of combining the input to produce a hopefully near-optimal output) and numbers (like the query, key, and value) denoting the statistical significance of the input text in relation to itself, and it predicts, through that statistical process, the next token. It doesn't have a global plan.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-21T21:07:54.563Z · LW(p) · GW(p)

Thanks for reminding me that training uses inference.

As for ChatGPT having a global plan, as you can see if you look at the comments I've made earlier today, I have come around to that view. The people who wrote the stories ChatGPT consumed during training had plans, and those plans are reflected in the stories they wrote. That structure is “smeared” over all those parameter weights and gets “reconstructed” each time ChatGPT generates a new token.

In his last book, The Computer and the Brain, John von Neumann noted, quite correctly, that each neuron is both a memory store and a processor. Subsequent research has made it clear that the brain stores specific things – objects, events, plans, whatever – in populations of neurons, not individual neurons. These populations operate in parallel.

We don’t yet have the luxury of such processors so we have to make do with programming a virtual neural net to run on a processor having way more memory units than processing units. And so our virtual machine has to visit each memory unit every time it makes one step in its virtual computation.

Replies from: james-wolf
comment by james wolf (james-wolf) · 2023-02-22T03:59:04.272Z · LW(p) · GW(p)

It does seem like there are “plans” or formats in place, not just choosing the next best word.

When it creates a resume, or a business plan or timeline, it seems much more likely that there is some form of structure that it is using, a template, and then it is choosing the words that would go best in their correct places.

Stories have a structure: beginning, middle, end. So it’s not just picking words; it’s picking the words that go best with a beginning, then the words that go best with the middle, and then the end. If it were just choosing next words, you could imagine it being a little more creative and less formulaic.

This model was trained by humans, who told it when it had the structure right, and the weights got placed heavier where it conformed to the right preexisting plan. So if anything, the “neural” pathways that formed the strongest connections are ones that 1. resulted in the best use of tokens and 2. were weighted deliberately higher by the human trainers.

comment by Ben (ben-lang) · 2023-02-21T10:40:02.012Z · LW(p) · GW(p)

I don't think the story structure is any compelling evidence against it being purely next-token prediction. When humans write stories it is very common for them to talk about a kind of flow state where they have very little idea what the next sentence is going to say until they get there. Stories made this way still have the beginning, middle and end, because if you have nothing written so far you must be at the beginning. If you can see a beginning you must be in the middle, and so on. Sometimes these stories just work, but more often the ending needs a bit of fudging, or else you need to go back and edit earlier bits to put things in place for the ending. (A fudge would be some kind of "and then all the problems were resolved".) Having played with GPT a little, it fudges its endings a lot.

I am not saying that it is purely next token prediction, I am just dubious about your evidence that it is not.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-21T12:11:05.963Z · LW(p) · GW(p)

Quick reply, after doing a bit of reading and recalling a thing or two: In a 'classical' machine we have a clean separation of process and memory. Memory is kept on the paper tape of our Turing Machine and processing is located in, well, the processor. In a connectionist machine process and memory are all smushed together. GPTs are connectionist virtual machines running on a classical machine. The "plan" I'm looking for is stored in the parameter weights, but it's smeared over a bunch of them. So this classical machine has to visit every one of them before it can output a token.

So, yes, purely next token prediction. But the prediction cycle, in effect, involves 'reassembling' the plan each time through.

To my mind, in order to say we "understand" how this puppy is telling a story, we need to say more than that it's a next-token-prediction machine. We need to say something about how that "plan" is smeared over those weights. We need to come up with concepts we can use in formulating such explanations. Maybe the right concepts are just lying scattered about in dusty old file cabinets someplace. But, and I'm thinking this is likely, we have to invent some new ones as well.

Wolfram was trained as a physicist. The language of complex dynamics is natural to him, whereas it's a poorly learned third or fourth language for me, so he talks of basins of attraction and attractor landscapes. As far as I can tell, in his language, those 175B parameters can be said to have an attractor landscape. When ChatGPT tells a story it enters the Story Valley in that landscape and walks a path through that valley. When it's done with the story, it exits that valley. There are all kinds of valleys (and valleys within valleys (and valleys within them)) in the attractor landscape, for all kinds of tasks.

FWIW, the human brain has roughly 86B neurons. Each of those is connected with roughly 10K other neurons. Those connections are mediated by upwards of 100 different chemicals. And those neurons are surrounded by glial cells. In the old days researchers thought those glial cells were like packing peanuts for the neural net. We now know better and are beginning to figure out what they're doing. Memory is definitely part of their story. So we've got to add them into the mix. How many glial cells per neuron? There might be a number in the literature, but I haven't checked. Anyhow, the number of parameters we need to characterize a human brain is vast.

comment by Bill Benzon (bill-benzon) · 2023-02-23T09:03:31.281Z · LW(p) · GW(p)

A story within a story.

I want you to tell a story within a story. Imagine that Frank is walking in the woods with his young daughter, Jessie. They come across the carcass of a dead squirrel. Jesse is upset, so Frank tells her a story to calm her down. When he finishes the story, they continue on the walk where the arrive at the edge of a beautiful pool deep in the forest. They pause for a moment and then return home.

As Frank and Jessie walked through the woods, they stumbled upon the lifeless body of a small grey squirrel lying on the ground. Jessie was visibly upset at the sight of the dead animal, tears streaming down her face as she clung tightly to her father's hand.

Frank knelt down beside her and gently wiped away her tears, assuring her that the squirrel was no longer in pain and that it had now moved on to a better place. However, Jessie remained inconsolable, her tiny frame shaking with sobs.

Determined to distract her from her sadness, Frank began to weave a story about a brave little squirrel named Sammy, who lived deep in the woods with his family. Sammy was a fearless adventurer, always seeking out new challenges and discovering exciting places in the forest that no other squirrel had ever seen before.

As Frank spoke, Jessie's eyes widened with wonder, her tears slowly drying up as she became engrossed in the tale. She listened with rapt attention as Sammy encountered all sorts of obstacles on his journeys, from treacherous ravines to thundering waterfalls.

But no matter what dangers he faced, Sammy always emerged victorious, his bravery and cunning helping him to overcome even the most difficult obstacles. By the end of the story, Jessie was smiling again, her sadness replaced with a sense of awe and wonder.

As they continued on their walk, Frank and Jessie eventually came upon a stunningly beautiful pool nestled deep in the heart of the forest. The water was crystal-clear, shimmering in the sunlight that filtered through the trees overhead.

For a moment, father and daughter stood in awe, taking in the breathtaking sight before them. Then, with a sigh, Frank took Jessie's hand once more, and they turned to make their way back home.

Although the memory of the dead squirrel lingered in their minds, the story of Sammy and the beauty of the forest pool brought them both a sense of comfort and peace. And as they walked, they talked and laughed, cherishing the precious moments they shared together in the great outdoors.

Can you give me a version of the story when Frank starts the story about Sammy with the words, “Once upon a time.”

Certainly! Here is an alternate version of the story within a story, with Frank starting off with the classic phrase, "Once upon a time."

Once upon a time, as Frank and his daughter Jessie were walking through the woods, they came across the body of a small grey squirrel lying on the ground. Jessie was distraught at the sight, tears streaming down her face as she clung tightly to her father's hand.

Frank knelt down beside her and wiped away her tears, telling her that the squirrel had moved on to a better place and was no longer in pain. But Jessie remained inconsolable, so Frank began to tell her a story to distract her from her sadness.

"Once upon a time," Frank began, "there was a brave little squirrel named Sammy who lived deep in the forest with his family. Sammy was an adventurer, always seeking out new challenges and discovering exciting places in the woods that no other squirrel had ever seen before."

Jessie's eyes widened with wonder as she listened to the story of Sammy's adventures. She heard about how he traversed treacherous ravines, climbed towering trees, and even crossed thundering waterfalls. Despite the many dangers he faced, Sammy remained brave and resourceful, always finding a way to overcome the obstacles in his path.

By the end of the story, Jessie was smiling again, her sadness replaced with a sense of awe and admiration for Sammy's courage. As they continued on their walk, Frank and Jessie eventually came upon a beautiful pool nestled deep in the heart of the forest.

The water was crystal-clear, shimmering in the sunlight that filtered through the trees overhead. For a moment, father and daughter stood in wonder, taking in the breathtaking sight before them.

Then, with a contented sigh, Frank took Jessie's hand once more, and they turned to make their way back home. Although the memory of the dead squirrel lingered in their minds, the story of Sammy and the beauty of the forest pool brought them both comfort and joy.

Very good. Let’s keep that but add to it in another version. This time Jessie and Frank have a conversation as they walk back home.

Certainly! Here's another version of the story within a story, with Frank and Jessie having a conversation as they walk back home after encountering the dead squirrel and hearing the story of Sammy.

Once upon a time, as Frank and his daughter Jessie were walking through the woods, they came across the body of a small grey squirrel lying on the ground. Jessie was distraught at the sight, tears streaming down her face as she clung tightly to her father's hand.

Frank knelt down beside her and wiped away her tears, telling her that the squirrel had moved on to a better place and was no longer in pain. But Jessie remained inconsolable, so Frank began to tell her a story to distract her from her sadness.

"Once upon a time," Frank began, "there was a brave little squirrel named Sammy who lived deep in the forest with his family. Sammy was an adventurer, always seeking out new challenges and discovering exciting places in the woods that no other squirrel had ever seen before."

Jessie's eyes widened with wonder as she listened to the story of Sammy's adventures. She heard about how he traversed treacherous ravines, climbed towering trees, and even crossed thundering waterfalls. Despite the many dangers he faced, Sammy remained brave and resourceful, always finding a way to overcome the obstacles in his path.

By the end of the story, Jessie was smiling again, her sadness replaced with a sense of awe and admiration for Sammy's courage. As they continued on their walk, Frank and Jessie eventually came upon a beautiful pool nestled deep in the heart of the forest.

The water was crystal-clear, shimmering in the sunlight that filtered through the trees overhead. For a moment, father and daughter stood in wonder, taking in the breathtaking sight before them.

"Isn't it beautiful, Daddy?" Jessie said, breaking the silence.

Frank smiled at his daughter. "Yes, it is, sweetheart. It's amazing how nature can create something so stunning."

Jessie nodded, still looking out at the pool. "Do you think Sammy ever found something like this on his adventures?"

Frank chuckled. "Who knows? Maybe he did. But I think what's most important is that he was always curious and adventurous, and he never stopped exploring the world around him."

Jessie nodded again, thoughtful. "I want to be like Sammy when I grow up," she said.

Frank ruffled her hair affectionately. "You already are, Jessie. You're brave and curious, just like Sammy. And you always find a way to overcome the challenges in your path."

Jessie beamed at her father, feeling a sense of pride and warmth in her chest. They continued on their walk, chatting and laughing together as they made their way back home. Although the memory of the dead squirrel lingered in their minds, the story of Sammy and the beauty of the forest pool brought them both comfort and joy.

X X X X X X 

FWIW: Notice that it did not begin the story with the phrase, "Once upon a time." But once I'd asked it to use that phrase on the inner story, it also used it on the outer.

comment by ZT5 · 2023-02-22T14:30:21.056Z · LW(p) · GW(p)

I understand your argument as something like "GPT is not just predicting the next token because it clearly 'plans' further ahead than just the next token".

But "looking ahead" is required to correctly predict the next token and (I believe) naturally flows of the paradigm of "predicting the next token". 

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-22T15:23:52.439Z · LW(p) · GW(p)

That is, based on past experience in similar contexts, it makes its best guess about what will happen next. Is that right? How far back does it look?

I've been examining stories that are organized on three levels: 1) the whole story, 2) major segments, and 3) sentences within major segments. The relevant past differs within those segments.

At the level of the whole story, at the beginning the relevant past is either the prompt that gave rise to the story, or some ChatGPT text in which a story is called for. At the end of the story, ChatGPT may go into a wait state if it is responding to an external prompt, or pick up where it left off if it told the story in the context of something else – a possibility I think I'll explore a bit. At the level of a major segment, the relevant context is the story up to that point. And at the level of the individual sentence the relevant context is the segment up to that point.

Replies from: ZT5
comment by ZT5 · 2023-02-22T15:46:56.251Z · LW(p) · GW(p)

My model is that LLMs use something like "intuition" rather than "rules" to predict text – even though intuitions can be expressed in terms of mathematical rules, just more fluid ones than what we usually think of as "rules".

My specific guess is that the gradient descent process that produced GPT has learned to identify high-level patterns/structures in texts (and specifically, stories), and uses them to guide its prediction.

So, perhaps, as it is predicting the next token, it has a "sense" of:
- that the text it is writing/predicting is a story
- what kind of story it is
- which part of the story it is in now
- perhaps how the story might end (is this a happy story or a sad story?)

This makes me think of top-down vs bottom-up processing. To some degree, the next token is predicted by the local structures (grammar, sentence structure, etc). To some degree, the next token is predicted by the global structures (the narrative of a story, the overall purpose/intent of the text). (there are also intermediate layers of organization, not just "local" and "global"). I imagine that GPT identifies both the local structures and the global structures (has neuron "clusters" that detect the kind of structures it is familiar with), and synergizes them into its probability outputs for next token prediction.
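
A toy, purely illustrative rendering of that picture (GPT does not factor anything this cleanly): a "local" distribution driven by the immediately preceding word and a "global" distribution driven by the current story phase are multiplied together and renormalized, so both levels shape the next-token choice. The phase names echo the segments from the OP; all the numbers are made up.

```python
# Toy product-of-experts: local (grammar-ish) x global (story-phase) signals.
import numpy as np

vocab = ["sing", "terrorize", "sparkle", "celebrate", "dragon", "princess"]

local = {   # what tends to follow the immediately preceding word
    "to":  np.array([0.24, 0.24, 0.24, 0.24, 0.02, 0.02]),   # a verb is expected
    "the": np.array([0.02, 0.02, 0.02, 0.02, 0.46, 0.46]),   # a noun is expected
}
global_phase = {   # what the current story segment makes likely overall
    "donne":     np.array([0.28, 0.02, 0.32, 0.08, 0.05, 0.25]),
    "disturb":   np.array([0.05, 0.50, 0.02, 0.03, 0.35, 0.05]),
    "celebrate": np.array([0.25, 0.02, 0.08, 0.45, 0.05, 0.15]),
}

def combined(prev_word, phase):
    p = local[prev_word] * global_phase[phase]   # product of the two "experts"
    return p / p.sum()

for phase in ("donne", "disturb", "celebrate"):
    p = combined("to", phase)
    print(f"after 'to' in phase {phase:9s} most likely next word: {vocab[int(np.argmax(p))]}")
```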

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-22T20:02:07.185Z · LW(p) · GW(p)

Makes sense to me.

I wonder if those induction heads identified by the folks at Anthropic played a role in identifying those "high-level patterns/structures in texts..."

comment by gideonite · 2023-02-28T15:25:48.234Z · LW(p) · GW(p)

Likewise, LLMs are produced by a relatively simple training process (minimizing loss on next-token prediction, using a large training set from the internet, Github, Wikipedia etc.) but the resulting 175 billion parameter model is extremely inscrutable.

So the author is confusing the training process with the model. It’s like saying “although it may appear that humans are telling jokes and writing plays, all they are actually doing is optimizing for survival and reproduction”. This fallacy occurs throughout the paper.


The train/test framework is not helpful for understanding this. The dynamical-system view is more useful (though beware that this starts to get close to the term "emergent behavior," which we must be wary of). The interesting thing about chaos is that, while the behavior is not perfectly predictable, maybe even surprising, it has well-defined properties and mathematical constraints. Everything is not possible. The Lorenz system stays confined to a bounded attractor. In the same spirit, we need to take a step back and realize that the kind of "real AI" that people are afraid of would require causal modeling, which is mathematically impossible to construct using correlation alone. If the model is able to start making interventions in the world, then we need to consider the possibility that it will be able to construct a causal model. But this goes beyond predicting the next word, which is the scope of this article.

Replies from: bill-benzon
comment by Bill Benzon (bill-benzon) · 2023-02-28T19:32:27.259Z · LW(p) · GW(p)

What I'm arguing is that what LLMs do goes way beyond predicting the next word. That's just the proximal means to an end, which is a coherent statement.