Language models can generate superior text compared to their input
post by ChristianKl · 2023-01-17T10:57:10.260Z · LW · GW · 28 comments
A common misconception holds that a large language model can never achieve superhuman text-creation ability because such models try to produce text that is maximally unsurprising. This article explains why that assumption is wrong.
In 1906, Sir Francis Galton conducted an experiment at a fair, where he asked fair-goers to guess the weight of an ox in a weight-judging competition. The median of 787 guesses was 1,207 pounds, while the actual weight of the ox was 1,198 pounds. The error in any single guess is a combination of systematic bias and random noise. The fair-goers, having knowledge of oxen, had no bias in their guesses, thus the error was entirely due to random noise. By pooling the 787 guesses, Galton averaged out the random noise of each individual guess.
This phenomenon came to be known as the wisdom of the crowd. In areas where reasoning errors are mostly random noise, crowds are smarter than their individual members. By training on large data sets, large language models can access the wisdom of the crowd. The ceiling on a large language model's ability is therefore the wisdom of the crowd rather than the wisdom of its individual members.
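A small simulation can illustrate the averaging argument. This is only a sketch and does not use Galton's data: it assumes each guess is the true weight plus independent, zero-mean Gaussian noise, and the noise level of 75 pounds, the seed, and the function name `simulate_guesses` are made up for illustration.

```python
import random
import statistics

TRUE_WEIGHT = 1198   # pounds, the actual weight reported above
N_GUESSES = 787      # number of guesses in Galton's competition
NOISE_SD = 75.0      # assumed spread of individual guesses (hypothetical)


def simulate_guesses(n, true_weight, noise_sd, bias=0.0, seed=0):
    """Return n guesses: true weight + shared bias + independent noise."""
    rng = random.Random(seed)
    return [true_weight + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


guesses = simulate_guesses(N_GUESSES, TRUE_WEIGHT, NOISE_SD)

# Typical error of one individual versus the error of the pooled estimate.
mean_individual_error = statistics.mean(abs(g - TRUE_WEIGHT) for g in guesses)
crowd_error = abs(statistics.median(guesses) - TRUE_WEIGHT)

print(f"average individual error: {mean_individual_error:.1f} lb")
print(f"error of the crowd median: {crowd_error:.1f} lb")
```

Passing a nonzero `bias` shifts every guess in the same direction, and the median then stays off by roughly that amount however many guesses are pooled. Pooling removes noise, not bias.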
The fact that each word of a text is massively unsurprising given the preceding words does not imply that the text overall is massively unsurprising. Given a text, you can calculate for every word the likelihood (Ltext) that it follows the preceding words in the text. You can also calculate the likelihood (Lideal) of the most likely word that could follow the preceding text.
Lideal - Ltext is noise. For a given text you can calculate the average of this noise over all of its words. A well-trained large language model is able to produce texts with a lot less noise than the average text in its training corpus.
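The per-word calculation can be sketched in a few lines. The next-word distributions below are invented for illustration; in practice they would come from a language model, and the helper name `word_noise` is hypothetical.

```python
# Hypothetical next-word distributions for a four-word text. Each entry is
# (word actually used, probabilities a model assigns to candidate words).
steps = [
    ("the", {"the": 0.5, "a": 0.25, "my": 0.25}),
    ("cat", {"cat": 0.25, "dog": 0.25, "ox": 0.5}),
    ("sat", {"sat": 0.5, "ran": 0.25, "slept": 0.25}),
    ("down", {"down": 0.5, "there": 0.25, "still": 0.25}),
]


def word_noise(actual_word, distribution):
    """Return Lideal - Ltext for one position in the text."""
    l_text = distribution[actual_word]      # likelihood of the word used
    l_ideal = max(distribution.values())    # likelihood of the likeliest word
    return l_ideal - l_text


noises = [word_noise(word, dist) for word, dist in steps]
average_noise = sum(noises) / len(noises)
print(noises, average_noise)
```

Under this definition, a text that picks the most likely word at every position has an average noise of zero, and only the second word ("cat" instead of "ox") contributes noise in the toy example.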
For further reading, Daniel Kahneman, Olivier Sibony, and Cass Sunstein wrote Noise: A Flaw in Human Judgment, which goes into more detail on how a machine learning model can eliminate noise and thus make better decisions than the average of its training data.
28 comments
Comments sorted by top scores.
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-01-17T14:56:13.198Z · LW(p) · GW(p)
Arbitrarily good prediction of human-generated text can demand arbitrarily high superhuman intelligence.
Simple demonstration #1: Somewhere on the net, probably even in the GPT training sets, is a list of <hash, plaintext> pairs, in that order.
Simple demonstration #2: Train on only science papers up until 2010, each preceded by date and title, and then ask the model to generate starting from titles and dates in 2020.
Replies from: janus, ChristianKl, lahwran, None
↑ comment by janus · 2023-01-18T00:34:48.392Z · LW(p) · GW(p)
My reply to a similar statement Eliezer made on Twitter today:
Reversing hashes nicely drives home that the training objective is technically vastly superhuman. But such far-fetched examples might give the impression that we're not going to get any superhuman capabilities realistically/any time soon with SSL.
There are much more tractable superhuman capabilities that I expect current and near future LLMs to learn, such as having a much more universal "type of guy" prior than any individual human, modeling statistical regularities that no humans think about, stream-of-thought outputting of types of text that would require painstaking effort/revision/collaboration for humans to create. Etc.
Statistical analysis of an audio recording of someone typing at a keyboard is sufficient to reverse engineer keystrokes. Clever humans figured this out, but there are many more evidential entanglements like this lurking that no one has thought of, but will be transparent to LLMs.
The 2020 extrapolation example gets at a more realistic class of capability that even GPT-3 has to a nonzero extent, and which will scale more continuously in the current regime with practical implications.
↑ comment by ChristianKl · 2023-01-17T15:54:30.549Z · LW(p) · GW(p)
It's not clear that it's possible for a transformer model to do #2 no matter how much training went into it.
Replies from: Eliezer_Yudkowsky
↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-01-17T17:00:04.465Z · LW(p) · GW(p)
It'd take less computing power than #1.
Replies from: ChristianKl, donald-hobson
↑ comment by ChristianKl · 2023-01-17T21:08:34.112Z · LW(p) · GW(p)
Scientific papers describe facts about the real world that aren't fully determined by previous scientific papers.
Take, for example, the scientific papers describing a new species of bacteria that was unknown a decade earlier. Nothing in the training data describes it. Nor can you determine the properties of the species from first principles.
On the other hand, it might be possible to figure out an algorithm that creates texts that match given hash values.
Replies from: ingalala
↑ comment by ingalala · 2023-01-18T00:29:59.675Z · LW(p) · GW(p)
If you are intelligent enough, you can deduce the laws of the universe from a surprisingly small amount of data. In the vein of your example, there is the story of Darwin deducing the existence of a moth with a long proboscis after seeing an orchid with a particular shape, and proving to be right. Perhaps papers from pre-2010 don't have the right models, but maybe they have enough information and data for a sufficiently intelligent being to piece together from them whatever is missing?
Replies from: ChristianKl
↑ comment by ChristianKl · 2023-01-18T02:10:56.619Z · LW(p) · GW(p)
You can piece together some things, but there's a lot of randomness in our world. A lot of important science is about discovering black swans.
Replies from: ChaseDanton
↑ comment by Eric Zhang (ChaseDanton) · 2023-01-19T19:35:43.547Z · LW(p) · GW(p)
Some things is enough; you'd still get less loss if you're just right about the stuff that can be pieced together.
↑ comment by Donald Hobson (donald-hobson) · 2023-01-26T01:40:47.417Z · LW(p) · GW(p)
Sure. What isn't clear is that you get a real paper from 2020, not a piece of fiction that could have been written in 2010. (Or just a typo-filled science paper.)
↑ comment by the gears to ascension (lahwran) · 2023-01-18T06:16:38.556Z · LW(p) · GW(p)
Simple demonstration #2: Train on only science papers up until 2010, each preceded by date and title, and then ask the model to generate starting from titles and dates in 2020.
Arbitrarily superintelligent non-causally-trained models will probably still fail at this. IID breaks that kind of prediction. You'd need to train them in a way that makes causally invalid models implausible hypotheses.
But, also, if you did that, then yes, agreed.
↑ comment by [deleted] · 2023-01-17T15:45:38.940Z · LW(p) · GW(p)
These demonstrations seem like grossly over-simplified conjectures. Are they just thought experiments or actual research interests in the field?
Replies from: Eliezer_Yudkowsky
↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-01-17T17:02:01.947Z · LW(p) · GW(p)
They're folk theorems, not conjectures. The demonstration is that, in principle, you can go on reducing the losses at prediction of human-generated text by spending more and more and more intelligence, far far past the level of human intelligence or even what we think could be computed by using all the negentropy in the reachable universe. There's no realistic limit on required intelligence inherent in the training problem; any limits on the intelligence of the system come from the limitations of the trainer, not the loss being minimized as far as theoretically possible by a moderate level of intelligence. If this isn't mathematically self-evident then you have not yet understood what's being stated.
Replies from: None
↑ comment by [deleted] · 2023-01-17T17:27:03.595Z · LW(p) · GW(p)
No, I didn't understand what you said. It seemed like you reduced ML systems to a lookup table in #1. In #2, it seems like you know exactly what is used to train these systems, and that papers from before or after 2010 are somehow meaningful indicators for ML systems; I don't know where that reasoning comes from. My apologies for not being knowledgeable in this area.
Replies from: lelapin
↑ comment by Jonathan Claybrough (lelapin) · 2023-01-18T18:05:03.841Z · LW(p) · GW(p)
The two examples were (mostly) unrelated and served to demonstrate two cases where a perfect text predictor needs to do incredibly complex calculations to correctly predict text. Thus a perfect text predictor is a vast superintelligence (and we won't achieve perfect text prediction, but as we get better and better we might get closer to superintelligence).
In the first case, if the training data contains series of [hash] then [plain text], then a correct predictor must be able to retrieve the plain text from the hash (and because there are multiple plain texts with the same hash, it would have to calculate through all of them and evaluate which is most probable to appear). Thus correctly predicting text can mean being able to calculate an incredibly large number of hashes over all combinations of text of certain lengths and evaluating which is the most probable.
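To make the cost concrete, here is a minimal brute-force sketch. It assumes SHA-256 as the hash function (the thread does not name one) and searches only short lowercase strings; `sha256_hex` and `brute_force_preimage` are illustrative names, not part of any proposal above.

```python
import hashlib
import itertools
import string


def sha256_hex(text):
    """Return the SHA-256 digest of text as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def brute_force_preimage(target_hex, max_length, alphabet=string.ascii_lowercase):
    """Try every string over alphabet up to max_length; return (match, tries)."""
    tries = 0
    for length in range(1, max_length + 1):
        for chars in itertools.product(alphabet, repeat=length):
            tries += 1
            candidate = "".join(chars)
            if sha256_hex(candidate) == target_hex:
                return candidate, tries
    return None, tries


# The hash comes first in the pair; the predictor has to produce the plaintext.
target = sha256_hex("ox")
plaintext, tries = brute_force_preimage(target, max_length=3)
print(plaintext, tries)
```

With a 26-letter alphabet the number of candidates of length n is 26^n, so this approach stops being feasible after a handful of characters, whereas checking a proposed plaintext takes a single hash call.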
In the second case, the task is to predict future papers based on past papers, which is kinda obviously very hard.
↑ comment by [deleted] · 2023-01-18T18:49:08.508Z · LW(p) · GW(p)
It doesn't seem clear to me what those two demonstrations are trying to test. #1 seems like a case of overfitting. #2 seems like an extension of #1 applied to papers; I'm not sure what the papers case has to do with the generalized capabilities of ChatGPT. If you think ChatGPT is merely a complex lookup table, then I don't really know what to say. Lookup table or NLP, I don't know how either has much to do with general intelligence. Both are models that may seem intelligent if that's what the discussion is focusing on. Honestly, I don't really understand a lot of the stuff discussed on this site.
comment by Rafael Harth (sil-ver) · 2023-01-17T11:55:28.262Z · LW(p) · GW(p)
This is a really good point, but it only shows that superhuman reasoning might be possible, not that it is. Like, it's possible to the extent that the transition functions humans can produce are restricted by noise rather than bias. But it's unclear (at least to me) why bias can't be most of the story.
comment by Tomás B. (Bjartur Tómas) · 2023-01-17T16:26:35.548Z · LW(p) · GW(p)
My experience over the past few years has been one of being surprised by latent capacities in existing models. A lot of stuff like prompt engineering, fine tuning, chain of thought, Open-AI-style "alignment" can be seen as not so much creating new capacities as revealing/refining latent ones. Back when GPT-3 was new, Connor Leahy said something like "GPT-3 is already general intelligence" which sounded like hyperbole to me at the time, and seems less so now.
Though RSI still seems very plausible to me [LW · GW], one scenario I've started thinking about is a massive effective capabilities gain caused not by RSI or any non-trivial algorithmic improvement, but just the dissolution of a much larger than anticipated "latent capacities overhang".
Possibly an absurd and confused scenario, but is it that implausible that some day we will get a model that still seems kinda dumb but is in fact one prompt away from super-criticality?
Replies from: ChristianKl
↑ comment by ChristianKl · 2023-01-18T00:02:31.564Z · LW(p) · GW(p)
You don't need to change anything in the underlying machine learning algorithms to make a model like ChatGPT generate new training data that could be used for recursive self-improvement.
In particular, if you give it access to a console so that it can reliably run code, it could create its own training data and get into recursive self-improvement.
If, for example, you want it to learn to reliably multiply two 4-digit numbers, you can randomly generate 4-digit numbers. Then you let it generate a text answer with individual steps. You let a second model create Python code to validate all the individual calculations in the individual steps. If the Python code validates that all the calculations are correct, you have a new piece of training data on how to multiply two 4-digit numbers.
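A rough sketch of what such validation code could look like follows. Everything in it is an assumption made for illustration: the step format (`x * y = z` and `x + y = z` lines), the function names, and the hard-coded `model_answer`, which stands in for text a model would generate.

```python
import random
import re

# Matches steps such as "1234 * 70 = 86380" or "9872 + 86380 = 96252".
STEP_PATTERN = re.compile(r"(\d+)\s*([*+])\s*(\d+)\s*=\s*(\d+)")


def random_problem(rng):
    """Draw two random 4-digit factors."""
    return rng.randint(1000, 9999), rng.randint(1000, 9999)


def check_steps(answer_text):
    """Return True if every multiplication or addition step in the text holds."""
    steps = STEP_PATTERN.findall(answer_text)
    if not steps:
        return False  # nothing to verify counts as a failure
    for left, op, right, claimed in steps:
        actual = int(left) * int(right) if op == "*" else int(left) + int(right)
        if actual != int(claimed):
            return False
    return True


def is_valid_training_example(a, b, answer_text):
    """Accept the answer only if all steps hold and the last result is a * b."""
    steps = STEP_PATTERN.findall(answer_text)
    return check_steps(answer_text) and int(steps[-1][3]) == a * b


# Hypothetical model output for 1234 * 5678, written out by hand.
model_answer = """
1234 * 8 = 9872
1234 * 70 = 86380
1234 * 600 = 740400
1234 * 5000 = 6170000
9872 + 86380 = 96252
96252 + 740400 = 836652
836652 + 6170000 = 7006652
"""
print(is_valid_training_example(1234, 5678, model_answer))
```

This check only confirms that each arithmetic step is correct and that the final result equals the product. It does not confirm that the steps form a sensible decomposition, so a real pipeline would likely need stricter checks.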
Based on ChatGPT user data, it might be possible to create an automated system that finds problems where ChatGPT currently gives a wrong answer most of the time and figures out how to create code that analyses newly created examples to see whether they are correct.
Replies from: Ericf
↑ comment by Ericf · 2023-01-18T00:20:11.448Z · LW(p) · GW(p)
I'll just note here that "ability to automate the validation" is only possible when we already know the answer. Since the automated loom, computers have been a device for doing the same thing, over and over, very fast.
Replies from: ChristianKl
↑ comment by ChristianKl · 2023-01-18T02:06:42.473Z · LW(p) · GW(p)
You don't necessarily need to know the correct answer beforehand to be able to validate whether or not an answer is correct. If we take Eliezer's problem of generating text that matches a given hash value, it's easy to validate whether an answer is true or not even if you don't know the answer beforehand.
What's important is that the AI is sometimes able to generate correct answers. If the criteria for a correct answer are well-defined enough it can go from solving a problem 1% of the time correctly to solving it 100% of the time correctly.
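One way to see why a 1% success rate can be enough: with a reliable validator, repeated sampling alone already yields verified answers that could serve as training data. The sketch below assumes independent attempts and a perfect validator, both idealizations, and the function names are made up for illustration.

```python
import math


def success_within(p_single, attempts):
    """Probability that at least one of `attempts` independent tries is correct."""
    return 1.0 - (1.0 - p_single) ** attempts


def attempts_needed(p_single, target):
    """Smallest number of independent tries reaching the target probability."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


# A solver that is right 1% of the time, with a 99% target for one verified hit.
n = attempts_needed(0.01, 0.99)
print(n, success_within(0.01, n))
```

This only covers collecting verified examples by resampling. Whether training on them pushes the single-attempt success rate toward 100% is the empirical claim made above, not something the calculation shows.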
ChatGPT is used by millions of people, and a good portion of them will click the feedback button, especially if the UI is optimized for that. It's possible to build automated processes that look at the problems where it currently makes frequent mistakes and learn to avoid them. It is possible to build a self-improving system around that.
If you let it do that for 10,000 different problems I would expect that it learns some reasoning habits that generalize and are useful for solving other problems as well.
comment by Yair Halberstadt (yair-halberstadt) · 2023-01-17T11:35:25.929Z · LW(p) · GW(p)
You've failed to convince me that "Lideal - Ltext is noise", or even offer any arguments for that. Could you elaborate more please? This seems potentially very interesting and relevant.
Replies from: gjm, ChristianKl
↑ comment by ChristianKl · 2023-01-17T12:42:42.338Z · LW(p) · GW(p)
Signal and noise depend a bit on the perspective. I apply Kahneman's conception; if you are interested in more, reading his latest book or listening to a podcast in which he talks about it is a good start.
Kahneman goes through the example of an insurance company that treated the inter-rater variation in quotes for insurance policies as noise and then used machine learning to cut down on that noise.
It's worth noting that low noise is not universally desirable. I remember some VC firm saying it had a policy that if all partners thought it should invest in a company, it would not invest. That's because the wisdom of the crowd does not make good VC investment decisions.
comment by janus · 2023-01-18T00:10:56.853Z · LW(p) · GW(p)
True.
Something which can predict everything all the time is more formidable than any demonstrator it predicts: the upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum (though it may not be trivial to extract that knowledge).
-- Simulators [LW · GW]
comment by Olli Järviniemi (jarviniemi) · 2023-01-17T14:03:02.139Z · LW(p) · GW(p)
The fair-goers, having knowledge of oxen, had no bias in their guesses
[EDIT: I read this as "having no knowledge of oxen" instead of "having knowledge of oxen" - is this what you meant? The comment seems relevant nevertheless.]
This does not follow: It is entirely possible that the fair-goers had no specific domain knowledge of oxen, while still having biases arising from domain-general reasoning. And indeed, they probably knew something about oxen -- from Jaynes' Probability Theory:
The absurdity of the conclusion [that polling a billion people gives the height of China's emperor to an accuracy of 0.03 mm] tells us rather forcefully that the √N rule is not always valid, even when the separate data values are causally independent; it is essential that they be logically independent. In this case, we know that the vast majority of the inhabitants of China have never seen the Emperor; yet they have been discussing the Emperor among themselves, and some kind of mental image of him has evolved as folklore. Then, knowledge of the answer given by one does tell us something about the answer likely to be given by another, so they are not logically independent. Indeed, folklore has almost surely generated a systematic error, which survives the averaging; thus the above estimate would tell us something about the folklore, but almost nothing about the Emperor.
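The point of the quote can be written as a short error decomposition. The notation is introduced here for illustration and is not taken from Jaynes: $h$ is the true height, $b$ a systematic error shared by all respondents (the folklore), and $\varepsilon_i$ independent zero-mean noise with standard deviation $\sigma$. The numerical line assumes $\sigma = 1$ m and $N = 10^9$, which reproduces the 0.03 mm figure.

```latex
x_i = h + b + \varepsilon_i,
\qquad
\bar{x} - h = b + \frac{1}{N}\sum_{i=1}^{N}\varepsilon_i,
\qquad
\mathrm{sd}\!\left(\frac{1}{N}\sum_{i=1}^{N}\varepsilon_i\right) = \frac{\sigma}{\sqrt{N}}

\frac{\sigma}{\sqrt{N}} = \frac{1\,\mathrm{m}}{\sqrt{10^{9}}}
\approx 3.2\times 10^{-5}\,\mathrm{m} \approx 0.03\,\mathrm{mm}
```

Only the noise term shrinks with $N$. The shared error $b$ is untouched by averaging, so the pooled estimate converges to $h + b$ rather than to $h$.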
comment by Ilio · 2023-01-17T18:35:53.409Z · LW(p) · GW(p)
The main point is correct, but maybe you should mention your demonstration is especially easy, but not necessarily the main reason (unless that’s what you think?). Also:
The fair-goers, having knowledge of oxen, had no bias in their guesses, thus the error was entirely due to random noise.
If you meant the crowd had no bias on average, that’s indeed the idea. But one can read your sentence as meaning that each individual had no bias, which would break the whole wisdom-of-the-crowd idea (because then Galton wouldn’t need a crowd: he could simply repeat the measurement process in one individual).
comment by Satori Atman (satori-atman) · 2023-01-18T08:12:29.217Z · LW(p) · GW(p)
I'm so happy that I didn't go to sleep, because I got to read this masterpiece of an article as soon as it was published.