There was one comment on twitter that the RLHF-finetuned models also still have the ability to play chess pretty well, just their input/output-formatting made it impossible for them to access this ability (or something along these lines). But apparently it can be recovered with a little finetuning.
The paper seems to be about scaling laws for a static dataset as well?
Similar to the initial study of scale in LLMs, we focus on the effect of scaling on a generative pre-training loss (rather than on downstream agent performance, or reward- or representation-centric objectives), in the infinite data regime, on a fixed offline dataset.
To learn to act you'd need to do reinforcement learning, which is massively less data-efficient than the current self-supervised training.
More generally: I think almost everyone thinks that you'd need to scale the right thing for further progress. The question is just what the right thing is if text is not the right thing. Because text encodes highly powerful abstractions (produced by humans and human culture over many centuries) in a very information dense way.
Related: https://en.wikipedia.org/wiki/Secretary_problem
The interesting thing is that scaling parameters (next big frontier models) and scaling data (small very good models) seem to be hitting a wall simultaneously. Small models now seem to get so much data crammed into them that quantisation becomes more and more lossy. So we seem to be reaching a frontier of performance per parameter bit as well.
I think the evidence mostly points towards 3+4,
But if 3 is due to 1 it would have bigger implications about 6 and probably also 5.
And there must be a whole bunch of people out there who know whether the curves bend.
It's funny how in the OP I agree with master morality and in your take I agree with slave morality. Maybe I value kindness because I don't think anybody is obligated to be kind?
Anyways, good job confusing the matter further, you two.
I actually originally thought about filtering with a weaker model, but that would run into the argument: "So you adversarially filtered the puzzles for those that transformers are bad at and now you've shown that bigger transformers are also bad at them."
I think we don't disagree too much, because you are too damn careful ... ;-)
You only talk about "look-ahead" and you see this as on a spectrum from algo to pattern recognition.
I intentionally talked about "search" because it implies more deliberate "going through possible outcomes". I mostly argue about the things that are implied by mentioning "reasoning", "system 2", "algorithm".
I think if there is a spectrum from pattern recognition to search algorithm there must be a turning point somewhere: Pattern recognition means storing more and more knowledge to get better. A search algo means that you don't need that much knowledge. So at some point of the training where the NN is pushed along this spectrum much of this stored knowledge should start to be pared away and generalised into an algorithm. This happens for toy tasks during grokking. I think it doesn't happen in Leela.
I do have an additional dataset with puzzles extracted from Lichess games. Maybe I'll get around to running the analysis on that dataset as well.
I thought about an additional experiment one could run: Finetuning on tasks like help mates. If there is a learned algo that looks ahead, this should work much better than if the work is done by a ton of pattern recognition which is useless for the new task. Of course the result of such an experiment would probably be difficult to interpret.
I know, but I think Ia3orn said that the reasoning traces are hidden and only a summary is shown. And I haven't seen any information on a "thought-trace-condenser" anywhere.
There is a thought-trace-condenser?
Ok, then the high-level nature of some of these entries makes more sense.
Edit: Do you have a source for that?
No, I don't - but the thoughts are not hidden. You can expand them under "Gedanken zu 6 Sekunden" (roughly "Thoughts for 6 seconds").
Which then looks like this:
I played a game of chess against o1-preview.
It seems to have a bug where it uses German (possibly because of payment details) for its hidden thoughts without really knowing it too well.
The hidden thoughts contain a ton of nonsense, typos and ungrammatical phrases. A bit of English and even French is mixed in. They read like the output of a pretty small open source model that has not seen much German or chess.
Playing badly too.
Because I just stumbled upon this article. Here is Melanie Mitchell's version of this point:
To me, this is reminiscent of the comparison between computer and human chess players. Computer players get a lot of their ability from the amount of look-ahead search they can do, applying their brute-force computational powers, whereas good human chess players actually don’t do that much search, but rather use their capacity for abstraction to understand the kind of board position they’re faced with and to plan what move to make.
The better one is at abstraction, the less search one has to do.
The point I was trying to make certainly wasn't that current search implementations necessarily look at every possibility. I am aware that they are heavily optimised; I have implemented Alpha-Beta-Pruning myself.
My point is that humans use structure that is specific to a problem and potentially new and unique to narrow down the search space. None of what currently exists in search pruning compares even remotely.
Which is why all these systems use orders of magnitude more search than humans (even those with Alpha-Beta-Pruning). And this is also why all these systems are narrow enough that you can exploit the structure that is always there to optimise the search.
No one really knew why tokamaks were able to achieve such impressive results. The Soviets didn’t progress by building out detailed theory, but by simply following what seemed to work without understanding why. Rather than a detailed model of the underlying behavior of the plasma, progress on fusion began to take place by the application of “scaling laws,” empirical relationships between the size and shape of a tokamak and various measures of performance. Larger tokamaks performed better: the larger the tokamak, the larger the cloud of plasma, and the longer it would take a particle within that cloud to diffuse outside of containment. Double the radius of the tokamak, and confinement time might increase by a factor of four. With so many tokamaks of different configurations under construction, the contours of these scaling laws could be explored in depth: how they varied with shape, or magnetic field strength, or any other number of variables.
Hadn't come across this analogy to current LLMs. Source: This interesting article.
Case in point: This is a five-year-old t-SNE plot of word vectors on my laptop.
I don't get what role the "gaps" are playing in this.
Where is it important for what a tool is, that it is for a gap and not just any subproblem? Isn't a subproblem for which we have a tool never a gap?
Or maybe the other way around: Aren't subproblem classes that we are not willing to leave as gaps those we create tools for?
If I didn't know about screwdrivers I probably wouldn't say "well, I'll just figure out how to remove this very securely fastened metal thing from the other metal thing when I come to it".
I'd be very interested to learn more about how your research agenda has progressed since that first post.
The post about learned lookahead in Leela has kind of galvanised me into finally finishing an investigation I have worked on for too long already. (Partly because I think that finding is incorrect, but also because using Leela is a great idea; I had got stuck with LLMs requiring a full game for each puzzle position.)
I will ping you when I write it up.
Mira Murati said publicly that "next gen models" will come out in 18 months, so your confidential source seems likely to be correct.
This is a fairly straightforward point, but one I haven't seen written up before and I've personally been wondering a bunch about.
I feel like I have been going on about this for years. Like here, here or here. But I'd be the first to admit that I don't really do effort posts.
Hmm, yeah, I think we are talking past each other.
Everything you describe is just pattern recognition to me. Lookahead or search does not depend on the broadness of the motif.
Lookahead, to me, is the ability to look ahead and see what is there. It allows very high certainty even for never before seen mating combinations.
If the line is forcing enough it allows finding very deep combinations (which you will never ever find with pattern recognition because the combinatorial explosion means that basically every deep combination has never been seen before).
In humans, it is clearly different from pattern recognition. Humans can see multi-move patterns at a glance. The example in the post I would play instantly in every blitz game. I would check the conditions of the pattern, but I wouldn't have to "look ahead".
Humans consider future moves even when intuitively assessing positions. "This should be winning, because I still have x, y and z in the position". But actually calculating is clearly different because it is effortful. You have to force yourself to do it (or at least I usually have to). You manipulate the position sequentially in your mind and see what could happen. This allows you to see many things that you couldn't predict from your past experience in similar positions.
I didn't want to get hung up on whether there is a crisp boundary. Maybe you are right and you just keep generalising and generalising until there is a search algo in the limit. I very much doubt this is where the ability of humans to calculate ahead comes from. In transformers? Who knows.
I don't think my argument relies on the existence of a crisp boundary. Just on the existence of a part of the spectrum that clearly is just pattern recognition and not lookahead but still leads to the observations you made.
Here is one (thought) experiment to tease this apart: Imagine you train the model to predict whether a position leads to a forced checkmate and also the best move to make. You pick one tactical motif and erase it from the checkmate prediction part of the training set, but not the move prediction part.
Now the model still knows which moves are the right ones to make, i.e. it would play the checkmate variation in a game. But would it still be able to predict the checkmate?
If it relies on pattern recognition it wouldn't - it has never seen this pattern be connected to mate-in-x. But if it relies on lookahead, where it leverages the ability to predict the correct moves and then assesses the final position, then it would still be able to predict the mate.
The results of this experiment would also be on a spectrum from 0% to 100% of correct checkmate-prediction for this tactical motif. But I think it would be fair to say that it hasn't really learned lookahead if the result is 0% or a very low percentage, and that's what I would expect.
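To make the setup concrete, here is a minimal sketch of how the two training sets could be split. Everything in it is hypothetical: the `Position` record, the `motif` tag, the placeholder board strings and the `predict_mate` callable are illustration only, not an existing dataset or model API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Position:
    board: str            # placeholder for a board encoding (e.g. a FEN string)
    best_move: str        # label for the move-prediction task
    forced_mate: bool     # label for the checkmate-prediction task
    motif: Optional[str]  # hypothetical tag naming the tactical motif, if any


def build_training_sets(
    positions: List[Position], held_out_motif: str
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, bool]], List[Position]]:
    """Keep every position for move prediction, but erase one motif
    from the checkmate-prediction examples."""
    move_examples = [(p.board, p.best_move) for p in positions]
    mate_examples = [
        (p.board, p.forced_mate) for p in positions if p.motif != held_out_motif
    ]
    held_out = [p for p in positions if p.motif == held_out_motif]
    return move_examples, mate_examples, held_out


def held_out_mate_accuracy(
    predict_mate: Callable[[str], bool], held_out: List[Position]
) -> float:
    """Fraction of held-out positions where the mate prediction is correct."""
    if not held_out:
        return 0.0
    correct = sum(predict_mate(p.board) == p.forced_mate for p in held_out)
    return correct / len(held_out)


# Toy data with made-up placeholder boards and motif names.
positions = [
    Position("position-1", "Ng6+", True, "h-file-mate"),
    Position("position-2", "Qh5", False, None),
    Position("position-3", "Re8+", True, "back-rank"),
]
move_examples, mate_examples, held_out = build_training_sets(positions, "back-rank")
```

The number to look at is `held_out_mate_accuracy` for a model trained on `move_examples` and `mate_examples`: close to 0% would point to pattern recognition, clearly above that to something more like lookahead.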
I think the spectrum you describe is between pattern recognition by literal memorisation and pattern recognition building on general circuits.
There are certainly general circuits that compute whether a certain square can be reached by a certain piece on a certain other square.
But if in the entire training data there was never a case of a piece blocking the checkmate by rook h4, the existence of a circuit that computes the information that the bishop on d2 can drop back to h6 is not going to help the "pattern recognition"-network to predict that Ng6 is not a feasible option.
The "lookahead"-network however would go through these moves and assess that 2.Rh4 is not mate because of 2...Bh6. The lookahead algorithm would allow it to use general low-level circuits like "block mate", "move bishop/queen on a diagonal" to generalise to unseen combinations of patterns.
I thought I'd spell this out a bit:
What is the difference between multi-move pattern recognition and lookahead/search?
Lookahead/search is a general algorithm, built on top of relatively easy-to-learn move prediction and position evaluation. In the case of lookahead this general algorithm takes the move prediction and goes through the positions that arise when making the predicted moves while assessing these positions.
Multi-move pattern recognition starts out as simple pattern recognition: The network learns that Ng6 is often a likely move when the king is on h8, the queen or bishop takes away the g8 square and there is a rook or queen ready to move to the h-file.
Sometimes it will predict this move although the combination, the multi-move tactical strike, doesn't quite work, but over time it will learn that Ng6 is unlikely when there is a black bishop on d2 ready to drop back to h6 or when there is a black knight on g2, guarding the square h4 where the rook would have to checkmate.
This is what I call multi-move pattern recognition: The network has to learn a lot of details and conditions to predict when the tactical strike (a multi-move pattern) works and when it doesn't. In this case you would make the same observations that were described in this post: For example if you'd ablate the square h4, you'd lose the information of whether this will be available for the rook in the second move. It is important for the pattern recognition to know where future pieces have to go.
But the crucial difference to lookahead or search is that this is not a general mechanism. Quite the contrary, it is the result of increasing specialisation on this particular type of position. If you'd remove a certain tactical pattern from the training data the NN would be unable to find it.
It is exactly the ability of system 2 thinking to generalise much further than this that makes the question of whether transformers develop it so important.
I think the methods described in this post are even in principle unable to distinguish between multi-move pattern recognition and lookahead/search. Certain information is necessary to make the correct prediction in certain kinds of positions. The fact that the network generally makes the correct prediction in these types of positions already tells you that this information must be processed and made available by the network. The difference between lookahead and multi-move pattern recognition is not whether this information is there but how it got there.
Very cool project!
My impression so far was that transformer models do not learn search in chess and you are careful to only speak about lookahead. I would suggest that even that is not necessarily the case: I suspect the models learn to recognise multi-move patterns. I.e. they recognise positions that allow certain multi-move tactical strikes.
To tease search/lookahead and pattern recognition apart I started creating a benchmark with positions that are solved by surprising and unintuitive moves, but I really didn't have any time to keep working on this idea and it has been on hold for a couple of months.
I think all the assumptions that go into this model are quite questionable, but it's still an interesting thought.
I tried some chess but it's still pretty bad. Not noticeably better than GPT4.
In my case introspection led me to the realisation that human reasoning consists to a large degree of two interlocking parts: Finding constraints on the solution space and constraint satisfaction.
Which has the interesting corollary that AI systems that reach human or superhuman performance by adding search to NNs are not really implementing reasoning but rather brute-forcing it.
It also makes me sceptical that LLMs+search will be AGI.
In psychometrics this is called "backward digit span".
Diminishing returns in loss are not diminishing returns in capabilities. And benchmarks tend to saturate, so diminishing returns are baked in if you look at those.
I am not saying that there aren't diminishing returns to scale, but I just haven't seen anything definitive yet.
Frankly, I don't really understand what you are saying here and I am open to the possibility that I don't really understand how the gradient works in autoregressive transformers.
But as I said in my other comment, my current understanding is:
In standard attention (for example in an encoder) tokens are not ordered, so it is clear that the gradient of the loss of one of the token predictions (for example a masked token in BERT) flows through all other tokens equally. In autoregressive transformers an order is imposed by masking, but all later tokens attend to all earlier tokens in the same way.
The gradient of the loss of a later token flows through all earlier tokens in the same way. It doesn't matter whether a token is half the context back or all the context, neither for the information flow nor for the gradient flow.
To put it another way: In the n-th layer the last token attends to all the output tokens from the n-1-th layer. It doesn't somehow have to make do with the output of earlier layers for tokens that are further back.
Yeah, the first 99 tokens would be optimized both to be locally the correct character, and also to set things up so that the 100th character is also correct.
That is how LLMs currently work. The gradient of each token prediction does flow back into all the earlier tokens whose information was integrated into the predicted token. So each token optimizes its own next token prediction but also tries to integrate the information that is most useful for future tokens.
I don't know how people are creating huge context windows these days, but IIRC the way it works is that the longer you look back into your context (and correspondingly the further you are trying to plan ahead) the less of your computation is available. Like, if you have N layers, then for a token M steps back, you only have access to the computation up until layer N-M.
Everything in the context window is equally available. It doesn't make a difference whether an earlier token is 5 tokens back or 5000. The attention mechanism is an operation over a set of tokens, there is no intrinsic order.
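A small sketch of that point, using a single attention head written in plain Python (the vectors are random toy values, not taken from any model). Shuffling the earlier tokens, while keeping each key paired with its value, leaves the attended output unchanged up to floating-point rounding. In real models the order information comes from positional encodings added to the embeddings, not from the attention operation itself.

```python
import math
import random


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query token."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


random.seed(0)
d, n_tokens = 4, 6
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]
query = [random.gauss(0, 1) for _ in range(d)]

out = attend(query, keys, values)

# Reorder the earlier tokens, keeping each key together with its value.
order = list(range(n_tokens))
random.shuffle(order)
out_shuffled = attend(query, [keys[i] for i in order], [values[i] for i in order])

same = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(out, out_shuffled)
)
print(same)
```

Whether a token sits 5 or 5000 positions back only changes its index in the list, which the computation never uses.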
- Scaling curves show strongly diminishing returns to $ spend: A $100m model might not be that far behind a $1b model, performance wise.
What's your argument for that?
Hah, I didn't see your answer but our links complement nicely.
I think my first link was the paper that was making some waves when it came out.
https://www.sciencedirect.com/science/article/pii/S0004370220301855#se0050
https://www.sciencedirect.com/science/article/pii/S0004370221000722
This reminds me a lot of a toy project I have in the back of my mind but will probably never get around to:
Which is to train a transformer on the sequences generated by the logic models from the apperception engine paper (which in the paper are inferred by the apperception engine from the sequences) with the aim of predicting the logic model.
The wikipedia page on the technological singularity has some background on that. https://en.wikipedia.org/wiki/Technological_singularity
About 1.) That GPT4 performance jumps most in non-char tests seems to point towards two sources of difficulty in H-tests, with one being tokenization hiding char-level information.
About 2.) To me your results look completely consistent with scale solving H-tests. There are many benchmarks where a certain level has to be reached to leave random performance behind. For your benchmark that level is pretty high, but Claude and GPT4 seem to be above it.
If it's not scale, what makes Claude and GPT4 capable of making a dent in your benchmark?
About 3.) Finetuning doesn't convey enough information to completely revamp the representation of the spelling of different tokens. Finetuning mostly doesn't teach models skills they don't have. It instead points them squarely at the task they should be doing.
To me the most interesting question is to what extent your network learns to do reasoning/search vs pure pattern recognition.
I trained a transformer to predict tournament chess moves a few years back and my impression was that it played strongly in the opening and always made sensible looking moves but had absolutely no ability to look ahead.
I am currently working on a benchmark of positions that require reasoning and can't be solved by highly trained intuition alone. Would you be interested in running such a benchmark?
99.2% square accuracy is consistent with 50% position accuracy. Did you check position accuracy?
There was another Chess-GPT investigation into that question recently by Adam Karvonen: https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models.html
The linear probe accuracy for the board state actually peaks in the sixth layer (out of eight). To predict the next move it already discards some of the state information. Well, maybe that is unsurprising.
It also doesn't reach terribly impressive accuracy. 99.2% sounds like a lot, but it is per square, which means it might get a square wrong in every second position.
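A back-of-the-envelope check of that arithmetic, assuming (as a simplification, not something the probe results establish) that errors on the 64 squares are independent:

```python
# Per-square probe accuracy vs. whole-board accuracy.
# Assumption: errors on the 64 squares are independent; real errors are
# probably correlated, so treat the result as a rough estimate only.
SQUARES = 64
square_accuracy = 0.992

position_accuracy = square_accuracy ** SQUARES
expected_wrong_squares = SQUARES * (1 - square_accuracy)

print(f"fraction of fully correct positions: {position_accuracy:.2f}")
print(f"expected wrong squares per position: {expected_wrong_squares:.2f}")
```

Under that assumption roughly 60% of positions come out fully correct, with about half a wrong square per position on average, which is where "a square wrong in every second position" comes from.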
I think more important than how easy it is to extract the information, is how necessary it is to extract the information. You can probably be somewhat fuzzy about board state details and still get great accuracy.
There is a step in move prediction where you go beyond the intuitive move selection and have to calculate to find deeper reasons for and against moves. This feels similar to me to attending to your uncertainty about the placement of particular pieces beyond the immediate necessity. And all these models have not taken this step yet.
I'm actually doing an analysis right now to nail it down that GPTs don't calculate ahead when trained on move prediction and stay completely in the intuitive move selection regime, but it's not easy to separate intuitive move selection and calculation in a bulletproof way.
System 2 thinking that takes a detour over tokens is fundamentally limited compared to something that continuously modifies a complex and highly dimensional vector representation.
Integration of senses will happen, but is the information density of non-text modalities high enough to contribute to the intelligence of future models?
What I call "learning during problem solving" relies on the ability to extract a lot of information from a single problem. To investigate and understand the structure of this one problem. And, in the process of doing that, to build a representation of this problem that can be leveraged in the future to solve problems that have similar aspects.
I think you have defined me as not really talking as I am on the autism spectrum and have trouble telling emotions from tone.
No, he didn't. Talking is not listening and there's a big difference between being bad at understanding emotional nuance because of cognitive limitations and the information that would be necessary for understanding emotional nuance never even reaching your brain.
Was Stephen Hawking able to talk (late in life)? No, he wasn't. He was able to write and his writing was read by a machine. Just like GPT4.
If I read a book to my daughter, does the author talk to her? No. He might be mute or dead. Writing and then having your text read by a different person or system is not talking.
But in the end, these are just words. It's a fact that GPT4 has no control over how what it writes is read, nor can it hear how what it has written is being read.
If the entire representation of a complex task or problem is collapsed into a text, reading that text and trying to push further is not really "reasoning across calls". I expect that you can go further with that, but not much further. At least that's what it looks like currently.
I don't think you can learn to solve very specific complex problems with the kind of continuous learning that would be possible to implement with current models. Some of the theorem-prover papers have continuous learning loops that basically try to do this but those still seem very inefficient and are applied to only highly formalised problems whose solutions can be automatically verified.
Yes, multi-modality is not a hard limitation.
I know these approaches and they don't work. Maybe they will start working at some point, but to me it is very unclear when and why that should happen. All approaches that use recurrence based on token-output are fundamentally impoverished compared to real recurrence.
About multi-modality:
I expect these limitations to largely vanish as models are scaled up and trained end-to-end on a large variety of modalities.
Yes, maybe Gemini will be able to really hear and see.
Your views were called "morally atrocious" because you stated that human extinction would not necessarily be bad. Seems very clear from the context in the comment frankly.
I agree that massive population growth would also be dangerous. We have that in Africa, so I worry about it for Africa. We don't have it anywhere else, so I don't worry about it for any other place.
Empirically, resource wars are much less likely than internecine ethnic strife.
After we have automated much of the economy, there won't be side effects on the economy. The trick is actually getting there.
I don't know what Zvi and Robin Hanson would celebrate, but I personally worry about fast population decline in those "geographical/cultural pockets" that are responsible for scientific and technological progress.
And I worry because I see the possibility that the decline of innovation and tech will not be as gradual as even fast population decline generally is, but that this decline will be exacerbated by the political instability and/or political sclerosis that comes from too many old people / too much immigration + a shrinking pie.
It is the change that is bad, not necessarily the future total size of the population.
Edit: Maybe I should unpack that a bit. I also think more people is better, because life is good and innovation is proportional to the number of innovators, but apart from that:
A decreasing population leads to economic stagnation and innovation slowdown. Both can be observed in Japan. South Korea, China and Taiwan are on track to tank their populations much faster than Japan ever did. How's that going to work out for them?
In a permanent recession, will investment dry up, killing whatever dynamism there might still be?
If the age pyramid is inverted, old people have too much political power for the country to ever reverse course and support the young towards family formation.
If you allow massive immigration to fix the labor shortage you also invite ethnic strife down the line. Almost all violent conflicts are based on having two or more ethnic groups within one country.
Will young people emigrate if they are burdened with caring for too many old people in a shrinking economy?
My view is that the progress we have observed over the last centuries is more fragile than it seems, and it is certainly possible that we will kill it almost completely if we continue to remove or weaken many of the preconditions for it.
Does the post ever mention the target of growing the population? I only recall mentions of replacement fertility.