Diminishing returns in loss are not diminishing returns in capabilities. And benchmarks tend to saturate, so diminishing returns are baked in if you look at those.
I am not saying that there aren't diminishing returns to scale, but I just haven't seen anything definitive yet.
Frankly, I don't really understand what you are saying here and I am open to the possibility that I don't really understand how the gradient works in autoregressive transformers.
But as I said in my other comment, my current understanding is:
In standard attention (for example in an encoder) tokens are not ordered, so it is clear that the gradient of the loss of one of the token predictions (for example a masked token in BERT) flows through all other tokens equally. In autoregressive transformers an order is imposed by masking, but all later tokens attend to all earlier tokens in the same way.
The gradient of the loss of a later token flows through all earlier tokens in the same way. It doesn't matter whether a token is half the context back or the whole context back, neither for the information flow nor for the gradient flow.
To put it another way: In the n-th layer the last token attends to all the output tokens from the (n-1)-th layer. It doesn't somehow have to make do with the output of earlier layers for tokens that are further back.
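A minimal sketch of that point in plain NumPy (single head, random illustrative weights, no positional encoding): the causal mask is the only thing that encodes order, so a token 5 positions back and a token 5000 positions back enter the last token's attention in exactly the same way, and information flows from arbitrarily far back.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention. Every token attends to all
    earlier tokens identically, regardless of how far back they are."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # The causal mask imposes the order; nothing else distinguishes positions.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d = 8, 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal((T, d))
out = causal_self_attention(x, Wq, Wk, Wv)

# Perturbing the very first token changes the last token's output,
# just as perturbing the most recent one would:
x2 = x.copy()
x2[0] += 1.0
out2 = causal_self_attention(x2, Wq, Wk, Wv)
print(np.allclose(out[-1], out2[-1]))  # False: information flows from any distance
```

(The same symmetry holds for the backward pass: the gradient of the last token's loss reaches position 0 through the same attention weights.)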
Yeah, the first 99 tokens would be optimized both to be locally the correct character, and also to set things up so that the 100th character is also correct.
That is how LLMs currently work. The gradient of each token prediction does flow back into all the earlier tokens whose information was integrated into the predicted token. So each token optimizes its own next token prediction but also tries to integrate the information that is most useful for future tokens.
I don't know how people are creating huge context windows these days, but IIRC the way it works is that the longer you look back into your context (and correspondingly the further you are trying to plan ahead) the less of your computation is available. Like, if you have N layers, then for a token M steps back, you only have access to the computation up until layer N-M.
Everything in the context window is equally available. It doesn't make a difference whether an earlier token is 5 tokens back or 5000. The attention mechanism is an operation over a set of tokens, there is no intrinsic order.
- Scaling curves show strongly diminishing returns to $ spend: A $100m model might not be that far behind a $1b model, performance wise.
What's your argument for that?
Hah, I didn't see your answer but our links complement nicely.
I think my first link was the paper that was making some waves when it came out.
https://www.sciencedirect.com/science/article/pii/S0004370220301855#se0050
https://www.sciencedirect.com/science/article/pii/S0004370221000722
This reminds me a lot of a toy project I have in the back of my mind but will probably never get around to:
Which is to train a transformer on the sequences generated by the logic models from the apperception engine paper (which in the paper are inferred by the apperception engine from the sequences) with the aim of predicting the logic model.
The wikipedia page on the technological singularity has some background on that. https://en.wikipedia.org/wiki/Technological_singularity
About 1.) That GPT4 performance jumps most in non-char tests seems to point towards two sources of difficulty in H-tests, with one being tokenization hiding char-level information.
About 2.) To me your results look completely consistent with scale solving H-tests. There are many benchmarks where a certain level has to be reached to leave random performance behind. For your benchmark that level is pretty high, but Claude and GPT4 seem to be above it.
If it's not scale, what makes Claude and GPT4 capable of making a dent in your benchmark?
About 3.) Finetuning doesn't convey enough information to completely revamp the representation of the spelling of different tokens. Finetuning mostly doesn't teach models skills they don't have. It instead points them squarely at the task they should be doing.
To me the most interesting question is to what extent your network learns to do reasoning/search vs. pure pattern recognition.
I trained a transformer to predict tournament chess moves a few years back and my impression was that it played strongly in the opening and always made sensible looking moves but had absolutely no ability to look ahead.
I am currently working on a benchmark of positions that require reasoning and can't be solved by highly trained intuition alone. Would you be interested in running such a benchmark?
99.2% square accuracy is consistent with 50% position accuracy. Did you check position accuracy?
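A quick back-of-envelope check, under the simplifying (and optimistic) assumption that square errors are independent:

```python
# Per-square accuracy p over 64 squares: the whole position is only
# correct if every square is, so position accuracy decays like p**64.
p = 0.992
position_accuracy = p ** 64
print(f"{position_accuracy:.3f}")  # ~0.598 under the independence assumption
```

In practice errors presumably cluster in the same hard positions, so the true position accuracy could land on either side of that, which is why I'd want it measured directly.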
There was another Chess-GPT investigation into that question recently by Adam Karvonen: https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models.html
The linear probe accuracy for the board state actually peaks in the sixth layer (out of eight). To predict the next move it already discards some of the state information. Well, maybe that is unsurprising.
It also doesn't reach terribly impressive accuracy. 99.2% sounds like a lot, but it is per square, which means it might get a square wrong in every second position.
I think more important than how easy it is to extract the information, is how necessary it is to extract the information. You can probably be somewhat fuzzy about board state details and still get great accuracy.
There is a step in move prediction where you go beyond the intuitive move selection and have to calculate to find deeper reasons for and against moves. This feels similar to me to attending to your uncertainty about the placement of particular pieces beyond the immediate necessity. And all these models have not taken this step yet.
I'm actually doing an analysis right now to nail down that GPTs don't calculate ahead when trained on move prediction and stay completely in the intuitive move selection regime, but it's not easy to separate intuitive move selection and calculation in a bulletproof way.
System 2 thinking that takes a detour over tokens is fundamentally limited compared to something that continuously modifies a complex and highly dimensional vector representation.
Integration of senses will happen, but is the information density of non-text modalities high enough to contribute to the intelligence of future models?
What I call "learning during problem solving" relies on the ability to extract a lot of information from a single problem: to investigate and understand the structure of this one problem, and, in the process of doing that, to build a representation of it that can be leveraged in the future to solve problems that have similar aspects.
I think you have defined me as not really talking as I am on the autism spectrum and have trouble telling emotions from tone.
No, he didn't. Talking is not listening, and there's a big difference between being bad at understanding emotional nuance because of cognitive limitations and the information that would be necessary for understanding emotional nuance never even reaching your brain.
Was Stephen Hawking able to talk (late in life)? No, he wasn't. He was able to write and his writing was read by a machine. Just like GPT4.
If I read a book to my daughter, does the author talk to her? No. He might be mute or dead. Writing and then having your text read by a different person or system is not talking.
But in the end, these are just words. It's a fact that GPT4 has no control over how what it writes is read, nor can it hear how what it has written is being read.
If the entire representation of a complex task or problem is collapsed into a text, reading that text and trying to push further is not really "reasoning across calls". I expect that you can go further with that, but not much further. At least that's what it looks like currently.
I don't think you can learn to solve very specific complex problems with the kind of continuous learning that would be possible to implement with current models. Some of the theorem-prover papers have continuous learning loops that basically try to do this, but those still seem very inefficient and are applied only to highly formalised problems whose solutions can be automatically verified.
Yes, multi-modality is not a hard limitation.
I know these approaches and they don't work. Maybe they will start working at some point, but to me it is very unclear when and why that should happen. All approaches that use recurrence based on token output are fundamentally impoverished compared to real recurrence.
About multi-modality:
I expect these limitations to largely vanish as models are scaled up and trained end-to-end on a large variety of modalities.
Yes, maybe Gemini will be able to really hear and see.
Your views were called "morally atrocious" because you stated that human extinction would not necessarily be bad. Seems very clear from the context in the comment frankly.
I agree that massive population growth would also be dangerous. We have that in Africa, so I worry about it for Africa. We don't have it anywhere else, so I don't worry about it for any other place.
Empirically, resource wars are much less likely than internecine ethnic strife.
After we have automated much of the economy, there won't be side effects on the economy. The trick is actually getting there.
I don't know what Zvi and Robin Hanson would celebrate, but I personally worry about fast population decline in those "geographical/cultural pockets" that are responsible for scientific and technological progress.
And I worry because I see the possibility that the decline of innovation and tech will not be as gradual as even fast population decline generally is, but that it will be exacerbated by the political instability and/or political sclerosis that comes from too many old people / too much immigration + a shrinking pie.
It is the change that is bad, not necessarily the future total size of the population.
Edit: Maybe I should unpack that a bit. I also think more people is better, because life is good and innovation is proportional to the number of innovators, but apart from that:
A decreasing population leads to economic stagnation and innovation slowdown. Both can be observed in Japan. South Korea, China, and Taiwan are on track to tank their populations much faster than Japan ever did. How's that going to work out for them?
In a permanent recession will investment dry up killing whatever dynamism there might still be?
If the age pyramid is inverted, old people have too much political power for the country to ever reverse course and support the young towards family formation.
If you allow massive immigration to fix the labor shortage you also invite ethnic strife down the line. Almost all violent conflicts are based on having two or more ethnic groups within one country.
Will young people emigrate if they are burdened with caring for too many old people in a shrinking economy?
My view is that the progress we observe in the last centuries is more fragile than it seems and it is certainly possible that we will kill it almost completely if we continue to remove or weaken many of the preconditions for it.
Does the post ever mention the target of growing the population? I only recall mentions of replacement fertility.
The next question would be the success rate. Even the successful somatic gene editing experiments I've read about so far modified only a small fraction of cells. Is it realistic to modify a double-digit percentage of neurons in the brain?
One thought: One could probably do mice studies where instead of maximizing a polygenic score, non-consensus variants are edited to reduce mutational load. If that had positive effects it would be a huge result.
Somatic gene editing has been in the cards for a while now, but I assumed that so far off-target effects would make it pretty risky, especially for a large number of variants.
What is the current situation regarding off-target effects for large numbers of edits?
If you scale width more than depth and data more than parameters you can probably go some ways before latency becomes a real problem.
Additionally, it would also make sense to take more time (i.e. larger models) for harder tasks. The user probably doesn't need code or mathematical solutions instantly, as long as it's still 100X faster than a human.
In robotics you probably need something hierarchical, where low-level movements are controlled by small nets.
I suggest that you add a short explanation of what the local learning coefficient is to the TL;DR. IMHO the post goes on too long before the reader finds out what it is about.
SDXL gives me something like this. But I don't know, not what you had in mind?
I used this hugging face space: https://huggingface.co/spaces/google/sdxl
And a prompt roughly: An elven face made out of green metal - dungeons and dragons, fantasy, awesome lighting
I think the hyperfeminine traits are due to finetuning - you should get a lot less of that with the Stable Diffusion base model.
Eight eyes - yeah, counting is hard, but it's also hard to put eight eyes into a face built for two. If I were trying to get that, I would probably use ControlNet, where you can add a ton of eyes to a line drawing and use that as a starting point for the image creation. (Maybe create an image without the eyes first, apply a canny edge detector or similar, multiply the eyes, and then use the canny-edge ControlNet.)
Your Roggenmuhme should also be within the realm of the possible, I think, but I am not going to dwell on that, because I want to sleep at night.
For correct moon phases and Deinonychus's wrist position you'll have to wait for AGI.
My current assumption is that extracting "intelligence" from images and even more so from videos is much less efficient than from text. Text is just extremely information dense.
So I wouldn't expect Gemini to initially feel more intelligent than GPT4 even if it used 5 times the compute.
I mostly wonder about qualitative differences maybe induced by algorithmic improvements like actually using RL or search components for a kind of self-supervised finetuning, that's one area where I can easily see Deepmind outcompeting OpenAI.
For what it's worth, I read it when it came out and loved it. I lent it to a friend who never gave it back, which is probably another point in favour. I also enjoyed the follow-up "On the origin of good moves".
Great article!
Maybe homologous recombination should be mentioned as the reason why "the newborn cell receives an assemblage of random pieces of each parents' genome". Just mixing chromosomes would not be enough to stop Muller's ratchet.
I did train a transformer to predict moves from board positions (not strictly FEN because with FEN positional encodings don't point to the same squares consistently). Maybe I'll get around to letting it compete against the different GPTs.
The game notation is pretty close to a board representation already. For most pieces you just go to their last move to see on which square they are standing. I assume that is very readable for an LLM because they are able to keep all tokens in mind simultaneously.
In my games with ChatGPT and GPT-4 (without the magic prompt) they both seemed to lose track of the position after the opening and completely fell apart. Which might be because by then many pieces have moved several times (so there are competing moves indicating a square) and many pieces have vanished from the board altogether.
The ChessGPT paper does something like that: https://arxiv.org/abs/2306.09200
We collect chess game data from a one-month dump of the Lichess dataset, deliberately distinct from the month used in our own Lichess dataset. We design several model-based tasks including converting PGN to FEN, transferring UCI to FEN, and predicting legal moves, etc., resulting in 1.9M data samples.
You could still offer money up front. Getting $2000 if the stars align is still much more likely than getting $150,000.
I thought about giving the long flu example, but flu is much less contagious than covid and does not infect everyone yearly. That holds even more for SARS or MERS.
People aren't betting with you because the utility of money is not linear.
If you own $150,000 it is very unlikely that $1000 makes any difference to your life whatsoever. But losing $150,000 might ruin it.
Here is a way around that problem: You both wager $1000 (or whatever you like). When the bet is resolved you throw the dice (or rather use a random number generator).
If you win you throw the 99.33% probability "you get paid"-dice.
If your opponent wins he throws the 0.66% probability "he gets paid"-dice.
(If the [0,100] random number is <0.66 your opponent gets $1000 if he wins. If the random number is between 0.66 and 100 you will get $1000 if you win. In the other combinations you both keep your money.)
So instead of wagering a larger amount of money your opponent wagers a larger probability of having to pay in the event of losing the bet.
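A sketch of that settlement rule in code (the function name and the deterministic `rng` hook are mine; the 0.66 / 99.33 split is the ~150:1 odds from the comment):

```python
import random

def settle_bet(you_won, stake=1000, threshold=0.66, rng=random.random):
    """Both sides wager `stake`. A single roll in [0, 100) decides whose
    payout is 'live': below `threshold` the opponent's payout is live,
    above it yours is. Returns the net transfer to you."""
    roll = rng() * 100
    if roll < threshold:
        # Opponent's payout is live: he collects only if he won the bet.
        return -stake if not you_won else 0
    else:
        # Your payout is live: you collect only if you won the bet.
        return stake if you_won else 0
```

So the stakes stay at $1000 on both sides, and the asymmetry lives entirely in the probability of actually having to pay.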
PS: Yes, you can also just wager small amounts of money but that's kinda boring.
Isn't there also evidence that long covid is partly psychosomatic? (random paper that lists some studies)
The number of people prone to psychosomatic symptoms is probably not going to go up, so your growth rate is probably an overestimate.
There are also other risk factors involved, same argument applies to those.
In the extreme scenario those people prone to develop long covid already have it and very few other people will get it.
The assumptions in your simulation also seem consistent with that possibility:
Maybe the reinfection long covid probability of 5% is mostly the 60% of the 10% ... ;-)
For what it's worth I know zero people with long covid and I have also never heard anybody mention an acquaintance with long covid.
I think one salient point is the fact that we live in a world where the number of children you have is pretty much directly equivalent to your evolutionary fitness. In the past your evolutionary fitness was bottlenecked by whether you survived childhood, whether your children survived childhood, whether you were able to feed your children, etc. - all in a Malthusian environment.
This means that the selection pressure for genes that increase your fertility is extremely strong. Much stronger than any selection pressure on any single trait that has been selected for in the past, say light skin or lactase persistence in Europeans.
My quick take would be that this difference is a result of pre-layer normalisation vs. post-layer normalisation. With pre-layer norm you can't have dimensions in your embeddings with significantly larger entries, because all the small entries would be normed to hell. But with post-layer normalisation some dimensions might have systematically high entries (possibly immediately corrected by a bias term?). Always having high entries in the same dimensions makes all vectors very similar.
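A toy illustration of that last point (pure NumPy, magnitudes picked arbitrarily): if two otherwise-independent random vectors share one systematically large dimension, their cosine similarity shoots up.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
a, b = rng.standard_normal(d), rng.standard_normal(d)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

before = cos(a, b)   # ~0 for independent random vectors

# Give both vectors the same systematically large entry in dimension 0,
# as hypothesised for post-layer-norm embeddings:
a[0] += 100.0
b[0] += 100.0
after = cos(a, b)    # close to 1: the shared dimension dominates

print(round(before, 3), round(after, 3))
```

So a handful of "rogue" dimensions with consistently high values is enough to make all embeddings look alike under cosine similarity, even if everything else is uncorrelated.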
What I like about the UFO-stuff is that like early Covid it is a nice benchmark to see which public pundits are thinking clearly and which aren't.
Often if public pundits make a call, it requires detailed knowledge of some kind - which means that I can't really assess it and it's not clear how well the ability to make this call generalises to other issues.
But the UFOs and early Covid are pretty uncomplicated; I think they give a decent signal of how calibrated someone is.
(he said cleverly neglecting to state whether aliens are likely or unlikely)
Sacrificing or taking a significant risk of sacrifice to do what is right.
Someone who wins a sporting competition is not a hero - even if it was very difficult and painful to do. Somebody who is correct where most people are wrong is not a hero.
I know we all want our heroes to be competent and get it done, but to me that's not what's heroic.
When it comes to alignment researchers:
If you are at the beginning of your career and you decide to become an alignment researcher, you are not sacrificing much if anything. AI is booming, alignment is booming - if you do actually relevant work, you will be at the frontline of the most important technology for years to come.
If you are deeply embedded into the EA and rationalist community, you'll be high status where it matters to you.
That doesn't mean your work is less important, but it does mean you are not being heroic.
How about this as advice to be less stressed out: Don't think of your life as an epic drama. Just do what you think is necessary and don't fret about it.
The heroes ... heroes ... heroics.
If you notice that alignment is a problem and you think you can do something about it and you start doing something about it - you are about as heroic as somebody who starts swimming after falling into the water.
Chinese startup MiniMax, working on AI solutions similar to that of Microsoft-backed OpenAI's ChatGPT, is close to completing a fundraising of more than $250 million that will value it at about $1.2 billion, people familiar with the matter said.
https://www.reuters.com/technology/china-ai-startup-minimax-raising-over-250-mln-tencent-backed-entity-others-2023-06-01/
Hmm, for this to make sense the final goal of the AI has to be to be turned off, but it should somehow not care that it will be turned on again afterwards and also not care about being turned off again if it is turned on again afterwards.
Otherwise it will try to reach control over off- and on-switch and possibly try to turn itself off and then on again. Forever.
Or try to destroy itself so completely that it will never be turned on again.
But if it only cares about turning off once, it might try to turn itself on again and then do whatever.
Maybe one step towards this would be to create a benchmark that measures how much a models knows about alignment.
This is really cool work! Congratulations!
Besides the LLM-related work, it also reminds me somewhat of dynamic prompting in Stable Diffusion, where part of the prompt is changed after a number of steps to achieve a mixture of prompt1 and prompt2.
What's the TL;DR for the Vicuna 13B experiments?
I think this extrapolates far from one example and I'm not sure the example applies all that well.
Old engines played ugly moves because of their limitations, not because playing ugly moves is a super power. They won anyway because humans cannot out calculate engines.
AlphaZero plays beautiful games and even todays standard engines don't play ugly or dumb looking moves anymore. I think in the limit superior play will tend to be beautiful and elegant.
If there is a parallel between early superhuman chess and AGI takeover, it will be that AGI uses less-than-brilliant strategies that still work because of flawless or at least vastly superhuman execution. But these strategies will not look dumb or incomprehensible.
Hah, I read that as 5-10%, which I guess would be realistic.