I agree, I definitely underestimated video. Before publishing, I had a friend review my predictions and they called out video as being too low, and I adjusted upward in response and still underestimated it.
I'd now agree with 2026 or 2027 for coherent feature film length video, though I'm not sure if it would be at feature film artistic quality (including plot). I also agree with Her-like products in the next year or two!
Personally I would expect cloud compute to still be used for robotics, but only in ways where latency doesn't matter (like a planning and reasoning system on top of a smaller local model, doing deeper analysis like "There's a bag on the floor by the door. Ordinarily it should be put away, but given that it wasn't there 5 minutes ago, it might be actively used right now, so I should leave it..."). I'm not sure the privacy concerns will trump convenience, as with phones.
I also now think virtual agents will start to become a big thing in 2025 and 2026, doing some kinds of remote work, or sizable chunks of existing jobs autonomously (while still not being able to automate most jobs end to end)!
One year and 3 months on, I'm reviewing my predictions! Overall, I mark 13 predictions as true or mostly true, 6 as false or mostly false, and 3 as debatable.
Rest of 2023
- Small improvements to LLMs
- Google releases something competitive to ChatGPT.
- Mostly True | Google had already released Bard at the time, which sucked, but this was upgraded to Gemini and relaunched in December 2023. Gemini Ultra wasn’t released until February 2024 though, so points off for that.
- Anthropic and OpenAI slightly improve GPT-4 and Claude2
- True | GPT-4 Turbo and Claude 2.1 were both released in November 2023.
- Meta or another group releases better open source models, up to around GPT-3.5 level.
- False | Llama 2 had already been released at this time, and was nearly as good as GPT-3.5, but no other GPT-3.5-or-better open source models came out in 2023.
- Small improvements to Image Generation
- Dalle3 gets small improvements.
- Debatable | This is a really lukewarm prediction. Small changes were made to Dalle3 in the rest of 2023, integrating with GPT-4 prompting, for example, though there were complaints they made it worse in an attempt to avoid copyright issues when it was integrated with Bing.
- Google or Meta releases something similar to Dalle3, but not as good.
- Mostly True | Google released Imagen 2 in December 2023, which was about as good as DALL-E 3. I don’t know how much I should penalise myself for it being about as good, rather than ‘not as good’.
- Slight improvements to AI generated videos.
- Basic hooking up of Dalle3 to video generation with tagged on software, not really good consumer stuff yet. Works in an interesting way, like Dalle1, but not useful for much yet.
- True | Lots of people played around with making videos by stepping through frames made in DALL-E 3, and they mostly weren’t very good! Pika 1.0 came out in December 2023, but it also wasn’t that great.
- Further experiments hooking LLMs up to robotics/cars, but nothing commercial released.
- True | Figure AI is the most notable example of hooking up LLMs to robotics, and they did some experiments in late 2023 with GPT-4. As far as I know there wasn’t any commercial release of an LLM-enabled robot anywhere.
- Small improvements in training efficiency and data usage, particularly obviously in smaller models becoming more capable than older, larger ones.
- True | Mistral 7B was notable here, being smaller and more capable than some of the earlier, much larger models like BLOOM 176B (as far as I can tell).
Since those ‘Rest of 2023’ predictions were only for three months in the future, most of them were very trivial to get right -- of course models would get better! Let’s see how predictions further out did:
2024:
- GPT-5 or equivalent is released.
- It’s as big a jump on GPT-4 as GPT-4 was on GPT-3.5.
- Mostly True | While they aren’t named GPT-5, the best released models today are as big an improvement over GPT-4 as GPT-4 was over GPT-3.5, as far as benchmarks can tell. Here’s a comparison table of GPT-3.5 and GPT-4, compared with the best released open weights model (DeepSeek V3), the best released closed weights models (Claude Sonnet 3.5 (New) and o1), and the best known unreleased model (o3).
Benchmark | GPT-3.5 | GPT-4 | DeepSeek-V3 (Open Weights) | Sonnet 3.5 (New) | o1 | o3 |
Context Length | 16k | 8k | 128k | 200k | 128k | / |
HumanEval | 48.1% | 67% | / | 93.7% | / | / |
ARC-AGI | <5% [1] | <5% [1] | / | 20.3% | 32% | 88% |
SWE-bench Verified | 0.4% [2] | 2.8%, 22.4% [3] | 42.0% | 49.0%, 53.0% [4] | 48.9% | 71.7% |
Codeforces [5] | 260 (~1.5%) | 392 (4.0%) | ~1550 (51.6%) | ~1150 (20.3%) | 1891 (~91.0%) | 2727 (~99.3%) |
GPQA Diamond | / | 33.0% | 59.1% | 58.0%, 65.0% [6] | 78.0% | 87.7% |
MATH | / | 52.9% | 90.2% | 78.3% | 94.8% | / |
MMLU | 70.0% | 86.4% | 88.5% | 88.3% | 92.3% | / |
DROP | 64.9 | 80.9 | 91.6 | 87.1 | / | / |
GSM8K | 57.1% | 92.0% | / | 96.4% | / | / |
[1] From ARC Prize “In this method, contestants use a traditional LLM (like GPT-4) and rely on prompting techniques to solve ARC-AGI tasks. This was found to perform poorly, scoring <5%.”
[2] 0.4% with RAG, tested October 2023
[3] 2.8% with RAG, 22.4% with ‘SWE-agent’ structure, tested April 2024.
[4] 49.0% in launch paper, 53.0% on SWE-Verified’s leaderboard with OpenHands + CodeAct v2.1
[5] Sometimes scores were given as a rating, and sometimes as a percentile. They have been converted to match.
[6] 58% published on Epoch AI, 65% claimed in release paper. Likely different assessment (CoT, best of N, etc).
https://x.com/OpenAI/status/1870186518230511844
https://openai.com/index/learning-to-reason-with-llms/
https://www.anthropic.com/news/3-5-models-and-computer-use
https://arxiv.org/pdf/2303.08774v5
- Can do pretty much any task when guided by a person, but still gets things wrong sometimes.
- Debatable | It’s too vague to measure (“pretty much” and “wrong sometimes” -- seriously, what was I thinking). It doesn’t feel like the models can do “any task” in a way that GPT-4 couldn’t, but at the same time “pretty much” every benchmark for LLMs has been saturated, and I ask Claude for help with nearly everything. Agentic tasks can’t be done, but that’s covered by other predictions, and this prediction is about being “guided by a person”.
- Multimodal inputs, browsing, and agents based on it are all significantly better.
- Mostly True | The agent structures as well as the models have improved significantly, as you can see by the same models doing much better on SWE-Bench under newer structures, and by newer models still beating older ones.
- Agents can do basic tasks on computers -- like filling in forms, working in excel, pulling up information on the web, and basic robotics control. This reaches the point where it is actually useful for some of these things.
- Debatable | I could see this being graded either way, depending on specific metrics. Claude with Computer Use can do all of these things (sans robotics control) but isn’t really useful. The individual tasks are usefully done by a mix of Gemini, ChatGPT, and Figure’s (GPT-4o?) robot control, but they aren’t really agents.
- Robotics and long-horizon agents still don’t work well enough for production. Things fall apart if the agent has to do something with too many branching possibilities or on time horizons beyond half an hour or so. This time period / complexity quickly improves as low-hanging workarounds are added.
- Mostly True | There are some production uses for Figure and Tesla’s robots, but these are more similar to traditional industrial robots doing a narrow task than to an agent.
- Context windows are no longer an issue for text generation tasks.
- Algorithmic improvements, or summarisation and workarounds, better attention on infinite context windows, or something like that solves the problem pretty much completely from a user’s perspective for the best models.
- Mostly False | Context windows aren’t nearly as limiting as they were in October 2023: they have grown from ~8k to ~128k, and RAG and other techniques help models intelligently search files and add them to their own context. But it’s definitely not solved. Long outputs like novels still suck, and long inputs like giant codebases or regulations still lead to models missing key details a lot of the time.
- GPT-5 has the context of all previous chats, Copilot has the entire codebase as context, etc.
- Mostly False | Although it is close -- Cursor has coding agents that can intelligently search the codebase for the files they need based on a provided task and add them to their own context, and ChatGPT has a memory feature (which doesn’t work super well). Neither of these is the same thing as just having the previous chats and codebase in context though.
- This is later applied to agent usage, and agents quickly improve to become useful, in the same way that LLMs weren’t useful for everyday work until ChatGPT.
- Mostly False | Agents are not yet useful, outside of some narrow coding agents.
- Online learning begins -- GPT-5 or equivalent improves itself slowly, autonomously, but not noticeably faster than current models are improved with human effort and a training step. It does something like select its own data to train on from all of the inputs and outputs it has received, and is trained on this data autonomously and regularly (daily or more often).
- False | As far as I’m aware, nothing like this is happening.
- AI selection of what data to train on is used to improve datasets in general - training for one epoch on all data becomes less common, as some high quality or relevant parts of giant sets are repeated more often or allowed larger step size.
- Mostly False | The trend has continued to move towards quality over quantity for training data, but I’m not aware of anybody specifically using existing LLMs to select / rank / weight training data automatically. I’m also now aware high quality data was already being repeated more often in the training sets. I don’t think anything is happening with a dynamic learning rate based on anything other than the loss.
- Autonomous generation of data is used more extensively, especially for aligning base models, or for training models smaller than the best ones (by using data generated by larger models).
- True | But also fairly trivial: it’s well known that people are training models on the filtered outputs of earlier ones, and in general synthetic data is working really well, especially for instruction tuning and for ground-truthed domains like maths and coding.
- Code writing is much better, and tie-ins to Visual Studio are better than GPT-4 is today, as well as having much better context.
- True | Cursor, a fork of Visual Studio Code, has pretty capable agents built in that use any model available via API that you like, and they work a lot better than manually pasting problems into ChatGPT did in October of 2023.
- Open source models as capable as GPT-4 become available.
- True | DeepSeek V3 is open weights* and has performance exceeding GPT-4 on most benchmarks. The same goes for Mistral Large 2 and Llama 3.1 405B.
* It’s not entirely open source, as in, the code and data needed to train a copy is not available. But that’s not how ‘open source’ is being used regarding model weights, although I am personally trying to use clearer language now.
- Training and runtime efficiency improves by at least a factor of two, while hardware continues improvements on trend. This is because of a combination of -- datasets improved by AI curation and generation, improved model architecture, and improvements in hyperparameter selection, including work similar to the optimisations gained from discovering Chinchilla scaling laws.
- True | DeepSeek V3 stands out here -- using only 37B active parameters (in a MoE architecture with 671B total), it achieves performance better than GPT-4’s, which is estimated to have more than 1700B. DeepSeek V3 was also trained with only 2048 H800 GPUs for 2 months, compared with GPT-4’s estimated 15000 A100 GPUs for 3 months, which is roughly 11 times as many GPU-months (though on different hardware).
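As a rough sanity check on that last comparison, here is the arithmetic in Python. It only uses the GPU counts and durations quoted above (the GPT-4 figures are estimates), and it treats an H800-month and an A100-month as interchangeable, which is an assumption: they are different chips, so this is not a like-for-like FLOP comparison.

```python
# GPU-month comparison using only the figures quoted in the review above.
# Assumption: one H800-month is counted the same as one A100-month,
# which ignores real differences between the two chips.

def gpu_months(num_gpus: int, months: float) -> float:
    """Total GPU-months for a training run."""
    return num_gpus * months

deepseek_v3 = gpu_months(2048, 2)      # 2048 H800s for 2 months
gpt4_estimate = gpu_months(15000, 3)   # estimated 15000 A100s for 3 months

ratio = gpt4_estimate / deepseek_v3

# Same idea for parameters touched per token: 37B active vs an estimated 1700B.
param_ratio = 1700 / 37

print(f"DeepSeek V3:       {deepseek_v3:,.0f} GPU-months")
print(f"GPT-4 (estimated): {gpt4_estimate:,.0f} GPU-months")
print(f"GPU-month ratio:   {ratio:.1f}x")
print(f"Parameter ratio:   {param_ratio:.1f}x")
```

Either way you slice it, the gap is well past the "factor of two" in the prediction, even allowing for a lot of error in the GPT-4 estimates.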
You might be right -- and whether the per-dollar gains were higher or lower than expected would be interesting to know -- but I just don't have any good information on this! If I'd thought of the possibility, I would have added it in Footnote 23 as another speculation, but I don't think what I said is misleading or wrong.
For what it's worth, in a one year review from Jacob Steinhardt, increased investment isn't mentioned as an explanation for why the forecasts undershot.
10x per year for compute seems high to me. Naïvely I would expect the price/performance of compute to double every 1-2 years as it has been forever, with overall compute available for training big models being a function of that + increasing investment in the space, which could look more like one-time jumps. (I.e. a 10x jump in compute in 2024 may happen because of increased investment, but a 100x increase by 2025 seems unlikely.) But I am somewhat uncertain of this.
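To make the gap between those two growth rates concrete, here is a small sketch. The two-year horizon (2023 to 2025) and the idea that total compute is price/performance multiplied by spending are my simplifying assumptions, not measured figures.

```python
# Compare "price/performance doubles every 1-2 years" with "10x compute per year".
# Assumption: total training compute = price/performance gain * spending increase.

def growth_factor(years: float, doubling_time_years: float) -> float:
    """Multiplicative gain after `years` if a quantity doubles every `doubling_time_years`."""
    return 2 ** (years / doubling_time_years)

years = 2  # e.g. 2023 -> 2025

slow = growth_factor(years, 2.0)   # doubling every 2 years
fast = growth_factor(years, 1.0)   # doubling every year
ten_x_per_year = 10 ** years       # the 10x-per-year claim, compounded

# Extra spending needed to close the gap under each hardware assumption.
spend_needed_fast = ten_x_per_year / fast
spend_needed_slow = ten_x_per_year / slow

print(f"Hardware alone over {years} years: {slow:.0f}x to {fast:.0f}x")
print(f"10x per year over {years} years:   {ten_x_per_year}x")
print(f"Implied spending increase:     {spend_needed_fast:.0f}x to {spend_needed_slow:.0f}x")
```

So under these assumptions, sustaining 10x per year for two years would need something like a 25x to 50x increase in spending on top of hardware improvements, which is why it reads to me more like a one-time jump than a trend.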
For parameters, I definitely think the largest models will keep getting bigger, and for compute to be the big driver of that -- but also I would expect improvements like mixture of experts models to continue, which effectively allow more parameters with less compute (because not all of the parameters are used at all times). Other techniques, like RLHF, also improve the subjective performance of models without increasing their size (i.e. getting them to do useful things rather than only predict what next word is most likely).
I guess my prediction here would be simply that things like this continue, so that with X compute you could get a better model in 2025 than you could in 2023. But you also could have 5x to 50x more compute in 2025, so you get both of those improvements combined!
It's obviously far cheaper to play with smaller models, so I expect lots of improvements will initially appear in models small-for-their-time.
Just my thoughts!
I wrote this late at night, so to clarify and expand a little bit:
- "Work on more than one time scale" I think is actually an interesting idea to dwell on for a second. Like, when a person is trying to solve a problem, they will often pace back and forth, or talk, etc. They don't have to do everything in one pass. Somehow the complex computation which lets them see and move around can work on a very fast time scale, while other problem solving is going on simultaneously, and only starts to affect motor outputs later on. That's interesting. The spinal cord doing processing independent of the brain, which I mentioned, is evident in this older series of (rather horrible) experiments with cats: https://www.jstor.org/stable/24945006
- On the 'smaller models with lower latency', we already see models like Mistral-7B outperforming 30B parameter models because of improvements in data, architecture, and training. I expect this trend to continue. If the largest models are capable of operating a robot out of the box, I think you could take those outputs, and use them to train (or otherwise distill down) the larger model to a more manageable size, more specialised for the task.
- On the 'LLMs could do the parts with higher latency', just yesterday I saw somebody do something like this with GPT-4V, where they periodically uploaded a photograph of what was in front of them, and got GPT-4V to output instructions on how to find the supermarket (walk further forward, turn right, etc). It kind of worked, and that's the sort of thing I was picturing here, leaving much more responsive systems to handle the low latency work, like balance, gripping, etc.
I'm somewhat skeptical that running out of text data will meaningfully slow progress. Today's models are so sample inefficient compared with human brains that I suspect there are significant jumps possible there.
Also, as you say:
- Synthetic text data might well be possible (especially for domains where you can test the quality of the produced text externally, e.g. programming).
- Reinforcement-learning-style virtual environments can also generate data (and not necessarily only physics based environments either -- it could be more like playing games or using a computer).
- And multimodal inputs give us a lot more data too, and I think we've only really scratched the surface of multimodal transformers today.
I am honestly very surprised it became a front page post too! It totally is just speculation.
I tried to be super clear that these were just babbled guesses, and I was mainly just telling people to try to do the same, rather than trusting my starting point here.
The other thing that surprised me is that there haven't been too many comments saying "this part is off", or "you missed trend X!". I was kind of hoping for that!
Agree on lower depth models being possible, a few other possibilities:
- Smaller models with lower latency could be used, possibly distilled down from larger ones.
- Compute improvements might make it practical onboard (like with Tesla's self-driving hardware inside the chest of their android).
- New architectures could work on more than one time scale -- kind of like humans do. E.g. when we walk, not all of the processing is done in the brain. Your spinal cord can handle a tonne of it autonomously. (Will find source tomorrow).
- LLM-type models could do the parts that can accept higher latency, leaving lower level processes to handle themselves. Imagine for a household cleaning robot that an LLM-based agent puts out high level thoughts like "Scan the room for dirty clothes. ... Fold them. ... Put them in the third drawer", and existing low level stuff actually carries out the instructions. That's an exaggerated example, but you get the idea, it doesn't have to replace the PID controller!
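To illustrate the split between a slow, high-latency planner and a fast low-level loop, here is a toy sketch. Everything in it is hypothetical: `slow_planner` is a stand-in for an LLM call that just returns hard-coded waypoints, the "robot" is a single position on a line, and the gains are picked only so the example is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook PID controller; gains here are illustrative only."""
    kp: float
    ki: float = 0.0
    kd: float = 0.0
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float, dt: float) -> float:
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def slow_planner(plan_index: int) -> float:
    """Hypothetical stand-in for a high-latency LLM: returns a target position."""
    waypoints = [1.0, 3.0, 2.0]
    return waypoints[min(plan_index, len(waypoints) - 1)]


def run(total_ticks: int = 300, ticks_per_plan: int = 100, dt: float = 0.01):
    pid = PID(kp=8.0)  # P-only gains keep the toy dynamics simple
    position = 0.0
    target = position
    history = []
    for tick in range(total_ticks):
        # Slow loop: the planner is only consulted once every `ticks_per_plan` ticks.
        if tick % ticks_per_plan == 0:
            target = slow_planner(tick // ticks_per_plan)
        # Fast loop: the controller runs every tick, with no planner involved.
        velocity = pid.step(target - position, dt)
        position += velocity * dt
        history.append((tick, target, position))
    return history


history = run()
for tick, target, position in history[99::100]:
    print(f"tick {tick}: target={target:.1f} position={position:.3f}")
```

The point of the design is that the planner can take as long as it likes between updates, because the fast loop keeps tracking the most recent target on its own.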
I am extremely worried about safety, but I don't know as much about it as I do about what's on the edge of consumer / engineering trends, so I think my predictions here would be not useful to share right now! The main way it relates to my guesses here is if regulation successfully slows down frontier development within a few years (which I would support).
I'm doing the ARENA course async online at the moment, and possibly moving into alignment research in the next year or two, so hoping to be able to chat more intelligently on alignment soonish.
I broadly agree. I think AI tools are already speeding up development today, and on reflection I don't think AI being more capable than humans at modeling the natural world would be a discontinuous point on the ramp up to superintelligence.
It would be a point where AI gets much harder to predict, though, which is probably why it was on my mind when I was trying to come up with predictions.
Thanks, fixed. I did mean 3.5 to 4, not 3 to 4.
Side note -- France isn't a great example for your point here "France, for example, is a very old, well-established and liberal democracy." because the Fifth Republic was only established in 1958. It's also notable for giving the president much stronger executive powers compared with the Fourth Republic!
In the spirit of doing low status things with high potential, I am working on a site to allow commissioning of fringe erotica and am looking to hire a second web developer.
The idea is to build a place where people with niche interests can post bounties for specific stories. In my time moonlighting as an erotic author, I've noticed a lack of good sites to do freelance erotic writing work. I think the reason for this is that most people think porn is icky, so despite there being a huge market for extremely niche content, the platforms currently available are pretty abysmal. This is our opportunity.
We're currently in beta and can pay a junior-level wage, with senior-level equity. If you're a web developer who wants to join a fully remote startup, please reach out.
As with my other startups, I began this project with the goal of generating wealth to put towards alignment research.
Thanks Chris, but I think you linked to the wrong thing there, I can't see your post in the last 3 years of your history either!
Aye, I agree it is not a solution to avoiding power seeking, only that there may be a slightly easier target to hit if we can relax as many constraints on alignment as possible.
Will check them out, thank you.
I like this story pitch! It seems pretty compelling to me, and a clever way to show the difficulty and stakes of alignment. Good luck!
I am curious if this has changed over the past 6 years since you posted this comment. Do you get the feeling that high profile researchers have shifted even further towards Xrisk concern, or do they continue to hold the same views as in 2016? Thanks!
I took the original sentence to mean something like "we use things external to the brain to compute things too", which is clearly true. Writing stuff down to work through a problem is clearly doing some computation outside of the brain, for example. The confusion comes from where you draw the line -- if I'm just wiggling my fingers without holding a pen, does that still count as computing stuff outside the brain? Do you count the spinal cord as part of the brain? What about the peripheral nervous system? What about information that's computed by the outside environment and presented to my eyes? I think it's kind of an arbitrary line, but reading this charitably their statement can still be correct, I think.
(No response from me on the rest of your points, just wanted to back the author up a bit on this one.)
I really enjoyed this writeup! I'd probably even go a little bit on the pessimistic (optimistic?) side, and bet that almost all of this technology would be possible with only a few years of development from today -- though I suppose it might be 20 if development doesn't start/ramp up in earnest.
Thanks!
That's a good point, I'll write up a brief explanation/disclaimer and put it in as a footnote.
Typo corrected, thanks for that.
I agree, it's more likely for the first AGI to begin on a supercomputer at a well-funded institution. If you like, you can imagine that this AGI is not the first, but simply the first not effectively boxed. Maybe its programmer simply implemented a leaked algorithm that was developed and previously run by a large project, but changed the goal and tweaked the safeties.
In any case, it's a story, not a prediction, and I'd defend it as plausible in that context. Any story has a thousand assumptions and events that, in sequence, reduce the probability to infinitesimal. I'm just trying to give a sense of what a takeoff could be like when there is a large hardware overhang and no safety -- both of which have only a small-ish chance of occurring. That in mind, do you have an alternative suggestion for the title?