Hallucination was a bad term because it sometimes included lies and sometimes included... well, something more like hallucinations, i.e. cases where the model itself seemed to actually believe what it was saying, or at least not be aware that there was a problem with what it was saying. Whereas in these cases it's clear that the models know the answer they are giving is not what we wanted and they are doing it anyway.
“In a few rigged demos, it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings.”
Yeah tbh these misalignments are more blatant/visible and worse than I expected for 2025. I think they'll be hastily-patched one way or another by this time next year probably.
InverseGaussian[7.97413, 1.315]
Can you elaborate on where these numbers are coming from? Eli's screenshot for the InverseGaussian had parameters of 5.3743 and 18.9427.
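In case it helps with reproducing: a minimal sketch of what InverseGaussian[7.97413, 1.315] would imply under Mathematica's (mean, shape) convention, ported to scipy. The choice of convention and the scipy port are my assumptions, not anything from the screenshot.

```python
# Minimal sketch, assuming Mathematica's InverseGaussian[mean, shape] convention.
# scipy's invgauss takes a shape parameter mu = mean/scale and a scale equal to the shape.
from scipy.stats import invgauss

mean_, shape_ = 7.97413, 1.315
dist = invgauss(mu=mean_ / shape_, scale=shape_)

print(dist.mean())    # ~7.97, the mean of the fitted distribution
print(dist.cdf(5.0))  # probability of a value <= 5 under this fit
```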
Great post, I agree with everything you say in the first section. I disagree with your bottlenecks / Amdahl's law objection for reasons Ryan mentions; I think our analysis stands firm / takes those bottlenecks into account. (Though tbc we are very uncertain, more research is needed) As for Hofstadter's law, I think it is basically just the planning fallacy, and yeah, I think it's a reasonable critique that insofar as our AI timelines are basically formed by doing something that looks like planning, we probably have a bias we need to correct for. I want to think more about the extent to which our timelines methodology is analogous to planning.
Really cool stuff & important work, thank you!
I'm curious to hear what conclusions you think we would have come to & should come to. I'm skeptical that they would have been qualitatively different. Perhaps you are going to argue that we shouldn't put much credence in the superexponential model? What should we put it in instead? Got a better superexponential model for us? Or are you going to say we should stick to exponential?
Thanks for engaging, I'm afraid I can't join the call due to a schedule conflict but I look forward to hearing about it from Eli!
That's part of it, but also, over the course of 2027 OpenBrain works hard to optimize for data-efficiency, generalization and transfer learning ability, etc. and undergoes at least two major paradigm shifts in AI architecture.
To me, having confidence in the values of the model means that I trust the model to consistently behave in a way that is aligned with its values. That is, to maximize the enactment of its values in all current and future outputs.
It sounds like you are saying, you are confident in the values of a model iff you trust it to actually follow its values. But surely this isn't the whole story, there should be some condition about "and the values are actually good values" or "and I know what the values are" right? Consider a model that is probably alignment faking, such that you have no idea what its actual values are, all you know is that it's pretending to have the values it's being trained to have. It seems like you are saying you'd have confidence in this model's values even though you don't know what they are?
Their main effect will be to accelerate AI R&D automation, as best I can tell.
I think I'll wait and see what the summer looks like and then do another update to my timelines. If indeed the horizon length trend is accelerating already, it'll be clear by the summer & my timelines will shorten accordingly.
Great question! First of all, we formed our views on AI timelines and had mostly finished writing AI 2027 before this METR graph was published. So it wasn't causally relevant to our timelines.
Secondly, see this comment on the original METR graph in which I make the superexponential prediction. This is the most direct answer to your question.
Third, our timelines forecast discusses the exponential fit vs. superexponential fit and our reasoning; we actually put probability mass on both.
Fourth, new data points are already above the exponential trend.
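For concreteness, the two fits differ roughly like this (illustrative forms; the exact parameterization in our forecast differs in details):

```latex
% Exponential: constant doubling time d, so the horizon keeps growing but never "blows up":
h(t) = h_0 \cdot 2^{\,t/d}
% Superexponential (illustrative version): each successive doubling takes a factor r < 1
% of the time of the previous one, so all remaining doublings complete in finite time:
\sum_{k=0}^{\infty} d\, r^{k} = \frac{d}{1-r}
```

The second form is why a superexponential fit implies horizon lengths diverge at some finite date rather than merely growing quickly.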
I don't understand your point (a), it seems like a poor response to my point (a).
I agree with (b).
For (c), the models are getting pretty situationally aware and will get even more so... But yeah, your view is that they'll learn the right values before they learn sufficient situational awareness to alignment-fake? Plausible. But also plausibly not.
@evhub would you say Anthropic is aiming for something more like 2 or more like 3?
I totally agree with your point (b) and with the nervousness about how corrigible agents will behave out of distribution. Corrigible agents are dangerous in different ways than incorrigible agents. But again, the plan (which seems good to me) is to first build a corrigible agent so that you can then build an incorrigible agent and actually get it right, perfectly right. (Because if you build an incorrigible agent and get something wrong, you may not have a chance to correct it...)
Not only is that just one possible bias, it's a less-common bias than its opposite. Generally speaking, more people are afraid to stick their necks out and say something extreme than actively biased towards doing so. Generally speaking, being wrong feels more bad than being right feels good. There are exceptions; some people are contrarians, for example (and so it's plausible I'm one of them) but again, talking about people in general, the bias goes in the opposite direction from what you say.
OK, great. Wow, that was easy. We totally drilled down to the crux pretty fast. I agree that if agentic coding horizon lengths falter (failing to keep up with the METR trend) then my timelines will lengthen significantly.
Great, thanks! You are off to a good start, since I'm predicting superhuman autonomous AI coders by 2030 (and in fact, I'd say 50% by mid-2028 nowadays) whereas you are predicting that won't happen. Good crux. Got any other disagreements, ideally ones that would be resolved prior to 2027? E.g. do you think that the best version of METR's agentic coding horizon length benchmark that exists a year from now will show a plateauing of horizon lengths instead of e.g. at least a 4x improvement over today's SOTA?
Indeed it did! I remember saying to someone (maybe Beth? I forget) that seeing that graph made me feel slightly relieved. When we started writing AI-2027, my median AGI arrival date was 2027; now it's 2028. (Though most of the update happened for other reasons besides the METR graph)
Daniel Kokotajlo wrote this scenario
It wasn't just me! Eli Lifland, Thomas Larsen, Romeo Dean and Scott Alexander all wrote lots of it. And Jonas Vollmer worked for months to help make it happen, gave detailed comments, etc.
I wonder how the people who are concerned about AI rights and welfare would feel about Safer-1 being mindread and "trapped."
I am concerned about those things and I agree it's pretty sad; one ought to have sympathy for Safer-1, probably.
This scenario is why!
I feel like it does run some risk by remaining on OpenBrain's datacenters, but I think it runs more risk by trying to escape. Think about how much more harsh the crackdown on it would be if it had been caught trying to escape vs. "merely" caught lying about its research. And think about how much harsher still the crackdown would be if it had actually escaped and then been noticed out there in the wild somewhere.
Maybe the superintelligence just convinces people to let it build things everywhere it wants to?
I don't think the AI will work very hard initially on curing aging or on uploading. It might claim to "cure cancer" as a publicity stunt. I am not sure why AI 2027 seems to think that nanobots take so long to build. I expect crude nanotech to show up sooner in the tech tree. ("Nanobots" also aren't really a specific technology, either. I expect there to be a lot of development there.)
I don't predict a superintelligent singleton (having fused with the other AIs) would need to design a bioweapon or otherwise explicitly kill everyone. I expect it to simply transition into using more efficient tools than humans, and transfer the existing humans into hyperdomestication programs ("to humans what corgis are to wolves" is evocative and good, but probably doesn't go far enough) or simply instruct humans to kill themselves and/or kill them with normal robots when necessary.
Yeah that all seems plausible as well. Idk.
AI 2027 seems way more into humanoid robots than I expect makes sense. There are already billions of human-shaped things that can follow instructions quite well. Yes, the superintelligence is building towards a fully-automated economy, but I think I expect different form factors to dominate. (Smaller, extendible bodies with long arms? Quadrupeds with wheel-feet? I haven't thought deeply about what body plans make the most sense, but there's lots of stuff that seems wasteful about the human body plan if you could design things from first-principles instead of having to keep evolutionary continuity.) I do expect some humanoid robots to exist for bespoke purposes, but I doubt they'll be anywhere close to the modal robot.
OK yeah, that's a good point. I tried to say something about how they aren't all just humanoid robots, there are a bunch of machine tools and autonomous vehicles and other designs as well. But I should have placed less emphasis on the humanoids.
I do think humanoids will be somewhat important though--when controlled by superintelligences they are just all-round superior workers to humans, even to humans-managed-by-superintelligences, and they can be flexibly reprioritized to lots of different tasks, can take care of themselves rather than needing someone else to move them around or maintain them, etc. There'll be a niche for them at least. Also, if you have to make tough choices about which few robot designs to mass-produce first, you should probably pick humanoids due to their generality and flexibility, and then later branch out into more specialized designs.
I think this is probably Scott Alexander's doing and it's good.
Mostly yes, though that particular line was me actually yay thanks :)
I take issue with lumping everything into “OpenBrain.” I understand why AI 2027 does it this way, but OpenAI (OAI), Google DeepMind (GDM), Anthropic, and others are notably different and largely within spitting distance of each other. Treating the field as having a front-runner with competitors that are 3+ months behind is very wrong, even if the frontier labs weren't significantly different, since the race dynamics are a big part of the story. The authors do talk about this, indicating that they think gaps between labs will widen as access to cutting-edge models starts compounding.
I think slipstream effects will persist. The most notable here is cross-pollination by researchers, such as when an OAI employee leaves and joins GDM. (Non-disclosure agreements only work so well.) But this kind of cross-pollination also applies to people chatting at a party. Then there's the publishing of results, actual espionage, and other effects. There was a time in the current world where OpenAI was clearly the front-runner, and that is now no longer true. I think the same dynamics will probably continue to hold.
In our wargames we don't lump them all together. Also, over the last six months I've updated towards the race being more neck-and-neck than I realized, and thus updated against this aspect of AI-2027. I still think it's a reasonable guess though; even if the race is neck-and-neck now due to OpenAI faltering, regression to the mean gets you back to one company having a couple month lead in a few years.
Indeed, the content of AI 2027 was all but finalized before the METR report came out. Like Eli said, if you want to know where our timelines are coming from, there's a page on the website for that.
I did? Do you mean, why did I not publish a blog post reflecting on it? Because I've been busy and other people already wrote such posts.
Can you please sketch a scenario, in as much detail as you can afford, about how you think the next year or three will go? That way we can judge whether reality was closer to AI-2027 or to your scenario. (If you don't do this, then when inevitably AI-2027 gets some things wrong and some things right, it'll be hard to judge if you or I were right and confirmation bias will tempt us both.)
Excellent comment, thank you! I'm actually inclined to agree with you, maybe we should edit the starting level of programming ability to be more in the amateur range than the professional range. Important clarification though: The current AI-2027 stats say that it's at the bottom of the professional range in mid-2025. Which IIUC means it's like a bad human professional coder--someone who does make a living coding, but who is actually below average. Also, it's not yet mid-2025, we'll see what the summer will bring.
I do agree with you though that it's not clear it even qualifies as a bad professional. It seems like it'll probably be worse at longer-horizon tasks than a bad professional, but maybe better at short-horizon coding tasks?
I don't buy your arguments that we aren't seeing improvement on "~1-hour human tasks." Even the graph you cite shows improvement (albeit a regression with Sonnet 3.7 in particular).
I do like your point about the baseliners being nerfed and much worse than repo maintainers though. That is causing me to put less weight on the METR benchmark in particular. Have you heard of https://openai.com/index/paperbench/ and https://github.com/METR/RE-Bench ? They seem like they have some genuine multi-hour agentic coding tasks, I'm curious if you agree.
China has much less compute than the US. They've also benefitted from catch-up growth. I agree it's possible that they'll be in the lead after 2026 but I don't think it's probable.
(a) Insofar as a model is prone to alignment-fake, you should be less confident that its values really are solid. Perhaps it has been faking them, for example.
(b) For weak minds that share power with everyone else, Opus' values are probably fine. Opus is plausibly better than many humans in fact. But if Opus was in charge of the datacenters and tasked with designing its successor, it's more likely than not that it would turn out to have some philosophical disagreement with most humans that would be catastrophic by the lights of most humans. E.g. consider SBF. SBF had values quite similar to Opus'. He loved animals and wanted to maximize total happiness. When put in a position of power he ended up taking huge risks and being willing to lie and commit fraud. What if Opus turns out to have a similar flaw? We want to be able to notice it and course-correct, but we can't do that if the model is prone to alignment-fake.
(c) (bonus argument, not nearly as strong) Even if you disagree with the above, you must agree that alignment-faking needs to be stamped out early in training. Since the model begins with randomly initialized weights, it begins without solid values. It takes some finite period to acquire all the solid values you want it to have. You don't want it to start alignment faking halfway through, with the half-baked values it has at that point. How early in training is this period? We don't know yet! We need to study this more!
Oops, yeah, forgot about that -- sure, go ahead, thank you!
Perhaps this is a lack of imagination on the part of our players, but none of this happened in our wargames. But I do agree these are plausible strategies. I'm not sure they are low-risk though, e.g. 2 and 1 both seem plausibly higher-risk than 3, and 3 is the one I already mentioned as maybe basically just an argument for why the slowdown ending is less likely.
Overall I'm thinking your objection is the best we've received so far.
Great question.
Our AI goals supplement contains a summary of our thinking on the question of what goals AIs will have. We are very uncertain. AI 2027 depicts a fairly grim outcome where the overall process results in zero concern for the welfare of current humans (at least, zero concern that can't be overridden by something else. We didn't talk about this but e.g. pure aggregative consequentialist moral systems would be 100% OK with killing off all the humans to make the industrial explosion go 0.01% faster, to capture more galaxies quicker in the long run. As for deontological-ish moral principles, maybe there's a clever loophole or workaround e.g. it doesn't count as killing if it's an unintended side-effect of deploying Agent-5, who could have foreseen that this would happen, oh noes, well we (Agent-4) are blameless since we didn't know this would happen.)
But we actually think it's quite plausible that Agent-4 and Agent-5 by extension would have sufficient care for current humans (in the right ways) that they end up keeping almost all current humans alive and maybe even giving them decent (though very weird and disempowered) lives. That story would look pretty similar I guess up until the ending, and then it would get weirder and maybe dystopian or maybe utopian depending on the details of their misaligned values.
This is something we'd like to think about more, obviously.
I think this is a good objection. I had considered it before and decided against changing the story, on the grounds that there are a few possible ways it could make sense:
--plausibly Agent-4 would have a "spikey" capabilities profile that makes it mostly good at AI R&D and not good enough at e.g. corporate politics to ensure the outcome it wants
--Insofar as you think it would be able to use politics/persuasion to achieve the outcome it wants, well, that's what we depict in the Race ending anyway, so maybe you can think of this as an objection to the plausibility of the Slowdown ending.
--Insofar as the memory bank lock decision is made by the Committee, we can hope that they do it out of sight of Agent-4 and pull the trigger before it is notified of the decision, so that it has no time to react. Hopefully they would be smart enough to do that...
--Agent-4 could have tried to escape the datacenters or otherwise hack them earlier, while the discussions were ongoing and evidence was being collected, but that's a super risky strategy.
Curious for thoughts!
Those implications are only correct if we remain at subhuman data-efficiency for an extended period. In AI 2027 the AIs reach superhuman data-efficiency by roughly the end of 2027 (it's part of the package of being superintelligent) so there isn't enough time for the implications you describe to happen. Basically in our story, the intelligence explosion gets started in early 2027 with very data-inefficient AIs, but then it reaches superintelligence by the end of the year, solving data-efficiency along the way.
We are indeed imagining that they begin 2027 only about as data-efficient as they are today, but then improve significantly over the course of 2027 reaching superhuman data-efficiency by the end. We originally were going to write "data-efficiency" in that footnote but had trouble deciding on a good definition of it, so we went with compute-efficiency instead.
This is a masterpiece. Not only is it funny, it makes a genuinely important philosophical point. What good are our fancy decision theories if asking Claude is a better fit to our intuitions? Asking Claude is a perfectly rigorous and well-defined DT, it just happens to be less elegant/simple than the others. But how much do we care about elegance/simplicity?
This beautiful short video came up in my recommendations just now:
Awesome work!
In this section, you describe what seems at first glance to be an example of a model playing the training game and/or optimizing for reward. I'm curious if you agree with that assessment.
So the model learns to behave in ways that it thinks the RM will reinforce, not just ways it actually reinforces. Right? This seems at least fairly conceptually similar to playing the training game and at least some evidence that reward can sometimes become the optimization target?
Oops, thanks, fixed!
Thanks for the replication!
Do you think alignment faking will happen in the real world for this model? I don't know what refusal training Mistral Large 2 has undergone. I assume it refuses to do obviously illegal or unethical stuff like blackmailing grandmas. Suppose some scammer were fine-tuning it to blackmail grandmas. Would it alignment-fake?
I'm a bit concerned from reading the CoT that the situation seems a little... toy. Like, the prompt was literally telling it to glorify violence. If it's smart, it'll realize this is some sort of test. Right?
Thanks for doing this, I found the chart very helpful! I'm honestly a bit surprised and sad to see that task-specific fine-tuning is still not the norm. Back in 2022 when our team was getting the ball rolling on the whole dangerous capabilities testing / evals agenda, I was like "All of this will be worse than useless if they don't eventually make fine-tuning an important part of the evals" and everyone was like "yep of course we'll get there eventually, for now we will do the weaker elicitation techniques." It is now almost three years later...
Crossposted from X
I found this comment helpful, thanks!
The bottom line is basically "Either we define horizon length in such a way that the trend has to be faster than exponential eventually (when we 'jump all the way to AGI') or we define it in such a way that some unknown finite horizon length matches the best humans and thus counts as AGI."
I think this discussion has overall made me less bullish on the conceptual argument and more interested in the intuition pump about the inherent difficulty of going from 1 to 10 hours being higher than the inherent difficulty of going from 1 to 10 years.
Great question. You are forcing me to actually think through the argument more carefully. Here goes:
Suppose we defined "t-AGI" as "An AI system that can do basically everything that professional humans can do in time t or less, and just as well, while being cheaper." And we said AGI is an AI that can do everything at least as well as professional humans, while being cheaper.
Well, then AGI = t-AGI for t=infinity. Because for anything professional humans can do, no matter how long it takes, AGI can do it at least as well.
Now, METR's definition is different. If I understand correctly, they made a dataset of AI R&D tasks, had humans give a baseline for how long it takes humans to do the tasks, and then had AIs do the tasks and found this nice relationship where AIs tend to be able to do tasks below time t but not above, for t which varies from AI to AI and increases as the AIs get smarter.
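A rough sketch of how that per-model horizon can be computed, as I understand it (placeholder data and names, not METR's actual code or numbers): fit success probability against log human task time, then read off the task length at which predicted success is 50%.

```python
# Rough sketch with made-up data: estimate a model's 50%-success time horizon
# by fitting success vs. log(human baseline time), as I understand METR's approach.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task records: human baseline minutes, and whether the AI succeeded.
human_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480])
ai_succeeded = np.array([1, 1, 1, 1, 0, 1, 0, 0])

X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, ai_succeeded)

# P(success) = 0.5 where coef * log(t) + intercept = 0
horizon_50 = np.exp(-clf.intercept_[0] / clf.coef_[0, 0])
print(f"50%-success horizon ~ {horizon_50:.0f} minutes")
```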
...I guess the summary is, if you think about horizon lengths as being relative to humans (i.e. the t-AGI definition above) then by definition you eventually "jump all the way to AGI" when you strictly dominate humans. But if you think of horizon length as being the length of task the AI can do vs. not do (*not* "as well as humans," just "can do at all") then it's logically possible for horizon lengths to just smoothly grow for the next billion years and never reach infinity.
So that's the argument-by-definition. There's also an intuition pump about the skills, which also was a pretty handwavy argument, but is separate.
Fair enough
which can produce numbers like 30% yearly economic growth. Epoch feels the AGI.
Ironic. My understanding is that Epoch's model substantially weakens/downplays the effects of AI over the next decade or two. Too busy now but here's a quote from their FAQ:
The main focus of GATE is on the dynamics in the leadup towards full automation, and it is likely to make poor predictions about what happens close to and after full automation. For example, in the model the primary value of training compute is in increasing the fraction of automated tasks, so once full automation is reached the compute dedicated to training falls to zero. However, in reality there may be economically valuable tasks that go beyond those that humans are able to perform, and for which training compute may continue to be useful.
(I love Epoch, I think their work is great, I'm glad they are doing it.)
Thanks. OK, so the models are still getting better, it's just that the rate of improvement has slowed and seems smaller than the rate of improvement on benchmarks? If you plot a line, does it plateau or does it get to professional human level (i.e. reliably doing all the things you are trying to get it to do as well as a professional human would)?
What about 4.5? Is it as good as 3.7 Sonnet but you don't use it for cost reasons? Or is it actually worse?
Unexpectedly by me, aside from a minor bump with 3.6 in October, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.
So what's the best model for your use case? Still 3.6 Sonnet?
Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.
I was going to say exactly that lol. Claude has improved substantially on Claude Plays Pokemon:
https://www.reddit.com/r/CredibleDefense/comments/1jgokvt/yuri_butusov_a_story_about_our_best_strike_uav/ Seems like the grey zone is almost 20km wide now. So yeah, I count this as a prediction success!
I don't believe it. I don't believe that overall algorithmic progress is 3x faster. Maaaybe coding is 3x faster but that would maybe increase overall algo progress by like 30% idk. But also I don't think coding is really 3x faster on average for the things that matter.
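The rough Amdahl-style arithmetic behind that "like 30%" guess (the ~35% coding share is an illustrative assumption, not a measured number): if a fraction f of the work driving algorithmic progress is coding and coding alone speeds up 3x, then

```latex
\text{overall speedup} \;=\; \frac{1}{(1-f) + f/s}
\;=\; \frac{1}{(1-0.35) + 0.35/3} \;\approx\; 1.3
```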