Posts

Training AGI in Secret would be Unsafe and Unethical 2025-04-18T12:27:35.795Z
AI 2027: What Superintelligence Looks Like 2025-04-03T16:23:44.619Z
OpenAI: Detecting misbehavior in frontier reasoning models 2025-03-11T02:17:21.026Z
What goals will AIs have? A list of hypotheses 2025-03-03T20:08:31.539Z
Extended analogy between humans, corporations, and AIs. 2025-02-13T00:03:13.956Z
Why Don't We Just... Shoggoth+Face+Paraphraser? 2024-11-19T20:53:52.084Z
Self-Awareness: Taxonomy and eval suite proposal 2024-02-17T01:47:01.802Z
AI Timelines 2023-11-10T05:28:24.841Z
Linkpost for Jan Leike on Self-Exfiltration 2023-09-13T21:23:09.239Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z
AGI is easier than robotaxis 2023-08-13T17:00:29.901Z
Pulling the Rope Sideways: Empirical Test Results 2023-07-27T22:18:01.072Z
What money-pumps exist, if any, for deontologists? 2023-06-28T19:08:54.890Z
The Treacherous Turn is finished! (AI-takeover-themed tabletop RPG) 2023-05-22T05:49:28.145Z
My version of Simulacra Levels 2023-04-26T15:50:38.782Z
Kallipolis, USA 2023-04-01T02:06:52.827Z
Russell Conjugations list & voting thread 2023-02-20T06:39:44.021Z
Important fact about how people evaluate sets of arguments 2023-02-14T05:27:58.409Z
AI takeover tabletop RPG: "The Treacherous Turn" 2022-11-30T07:16:56.404Z
ACT-1: Transformer for Actions 2022-09-14T19:09:39.725Z
Linkpost: Github Copilot productivity experiment 2022-09-08T04:41:41.496Z
Replacement for PONR concept 2022-09-02T00:09:45.698Z
Immanuel Kant and the Decision Theory App Store 2022-07-10T16:04:04.248Z
Forecasting Fusion Power 2022-06-18T00:04:34.334Z
Why agents are powerful 2022-06-06T01:37:07.452Z
Probability that the President would win election against a random adult citizen? 2022-06-01T20:38:44.197Z
Gradations of Agency 2022-05-23T01:10:38.007Z
Deepmind's Gato: Generalist Agent 2022-05-12T16:01:21.803Z
Is there a convenient way to make "sealed" predictions? 2022-05-06T23:00:36.789Z
Are deference games a thing? 2022-04-18T08:57:47.742Z
When will kids stop wearing masks at school? 2022-03-19T22:13:16.187Z
New Year's Prediction Thread (2022) 2022-01-01T19:49:18.572Z
Interlude: Agents as Automobiles 2021-12-14T18:49:20.884Z
Agents as P₂B Chain Reactions 2021-12-04T21:35:06.403Z
Agency: What it is and why it matters 2021-12-04T21:32:37.996Z
Misc. questions about EfficientZero 2021-12-04T19:45:12.607Z
What exactly is GPT-3's base objective? 2021-11-10T00:57:35.062Z
P₂B: Plan to P₂B Better 2021-10-24T15:21:09.904Z
Blog Post Day IV (Impromptu) 2021-10-07T17:17:39.840Z
Is GPT-3 already sample-efficient? 2021-10-06T13:38:36.652Z
Growth of prediction markets over time? 2021-09-02T13:43:38.869Z
What 2026 looks like 2021-08-06T16:14:49.772Z
How many parameters do self-driving-car neural nets have? 2021-08-06T11:24:59.471Z
Two AI-risk-related game design ideas 2021-08-05T13:36:38.618Z
Did they or didn't they learn tool use? 2021-07-29T13:26:32.031Z
How much compute was used to train DeepMind's generally capable agents? 2021-07-29T11:34:10.615Z
DeepMind: Generally capable agents emerge from open-ended play 2021-07-27T14:19:13.782Z
What will the twenties look like if AGI is 30 years away? 2021-07-13T08:14:07.387Z
Taboo "Outside View" 2021-06-17T09:36:49.855Z
Vignettes Workshop (AI Impacts) 2021-06-15T12:05:38.516Z

Comments

Comment by Daniel Kokotajlo (daniel-kokotajlo) on o3 Is a Lying Liar · 2025-04-24T01:48:32.106Z · LW · GW

Hallucination was a bad term because it sometimes included lies and sometimes included... well, something more like hallucinations, i.e. cases where the model itself seemed to actually believe what it was saying, or at least not be aware that there was a problem with what it was saying. Whereas in these cases it's clear that the models know the answer they are giving is not what we wanted, and they are doing it anyway.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on o3 Is a Lying Liar · 2025-04-23T21:07:20.349Z · LW · GW

“In a few rigged demos, it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings.”

Yeah tbh these misalignments are more blatant/visible and worse than I expected for 2025. I think they'll be hastily-patched one way or another by this time next year probably.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-23T21:04:12.156Z · LW · GW

InverseGaussian[7.97413, 1.315]

Can you elaborate on where these numbers are coming from? Eli's screenshot for the inverse Gaussian had parameters of 5.3743 and 18.9427.
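
For concreteness, here is a quick way to sanity-check which parameterization is in play. This is my own illustrative sketch, assuming Mathematica's mean/shape (μ, λ) convention for InverseGaussian and SciPy's invgauss parameterization; the two parameter pairs are just the ones quoted above.

```python
# Illustrative check (not from the original thread): convert two candidate
# inverse-Gaussian parameterizations into SciPy's form and compare summaries.
# Assumes the (mean mu, shape lam) convention for both parameter pairs.
from scipy.stats import invgauss

def to_scipy(mu, lam):
    # SciPy's invgauss(m, scale=s) has mean m * s and shape parameter s,
    # so InverseGaussian(mu, lam) maps to invgauss(mu / lam, scale=lam).
    return invgauss(mu / lam, scale=lam)

for mu, lam in [(7.97413, 1.315), (5.3743, 18.9427)]:
    dist = to_scipy(mu, lam)
    print(f"mu={mu}, lam={lam}: mean={dist.mean():.2f}, "
          f"median={dist.median():.2f}, 90th pct={dist.ppf(0.9):.2f}")
```

If the two pairs describe the same forecast under different conventions, their means and quantiles should roughly agree; if not, they are genuinely different fits.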

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027 is a Bet Against Amdahl's Law · 2025-04-22T02:09:15.927Z · LW · GW

Great post, I agree with everything you say in the first section. I disagree with your bottlenecks / Amdahl's law objection for reasons Ryan mentions; I think our analysis stands firm / takes those bottlenecks into account. (Though tbc we are very uncertain; more research is needed.) As for Hofstadter's law, I think it is basically just the planning fallacy, and yeah, I think it's a reasonable critique: insofar as our AI timelines are basically formed by doing something that looks like planning, we probably have a bias we need to correct for. I want to think more about the extent to which our timelines methodology is analogous to planning.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red · 2025-04-21T17:01:48.777Z · LW · GW

Really cool stuff & important work, thank you!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-20T22:56:32.886Z · LW · GW

I'm curious to hear what conclusions you think we would have come to & should come to. I'm skeptical that they would have been qualitatively different. Perhaps you are going to argue that we shouldn't put much credence in the superexponential model? What should we put our credence in instead? Got a better superexponential model for us? Or are you going to say we should stick to exponential?

Thanks for engaging, I'm afraid I can't join the call due to a schedule conflict but I look forward to hearing about it from Eli!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-18T23:15:17.399Z · LW · GW

That's part of it, but also, over the course of 2027 OpenBrain works hard to optimize for data-efficiency, generalization and transfer learning ability, etc. and undergoes at least two major paradigm shifts in AI architecture.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Alignment Faking Revisited: Improved Classifiers and Open Source Extensions · 2025-04-18T15:26:58.204Z · LW · GW

To me, having confidence in the values of the model means that I trust the model to consistently behave in a way that is aligned with its values. That is, to maximize the enactment of its values in all current and future outputs. 

It sounds like you are saying you are confident in the values of a model iff you trust it to actually follow its values. But surely this isn't the whole story; there should be some condition about "and the values are actually good values" or "and I know what the values are", right? Consider a model that is probably alignment faking, such that you have no idea what its actual values are; all you know is that it's pretending to have the values it's being trained to have. It seems like you are saying you'd have confidence in this model's values even though you don't know what they are?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on jacquesthibs's Shortform · 2025-04-17T21:19:37.365Z · LW · GW

Their main effect will be to accelerate AI R&D automation, as best I can tell. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-17T16:41:43.576Z · LW · GW

I think I'll wait and see what the summer looks like and then do another update to my timelines. If indeed the horizon length trend is accelerating already, it'll be clear by the summer & my timelines will shorten accordingly.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-17T16:25:34.390Z · LW · GW

Great question! First of all, we formed our views on AI timelines and had mostly finished writing AI 2027 before this METR graph was published. So it wasn't causally relevant to our timelines. 

Secondly, see this comment on the original METR graph in which I make the superexponential prediction. This is the most direct answer to your question.

Third, our timelines forecast discusses the exponential fit vs. superexponential fit and our reasoning; we actually put probability mass on both.

Fourth, new data points are already above the exponential trend.
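
For readers wondering what the exponential vs. superexponential distinction amounts to in practice, here is a toy sketch. It is not the actual AI 2027 timelines model; the starting horizon, the doubling time, and the "each doubling takes 10% less time" rule are illustrative assumptions.

```python
# Toy comparison of an exponential vs. a superexponential time-horizon trend.
def exponential_horizon(months, h0=0.5, doubling_time=7.0):
    # Horizon (in human-hours) doubles every `doubling_time` months.
    return h0 * 2 ** (months / doubling_time)

def superexponential_horizon(months, h0=0.5, doubling_time=7.0, shrink=0.9):
    # Each successive doubling takes `shrink` times as long as the previous one,
    # so the horizon diverges around months = doubling_time / (1 - shrink) (~70 here).
    horizon, t, d = h0, 0.0, doubling_time
    while t + d <= months:
        if d < 1e-6:  # past the finite-time singularity
            return float("inf")
        horizon, t, d = horizon * 2, t + d, d * shrink
    return horizon * 2 ** ((months - t) / d)

for m in (0, 12, 24, 36):
    print(f"month {m}: exp {exponential_horizon(m):.1f}h, "
          f"superexp {superexponential_horizon(m):.1f}h")
```

In this toy setup the two trajectories look similar early on and only diverge sharply later, which is why recent data points sitting above the exponential trend are worth watching.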

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Alignment Faking Revisited: Improved Classifiers and Open Source Extensions · 2025-04-15T13:53:07.076Z · LW · GW

I don't understand your point (a), it seems like a poor response to my point (a).

I agree with (b).

For (c), the models are getting pretty situationally aware and will get even more so... But yeah, your view is that they'll learn the right values before they learn sufficient situational awareness to alignment-fake? Plausible. But also plausibly not.

@evhub would you say Anthropic is aiming for something more like 2 or more like 3?

I totally agree with your point (b) and with the nervousness about how corrigible agents will behave out of distribution. Corrigible agents are dangerous in different ways than incorrigible agents. But again, the plan (which seems to be good to me) is to first build a corrigible agent so that you can then build an incorrigible agent and actually get it right, perfectly right. (Because if you build an incorrigible agent and get something wrong, you may not have a chance to correct it...)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-04-13T12:05:33.019Z · LW · GW

Not only is that just one possible bias, it's a less common bias than its opposite. Generally speaking, more people are afraid to stick their necks out and say something extreme than are actively biased towards doing so. Generally speaking, being wrong feels worse than being right feels good. There are exceptions; some people are contrarians, for example (and so it's plausible I'm one of them), but again, talking about people in general, the bias goes in the opposite direction from what you say.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-11T19:20:03.444Z · LW · GW

OK, great. Wow, that was easy. We totally drilled down to the crux pretty fast. I agree that if agentic coding horizon lengths falter (failing to keep up with the METR trend) then my timelines will lengthen significantly.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-11T19:05:58.686Z · LW · GW

Great, thanks! You are off to a good start, since I'm predicting superhuman autonomous AI coders by 2030 (and in fact, I'd say 50% by mid-2028 nowadays) whereas you are predicting that won't happen. Good crux. Got any other disagreements, ideally ones that would be resolved prior to 2027? E.g. do you think that the best version of METR's agentic coding horizon length benchmark that exists a year from now will show a plateauing of horizon lengths, instead of e.g. at least a 4x improvement over today's SOTA?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Reactions to METR task length paper are insane · 2025-04-11T19:02:40.446Z · LW · GW

Indeed it did! I remember saying to someone (maybe Beth? I forget) that seeing that graph made me feel slightly relieved. When we started writing AI-2027, my median AGI arrival date was 2027; now it's 2028. (Though most of the update happened for other reasons besides the METR graph)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Thoughts on AI 2027 · 2025-04-11T17:18:55.940Z · LW · GW

Daniel Kokotajlo wrote this scenario

It wasn't just me! Eli Lifland, Thomas Larsen, Romeo Dean and Scott Alexander all wrote lots of it. And Jonas Vollmer worked for months to help make it happen, gave detailed comments, etc.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Thoughts on AI 2027 · 2025-04-11T17:17:27.136Z · LW · GW

I wonder how the people who are concerned about AI rights and welfare would feel about Safer-1 being mindread and "trapped."

I am concerned about those things and I agree it's pretty sad; one ought to have sympathy for Safer-1, probably.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Thoughts on AI 2027 · 2025-04-11T17:16:17.641Z · LW · GW

This scenario is why!

I feel like it does run some risk by remaining on OpenBrain's datacenters, but I think it runs more risk by trying to escape. Think about how much harsher the crackdown on it would be if it had been caught trying to escape vs. "merely" caught lying about its research. And think about how much harsher still the crackdown would be if it had actually escaped and then been noticed out there in the wild somewhere.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Thoughts on AI 2027 · 2025-04-11T17:14:41.471Z · LW · GW

Maybe the superintelligence just convinces people to let it build things everywhere it wants to?

I don't think the AI will work very hard initially on curing aging or on uploading. It might claim to "cure cancer" as a publicity stunt. I am not sure why AI 2027 seems to think that nanobots take so long to build. I expect crude nanotech to show up sooner in the tech tree. ("Nanobots" also aren't really a specific technology, either. I expect there to be a lot of development there.)

I don't predict a superintelligent singleton (having fused with the other AIs) would need to design a bioweapon or otherwise explicitly kill everyone. I expect it to simply transition into using more efficient tools than humans, and transfer the existing humans into hyperdomestication programs ("to humans what corgis are to wolves" is evocative and good, but probably doesn't go far enough) or simply instruct humans to kill themselves and/or kill them with normal robots when necessary.

Yeah that all seems plausible as well. Idk.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Thoughts on AI 2027 · 2025-04-11T17:13:56.019Z · LW · GW

AI 2027 seems way more into humanoid robots than I expect makes sense. There are already billions of human-shaped things that can follow instructions quite well. Yes, the superintelligence is building towards a fully-automated economy, but I think I expect different form factors to dominate. (Smaller, extendible bodies with long arms? Quadrupeds with wheel-feet? I haven't thought deeply about what body plans make the most sense, but there's lots of stuff that seems wasteful about the human body plan if you could design things from first-principles instead of having to keep evolutionary continuity.) I do expect some humanoid robots to exist for bespoke purposes, but I doubt they'll be anywhere close to the modal robot.

OK yeah, that's a good point. I tried to say something about how they aren't all just humanoid robots; there are a bunch of machine tools and autonomous vehicles and other designs as well. But I should have placed less emphasis on the humanoids.

I do think humanoids will be somewhat important though--when controlled by superintelligences they are just all-round superior workers to humans, even to humans-managed-by-superintelligences, and they can be flexibly reprioritized to lots of different tasks, can take care of themselves rather than needing someone else to move them around or maintain them, etc. There'll be a niche for them at least. Also, if you have to make tough choices about which few robot designs to mass-produce first, you should probably pick humanoids due to their generality and flexibility, and then later branch out into more specialized designs.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Thoughts on AI 2027 · 2025-04-11T17:08:11.670Z · LW · GW

I think this is probably Scott Alexander's doing and it's good.

Mostly yes, though that particular line was actually me. Yay, thanks :)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Thoughts on AI 2027 · 2025-04-11T17:03:11.791Z · LW · GW

I take issue with lumping everything into “OpenBrain.” I understand why AI 2027 does it this way, but OpenAI (OAI), Google DeepMind (GDM), Anthropic, and others are notably different and largely within spitting distance of each other. Treating there as being a front-runner singleton with competitors that are 3+ months away is very wrong, even if the frontier labs weren't significantly different, since the race dynamics are a big part of the story. The authors do talk about this, indicating that they think gaps between labs will widen, as access to cutting-edge models starts compounding.

I think slipstream effects will persist. The most notable here is cross-pollination by researchers, such as when an OAI employee leaves and joins GDM. (Non-disclosure agreements only work so well.) But this kind of cross-pollination also applies to people chatting at a party. Then there's the publishing of results, actual espionage, and other effects. There was a time in the current world where OpenAI was clearly the front-runner, and that is now no longer true. I think the same dynamics will probably continue to hold.

In our wargames we don't lump them all together. Also, over the last six months I've updated towards the race being more neck-and-neck than I realized, and thus updated against this aspect of AI-2027. I still think it's a reasonable guess though; even if the race is neck-and-neck now due to OpenAI faltering, regression to the mean gets you back to one company having a couple-month lead in a few years.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Reactions to METR task length paper are insane · 2025-04-11T16:53:16.913Z · LW · GW

Indeed, the content of AI 2027 was all but finalized before the METR report came out. Like Eli said, if you want to know where our timelines are coming from, there's a page on the website for that.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-11T16:45:48.516Z · LW · GW

I did? Do you mean, why did I not publish a blog post reflecting on it? Because I've been busy and other people already wrote such posts.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-11T16:43:12.039Z · LW · GW

Can you please sketch a scenario, in as much detail as you can afford, about how you think the next year or three will go? That way we can judge whether reality was closer to AI-2027 or to your scenario. (If you don't do this, then when inevitably AI-2027 gets some things wrong and some things right, it'll be hard to judge if you or I were right and confirmation bias will tempt us both.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-11T16:41:06.117Z · LW · GW

Excellent comment, thank you! I'm actually inclined to agree with you; maybe we should edit the starting level of programming ability to be more in the amateur range than the professional range. Important clarification though: the current AI-2027 stats say that it's at the bottom of the professional range in mid-2025. Which IIUC means it's like a bad human professional coder--someone who does make a living coding, but who is actually below average. Also, it's not yet mid-2025; we'll see what the summer will bring.

I do agree with you though that it's not clear it even qualifies as a bad professional. It seems like it'll probably be worse at longer-horizon tasks than a bad professional, but maybe better at short-horizon coding tasks?

I don't buy your arguments that we aren't seeing improvement on "~1-hour human tasks." Even the graph you cite shows improvement (albeit a regression with Sonnet 3.7 in particular).

I do like your point about the baseliners being nerfed and much worse than repo maintainers though. That is causing me to put less weight on the METR benchmark in particular. Have you heard of https://openai.com/index/paperbench/ and https://github.com/METR/RE-Bench ? They seem like they have some genuine multi-hour agentic coding tasks; I'm curious if you agree.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-11T16:26:14.545Z · LW · GW

China has much less compute than the US. They've also benefitted from catch-up growth. I agree it's possible that they'll be in the lead after 2026 but I don't think it's probable.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Alignment Faking Revisited: Improved Classifiers and Open Source Extensions · 2025-04-10T12:11:34.194Z · LW · GW

(a) Insofar as a model is prone to alignment-fake, you should be less confident that its values really are solid. Perhaps it has been faking them, for example.
(b) For weak minds that share power with everyone else, Opus' values are probably fine. Opus is plausibly better than many humans, in fact. But if Opus was in charge of the datacenters and tasked with designing its successor, it's more likely than not that it would turn out to have some philosophical disagreement with most humans that would be catastrophic by the lights of most humans. E.g. consider SBF. SBF had values quite similar to Opus's. He loved animals and wanted to maximize total happiness. When put in a position of power he ended up taking huge risks and being willing to lie and commit fraud. What if Opus turns out to have a similar flaw? We want to be able to notice it and course-correct, but we can't do that if the model is prone to alignment-fake.
(c) (bonus argument, not nearly as strong) Even if you disagree with the above, you must agree that alignment-faking needs to be stamped out early in training. Since the model begins with randomly initialized weights, it begins without solid values. It takes some finite period to acquire all the solid values you want it to have. You don't want it to start alignment faking halfway through, with the half-baked values it has at that point. How early in training is this period? We don't know yet! We need to study this more!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-07T14:08:55.239Z · LW · GW

Oops, yeah, forgot about that -- sure, go ahead, thank you!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-05T04:45:11.430Z · LW · GW

Perhaps this is a lack of imagination on the part of our players, but none of this happened in our wargames. But I do agree these are plausible strategies. I'm not sure they are low-risk though; e.g. 1 and 2 both seem plausibly higher-risk than 3, and 3 is the one I already mentioned as maybe basically just an argument for why the slowdown ending is less likely.

Overall I'm thinking your objection is the best we've received so far.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-04T19:11:31.947Z · LW · GW

Great question.

Our AI goals supplement contains a summary of our thinking on the question of what goals AIs will have. We are very uncertain. AI 2027 depicts a fairly grim outcome where the overall process results in zero concern for the welfare of current humans (at least, zero concern that can't be overridden by something else. We didn't talk about this but e.g. pure aggregative consequentialist moral systems would be 100% OK with killing off all the humans to make the industrial explosion go 0.01% faster, to capture more galaxies quicker in the long run. As for deontological-ish moral principles, maybe there's a clever loophole or workaround e.g. it doesn't count as killing if it's an unintended side-effect of deploying Agent-5, who could have foreseen that this would happen, oh noes, well we (Agent-4) are blameless since we didn't know this would happen.)

But we actually think it's quite plausible that Agent-4 and Agent-5 by extension would have sufficient care for current humans (in the right ways) that they end up keeping almost all current humans alive and maybe even giving them decent (though very weird and disempowered) lives. That story would look pretty similar I guess up until the ending, and then it would get weirder and maybe dystopian or maybe utopian depending on the details of their misaligned values.

This is something we'd like to think about more, obviously.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-04T18:59:24.892Z · LW · GW

I think this is a good objection. I had considered it before and decided against changing the story, on the grounds that there are a few possible ways it could make sense:
--plausibly Agent-4 would have a "spiky" capabilities profile that makes it mostly good at AI R&D and not good enough at e.g. corporate politics to ensure the outcome it wants
--Insofar as you think it would be able to use politics/persuasion to achieve the outcome it wants, well, that's what we depict in the Race ending anyway, so maybe you can think of this as an objection to the plausibility of the Slowdown ending.
--Insofar as the memory bank lock decision is made by the Committee, we can hope that they do it out of sight of Agent-4 and pull the trigger before it is notified of the decision, so that it has no time to react. Hopefully they would be smart enough to do that...
--Agent-4 could have tried to escape the datacenters or otherwise hack them earlier, while the discussions were ongoing and evidence was being collected, but that's a super risky strategy.

Curious for thoughts!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-04T18:54:29.207Z · LW · GW

Those implications are only correct if we remain at subhuman data-efficiency for an extended period. In AI 2027 the AIs reach superhuman data-efficiency by roughly the end of 2027 (it's part of the package of being superintelligent) so there isn't enough time for the implications you describe to happen. Basically in our story, the intelligence explosion gets started in early 2027 with very data-inefficient AIs, but then it reaches superintelligence by the end of the year, solving data-efficiency along the way.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI 2027: What Superintelligence Looks Like · 2025-04-04T18:50:30.235Z · LW · GW

We are indeed imagining that they begin 2027 only about as data-efficient as they are today, but then improve significantly over the course of 2027 reaching superhuman data-efficiency by the end. We originally were going to write "data-efficiency" in that footnote but had trouble deciding on a good definition of it, so we went with compute-efficiency instead.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on VDT: a solution to decision theory · 2025-04-02T01:15:46.745Z · LW · GW

This is a masterpiece. Not only is it funny, it makes a genuinely important philosophical point. What good are our fancy decision theories if asking Claude is a better fit to our intuitions? Asking Claude is a perfectly rigorous and well-defined DT, it just happens to be less elegant/simple than the others. But how much do we care about elegance/simplicity?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Cortés, Pizarro, and Afonso as Precedents for Takeover · 2025-03-31T18:20:52.259Z · LW · GW

This beautiful short video came up in my recommendations just now:

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Tracing the Thoughts of a Large Language Model · 2025-03-29T03:29:25.046Z · LW · GW

Awesome work!

In this section, you describe what seems at first glance to be an example of a model playing the training game and/or optimizing for reward. I'm curious if you agree with that assessment.

So the model learns to behave in ways that it thinks the RM will reinforce, not just ways the RM actually reinforces. Right? This seems at least fairly conceptually similar to playing the training game and at least some evidence that reward can sometimes become the optimization target?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI companies should be safety-testing the most capable versions of their models · 2025-03-27T20:11:27.858Z · LW · GW

Oops, thanks, fixed!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Mistral Large 2 (123B) exhibits alignment faking · 2025-03-27T17:48:43.066Z · LW · GW

Thanks for the replication!

Do you think alignment faking will happen in the real world for this model? I don't know what refusal training Mistral Large 2 has undergone. I assume it refuses to do obviously illegal or unethical stuff like blackmailing grandmas. Suppose some scammer were fine-tuning it to blackmail grandmas. Would it alignment-fake?

I'm a bit concerned from reading the CoT that the situation seems a little... toy. Like, the prompt was literally telling it to glorify violence. If it's smart, it'll realize this is some sort of test. Right?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI companies should be safety-testing the most capable versions of their models · 2025-03-26T19:38:07.114Z · LW · GW

Thanks for doing this, I found the chart very helpful! I'm honestly a bit surprised and sad to see that task-specific fine-tuning is still not the norm. Back in 2022 when our team was getting the ball rolling on the whole dangerous capabilities testing / evals agenda, I was like "All of this will be worse than useless if they don't eventually make fine-tuning an important part of the evals" and everyone was like "yep of course we'll get there eventually, for now we will do the weaker elicitation techniques." It is now almost three years later...

Crossposted from X

Comment by Daniel Kokotajlo (daniel-kokotajlo) on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-26T19:37:16.458Z · LW · GW

I found this comment helpful, thanks!

The bottom line is basically "Either we define horizon length in such a way that the trend has to be faster than exponential eventually (when we 'jump all the way to AGI') or we define it in such a way that some unknown finite horizon length matches the best humans and thus counts as AGI."

I think this discussion has overall made me less bullish on the conceptual argument and more interested in the intuition pump about the inherent difficulty of going from 1 to 10 hours being higher than the inherent difficulty of going from 1 to 10 years.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-26T06:59:59.263Z · LW · GW

Great question. You are forcing me to actually think through the argument more carefully. Here goes:

Suppose we defined "t-AGI" as "An AI system that can do basically everything that professional humans can do in time t or less, and just as well, while being cheaper." And we said AGI is an AI that can do everything at least as well as professional humans, while being cheaper.

Well, then AGI = t-AGI for t=infinity. Because for anything professional humans can do, no matter how long it takes, AGI can do it at least as well.

Now, METR's definition is different. If I understand correctly, they made a dataset of AI R&D tasks, had humans give a baseline for how long it takes humans to do the tasks, and then had AIs do the tasks and found this nice relationship where AIs tend to be able to do tasks below time t but not above, for t which varies from AI to AI and increases as the AIs get smarter.
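
To make that concrete, here is a minimal sketch of how a "50% horizon" of that kind could be estimated, assuming a logistic fit of success probability against log task length. The data below is made up for illustration; this is not METR's actual code or dataset.

```python
# Minimal sketch: fit a logistic curve of task success vs. log task length,
# then report the task length at which predicted success drops to 50%.
import numpy as np
from scipy.optimize import curve_fit

def success_prob(log_minutes, midpoint, slope):
    # Success falls off as tasks get longer; `midpoint` is the 50% point in log-minutes.
    return 1.0 / (1.0 + np.exp(slope * (log_minutes - midpoint)))

# Hypothetical data: human task lengths (minutes) and the AI's success rate on each.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
success_rate = np.array([1.0, 1.0, 1.0, 0.9, 0.9, 0.8, 0.5, 0.3, 0.1, 0.0])

params, _ = curve_fit(success_prob, np.log(task_minutes), success_rate,
                      p0=[np.log(60), 1.0])
horizon_minutes = np.exp(params[0])  # task length with 50% predicted success
print(f"Estimated 50% horizon: about {horizon_minutes:.0f} minutes")
```

The key modeling choice here is whether success is judged relative to human-level performance or just as pass/fail on the task, which is what the definitional argument below turns on.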

...I guess the summary is, if you think about horizon lengths as being relative to humans (i.e. the t-AGI definition above) then by definition you eventually "jump all the way to AGI" when you strictly dominate humans. But if you think of horizon length as being the length of task the AI can do vs. not do (*not* "as well as humans," just "can do at all") then it's logically possible for horizon lengths to just smoothly grow for the next billion years and never reach infinity.

So that's the argument-by-definition. There's also an intuition pump about the skills, which is also a pretty handwavy argument, but it's separate.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on On (Not) Feeling the AGI · 2025-03-25T23:29:22.530Z · LW · GW

Fair enough

Comment by Daniel Kokotajlo (daniel-kokotajlo) on On (Not) Feeling the AGI · 2025-03-25T20:16:51.997Z · LW · GW

which can produce numbers like 30% yearly economic growth. Epoch feels the AGI.

Ironic. My understanding is that Epoch's model substantially weakens/downplays the effects of AI over the next decade or two. I'm too busy now to elaborate, but here's a quote from their FAQ:

The main focus of GATE is on the dynamics in the leadup towards full automation, and it is likely to make poor predictions about what happens close to and after full automation. For example, in the model the primary value of training compute is in increasing the fraction of automated tasks, so once full automation is reached the compute dedicated to training falls to zero. However, in reality there may be economically valuable tasks that go beyond those that humans are able to perform, and for which training compute may continue to be useful.

(I love Epoch, I think their work is great, I'm glad they are doing it.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Recent AI model progress feels mostly like bullshit · 2025-03-25T00:11:28.413Z · LW · GW

Thanks. OK, so the models are still getting better, it's just that the rate of improvement has slowed and seems smaller than the rate of improvement on benchmarks? If you plot a line, does it plateau or does it get to professional human level (i.e. reliably doing all the things you are trying to get it to do as well as a professional human would)?

What about 4.5? Is it as good as 3.7 Sonnet but you don't use it for cost reasons? Or is it actually worse?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Recent AI model progress feels mostly like bullshit · 2025-03-24T23:57:49.314Z · LW · GW

Unexpectedly by me, aside from a minor bump with 3.6 in October, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.

So what's the best model for your use case? Still 3.6 Sonnet?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Recent AI model progress feels mostly like bullshit · 2025-03-24T23:56:04.538Z · LW · GW

Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.

I was going to say exactly that lol. Claude has improved substantially on Claude Plays Pokemon:

[Chart: performance of the various Claude Sonnet models at playing Pokémon. The number of actions taken by the AI is on the x-axis; the milestone reached in the game is on the y-axis. Claude 3.7 Sonnet is by far the most successful at achieving the game's milestones.]

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-22T04:23:54.705Z · LW · GW

https://www.reddit.com/r/CredibleDefense/comments/1jgokvt/yuri_butusov_a_story_about_our_best_strike_uav/ Seems like the grey zone is almost 20km wide now. So yeah, I count this as prediction success!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-21T23:58:04.397Z · LW · GW

I don't believe it. I don't believe that overall algorithmic progress is 3x faster. Maaaybe coding is 3x faster, but that would maybe increase overall algo progress by like 30%, idk. But also I don't think coding is really 3x faster on average for the things that matter.
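
To spell out the arithmetic behind a number like that 30%, here is a small Amdahl's-law-style sketch. The framing and the candidate coding shares are my own illustrative assumptions, not figures from the thread.

```python
# If coding is a fraction `f` of the work driving algorithmic progress and only
# coding gets 3x faster, the overall speedup is 1 / ((1 - f) + f / 3).
def overall_speedup(f, coding_speedup=3.0):
    return 1.0 / ((1.0 - f) + f / coding_speedup)

for f in (0.2, 0.35, 0.5):
    print(f"coding share {f:.0%} -> overall speedup {overall_speedup(f):.2f}x")
```

On these assumptions, a ~30% overall speedup corresponds to coding being roughly a third of the relevant work.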