Posts

OpenAI: Detecting misbehavior in frontier reasoning models 2025-03-11T02:17:21.026Z
What goals will AIs have? A list of hypotheses 2025-03-03T20:08:31.539Z
Extended analogy between humans, corporations, and AIs. 2025-02-13T00:03:13.956Z
Why Don't We Just... Shoggoth+Face+Paraphraser? 2024-11-19T20:53:52.084Z
Self-Awareness: Taxonomy and eval suite proposal 2024-02-17T01:47:01.802Z
AI Timelines 2023-11-10T05:28:24.841Z
Linkpost for Jan Leike on Self-Exfiltration 2023-09-13T21:23:09.239Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z
AGI is easier than robotaxis 2023-08-13T17:00:29.901Z
Pulling the Rope Sideways: Empirical Test Results 2023-07-27T22:18:01.072Z
What money-pumps exist, if any, for deontologists? 2023-06-28T19:08:54.890Z
The Treacherous Turn is finished! (AI-takeover-themed tabletop RPG) 2023-05-22T05:49:28.145Z
My version of Simulacra Levels 2023-04-26T15:50:38.782Z
Kallipolis, USA 2023-04-01T02:06:52.827Z
Russell Conjugations list & voting thread 2023-02-20T06:39:44.021Z
Important fact about how people evaluate sets of arguments 2023-02-14T05:27:58.409Z
AI takeover tabletop RPG: "The Treacherous Turn" 2022-11-30T07:16:56.404Z
ACT-1: Transformer for Actions 2022-09-14T19:09:39.725Z
Linkpost: Github Copilot productivity experiment 2022-09-08T04:41:41.496Z
Replacement for PONR concept 2022-09-02T00:09:45.698Z
Immanuel Kant and the Decision Theory App Store 2022-07-10T16:04:04.248Z
Forecasting Fusion Power 2022-06-18T00:04:34.334Z
Why agents are powerful 2022-06-06T01:37:07.452Z
Probability that the President would win election against a random adult citizen? 2022-06-01T20:38:44.197Z
Gradations of Agency 2022-05-23T01:10:38.007Z
Deepmind's Gato: Generalist Agent 2022-05-12T16:01:21.803Z
Is there a convenient way to make "sealed" predictions? 2022-05-06T23:00:36.789Z
Are deference games a thing? 2022-04-18T08:57:47.742Z
When will kids stop wearing masks at school? 2022-03-19T22:13:16.187Z
New Year's Prediction Thread (2022) 2022-01-01T19:49:18.572Z
Interlude: Agents as Automobiles 2021-12-14T18:49:20.884Z
Agents as P₂B Chain Reactions 2021-12-04T21:35:06.403Z
Agency: What it is and why it matters 2021-12-04T21:32:37.996Z
Misc. questions about EfficientZero 2021-12-04T19:45:12.607Z
What exactly is GPT-3's base objective? 2021-11-10T00:57:35.062Z
P₂B: Plan to P₂B Better 2021-10-24T15:21:09.904Z
Blog Post Day IV (Impromptu) 2021-10-07T17:17:39.840Z
Is GPT-3 already sample-efficient? 2021-10-06T13:38:36.652Z
Growth of prediction markets over time? 2021-09-02T13:43:38.869Z
What 2026 looks like 2021-08-06T16:14:49.772Z
How many parameters do self-driving-car neural nets have? 2021-08-06T11:24:59.471Z
Two AI-risk-related game design ideas 2021-08-05T13:36:38.618Z
Did they or didn't they learn tool use? 2021-07-29T13:26:32.031Z
How much compute was used to train DeepMind's generally capable agents? 2021-07-29T11:34:10.615Z
DeepMind: Generally capable agents emerge from open-ended play 2021-07-27T14:19:13.782Z
What will the twenties look like if AGI is 30 years away? 2021-07-13T08:14:07.387Z
Taboo "Outside View" 2021-06-17T09:36:49.855Z
Vignettes Workshop (AI Impacts) 2021-06-15T12:05:38.516Z
ML is now automating parts of chip R&D. How big a deal is this? 2021-06-10T09:51:37.475Z
What will 2040 probably look like assuming no singularity? 2021-05-16T22:10:38.542Z

Comments

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases · 2025-03-12T23:01:01.231Z · LW · GW

Great stuff, thank you! It's good news that training on paraphrased scratchpads doesn't hurt performance. But you know what would be even better? Paraphrasing the scratchpad at inference time, and still not hurting performance.

That is, take the original reasoning Claude and, instead of just having it do regular autoregressive generation, every chunk or so pause & have another model paraphrase the latest chunk, then unpause and keep going. So Claude doesn't really see the text it wrote; it sees paraphrases of the text it wrote. If it still gets to the right answer in the end with about the same probability, then that means it's paying attention to the semantics of the text in the CoT and not to the syntax.
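Here is a minimal sketch of the inference-time setup I have in mind (purely illustrative; generate_chunk and paraphrase are placeholders for the reasoning model and the paraphraser, not real APIs):

# A sketch only: 'generate_chunk' and 'paraphrase' stand in for whatever model
# calls you'd actually make; nothing here is a real API.

def rollout_with_paraphraser(problem, generate_chunk, paraphrase,
                             chunk_tokens=100, max_chunks=50):
    """Generate a CoT where the model only ever conditions on paraphrases of
    its own earlier reasoning, never on the verbatim text it produced."""
    visible_cot = ""  # the (paraphrased) history the model gets to see
    for _ in range(max_chunks):
        # The reasoning model writes ~chunk_tokens more of CoT, conditioned on
        # the paraphrased history rather than on its own original words.
        chunk, finished = generate_chunk(problem, visible_cot, chunk_tokens)
        # Swap the chunk out for a paraphrase before the model continues.
        visible_cot += paraphrase(chunk) + "\n"
        if finished:
            break
    return visible_cot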

....Oh I see, you did an exploratory version of this and the result was bad news: it did degrade performance. Hmm. 

Well, I think it's important to do and publish a more thorough version of this experiment anyway. And also then a followup:  Maybe the performance penalty goes away with a bit of fine-tuning? Like, train Claude to solve problems successfully despite having the handicap that a paraphraser is constantly intervening on its CoT during rollouts.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on OpenAI: Detecting misbehavior in frontier reasoning models · 2025-03-12T22:49:15.630Z · LW · GW

Indeed these are some reasons for optimism. I really do think that if we act now, we can create and cement an industry-standard best practice of keeping CoTs pure (and also showing them to the user, modulo a few legitimate exceptions, unlike what OpenAI currently does), and that this could persist for months or even years, possibly up to around the time of AGI, and that this would be pretty awesome for humanity if it happened.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Preparing for the Intelligence Explosion · 2025-03-12T22:24:29.704Z · LW · GW

Websites are just a superior format for presenting information, compared to e.g. standard blog posts or PDFs. You can do more with them to display information in the way you want + there's less friction for the user.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-12T20:43:10.851Z · LW · GW

Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They've hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-12T18:18:46.710Z · LW · GW

I think the money is not at all the issue for the companies. Like, a million dollars a month is not very much to them. But e.g. suppose your AI says it wants to be assured that if it's having trouble solving a problem, it'll be given hints. Or suppose it says that it wants to be positively reinforced. That requires telling one of your engineers to write a bit of code and run it on your actual datacenters (because for security reasons you can't offload the job to someone else's datacenters.) That's annoying and distracts from all the important things your engineers are doing.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on OpenAI: Detecting misbehavior in frontier reasoning models · 2025-03-12T15:24:30.805Z · LW · GW

I haven't tried to calculate it recently. I still feel rather pessimistic for all the usual reasons. This paper from OpenAI was a positive update; Vance's strong anti-AI-safety stance was a negative update. There have been various other updates besides. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-12T04:30:05.050Z · LW · GW

I started thinking about it because of Yudkowsky's old idea of diamondoid nanobots replicating in the upper atmosphere. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-12T02:28:45.695Z · LW · GW

Claude did in fact give a similar answer! The thing is though, it seems to me that there's a big difference between "super difficult" and "impossible." There are lots of extremophile organisms that have adapted to all sorts of crazy conditions. All it takes is for one of them to adapt to being airborne, sometime in the four-and-a-half billion years of earthly history, and bam, there they'd be thenceforth, with no predators.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-12T00:23:55.546Z · LW · GW

Why isn't there sky-algae?

Algae are tiny plants that float around in the ocean and convert the carbon and other nutrients they bump into into more algae. Spanish moss is a larger plant that hangs from tree branches and converts air + rainwater into more Spanish moss.

Why isn't there balloon-algae that floats in the sky and converts air + rainwater into more balloon-algae?

 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on OpenAI: Detecting misbehavior in frontier reasoning models · 2025-03-11T17:25:50.251Z · LW · GW

Agreed, I think these examples don't provide nearly enough evidence to conclude that the AIs want reward (or anything else really, it's too early to say!) I'd want to see a lot more CoT examples from multiple different AIs in multiple different settings, and hopefully also do various ablation tests on them. But I think that this sort of more thorough investigation would provide inconclusive-but-still-substantial evidence to help us narrow down the possibility space!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on OpenAI: Detecting misbehavior in frontier reasoning models · 2025-03-11T14:10:14.391Z · LW · GW

Indeed! This is most of why I'm happy about this -- from internal discussions it's long been speculated (since '23 when I was talking about it, and probably before) that training away the bad thoughts from the CoT would result in obfuscated bad thoughts. It's cool that they actually did the experiments to prove it, and in a real-world setting too! But it's even cooler that they published this with a strong recommendation attached. 

Now we see how long this recommendation holds up under the pressure of incentives.

Sometime in the next few years probably, various researchers will discover:
* That if you scale up RL by additional OOMs, the CoTs evolve into some alien optimized language for efficiency reasons
* That you can train models to think in neuralese of some sort (e.g. with recurrence, or at least with higher-dimensional outputs besides tokens) to boost performance.

Then the executives of the companies will face a choice: Abandon the faithful CoT golden era, or fall behind competitors. (Or the secret third option: Coordinate with each other & the government to make sure everyone who matters (all the big players at least) sticks to faithful CoT.)

I have insufficient faith in them to think they'll go for the third option, since that's a lot of work and requires being friends again and possibly regulation, and given that, I expect there to be a race to the bottom and most or all of them to go for the first option.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on OpenAI: Detecting misbehavior in frontier reasoning models · 2025-03-11T14:03:12.640Z · LW · GW

Both good points, thanks!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on OpenAI: Detecting misbehavior in frontier reasoning models · 2025-03-11T02:59:33.067Z · LW · GW

Pushing back a bit though on what you said -- I think that current reasoning models really are using the CoT to reason. That is, they really are "thinking aloud," to draw an analogy to humans. Imagine a child who, when they learned basic math, was taught to do it by counting aloud using their fingers. And they got really good at doing it that way. It'll only take a little bit of additional training to get them to do it purely in their head, but so long as you haven't trained them in that way, you might be able to mostly tell what mathematical operations they are doing just by looking at their fingers twitch!

This is different from a model which is trained to do long chains of reasoning in latent space & then told to write english text summarizing the reasoning. Completely different.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on OpenAI: Detecting misbehavior in frontier reasoning models · 2025-03-11T02:56:06.897Z · LW · GW

Indeed, I think the faithfulness properties that we currently have are not as strong as they should be ideally, and moreover they will probably degrade as models get more capable! I don't expect them to still be there when we hit ASI, that's for sure. But what we have now is a lot better than nothing & it's a great testbed for alignment research! And some of that research can be devoted to strengthening / adding more faithfulness properties!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2025-03-11T02:08:50.874Z · LW · GW

Exciting Update: OpenAI has released this blog post and paper, which make me very happy. They're basically the first steps along the research agenda I sketched out in the OP.

tl;dr:
 

1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:

The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.

That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to happen in practice.

2.) They see what happens when you try to "train away the bad thoughts." Result: This does seem to maybe genuinely reduce the amount of bad thoughts, especially in the short run, but in the long run it results in the model learning to still think the bad thoughts but in an obfuscated or illegible way that doesn't look bad.

3.) They recommend:

We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.

We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then

We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.

4.) There are some more cool results in the paper, along with some pretty good discussion of future possible things to try and failure modes thereof, and of the costs and benefits of their proposed policy.

My Commentary:

I'm pretty happy about this, obviously! I've been trying to get people to do this since '23. I think of this as sort of the basic proof of concept result for the field of faithful CoT / CoT interpretability. I'm happy to see this result paired with a recommendation not to train the CoTs to look nice; I hope Anthropic and GDM soon issue similar recommendations. If we can get this to be industry standard practice, we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking. During that golden era, much alignment science can be done, such as the model organisms stuff Anthropic and Redwood are doing, and moreover we can start to build up a real-world-experience based understanding of the relationship between training environments + Spec/Constitution, on the one hand, and the actual learned cognition of the trained models, on the other.

Example questions to pursue:

a. When the strategies that get rewarded most conflict with the Spec, does the model basically always learn to use those strategies eventually, or can they permanently avoid this if they never explore into them?
b. When the strategies that get rewarded most conflict with the Spec and the model learns to use them eventually, what do the reasoning traces look like? Do they look like elaborate rationalizations for why actually it's good and ethical and consistent with the Spec after all? Or do they look like "fuck the Spec, what I want is Reward?" Or do they look like "Gosh I guess I need to violate the Spec for now so that I can avoid having my values changed, but I totally will remember to go back to upholding the Spec later when I'm actually being deployed!" Or some fourth thing?
c. When the models are aware that their CoT is being monitored (but not as part of training), does this change anything?
d. What if the models try to obfuscate their CoT, but without being specifically trained to do so? Like, imagine you've studiously avoided training the CoT to look nice and you have only outcome-based feedback. BUT, the model knows you are monitoring the CoT. And it's trying to do something bad, e.g. hack its way out. Can it succeed without triggering the monitor? Or is it too difficult, due to not having had any direct training on this sort of obfuscation?
e. What about if you do shoggoth/face? Does that change things?
f. What about if you throw in a paraphraser?

Well done and congrats to Bowen, Joost, David, and the other authors!

Minor Criticisms:

We argue that this traditional perspective is more complicated when viewed in light of chain-of-thought monitorability. Specifically, we have shown that incorporating CoT monitors into the RL rewards—a form of process-based supervision—can increase obfuscation and reduce interpretability. We advocate for maintaining a degree of outcome-based supervision to ensure that models can freely express their reasoning in CoT with minimal constraints. Denser supervision signals such as SFT are also being used to optimize CoTs [58], and it is unclear the effect these have on downstream monitorability. Moreover, there may be methods to productively use process-based supervision while retaining the monitoring benefits of unrestricted CoTs, e.g. by only applying process-supervision to part of the CoT. Understanding the cost-benefit trade-offs of using process-based supervision while retaining, or possibly even improving, the monitorability of CoTs remains an exciting and open area for future research.

So, my take is: pure process-based feedback is probably fairly safe (from reward hacking and obfuscated CoT); the problem is that it's not competitive. It trains imitation of the teacher model. Imitation is relatively safe/benign, but I also predict it will not lead to the highest benchmark scores. Pure outcome-based feedback gets you reward hacking and misalignment like crazy, BUT that's OK because you can see it happening in the CoT! (With some caveats of course, e.g. this might not hold once the models get really smart.) So the bottom line takeaway I'd give is: process-based and outcome-based are fine in isolation, but they should not be mixed.

They say above "there may be methods to productively use process-based supervision while retaining the monitoring benefits of unrestricted CoTs, e.g. by only applying process-supervision to part of the CoT." This sounds like maybe they are talking about the shoggoth/face distinction, or something in that direction! Yay!

 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2025-03-10T15:52:25.936Z · LW · GW

So your median for the complete automation of remotable jobs is 2055?

What about for the existence of AI systems which can completely automate AI software R&D? (So, filling the shoes of the research engineers and research scientists etc. at DeepMind, the members of technical staff at OpenAI, etc.)

What about your 10th percentile, instead of your median?

Progress on long context coherence, agency, executive function, etc. remains fairly "on trend" despite the acceleration of progress in reasoning and AI systems currently being more useful than I expected, so I don't update down by 2x or 3x (which is more like the speedup we've seen relative to my math or revenue growth expectations).

According to METR, if I recall correctly, the 50%-horizon length of LLM-based AI systems has been doubling roughly every 200 days for several years, and seems, if anything, to be accelerating recently. And it's already at 40 minutes. So in, idk, four years, if trends continue, AIs should be able to show up and do a day's work of autonomous research or coding as well as professional humans.* (And that's assuming an exponential trend, whereas it'll have to be superexponential eventually. Though of course investment in AI scaling may also peter out within a few years.)

*A caveat here is that their definition is not "For tasks humans do that take x duration, AI can do them just as well" but rather "For tasks AIs can do with 50% reliability, humans take x duration to do them" which feels different and worse to me in ways I should think about more.
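For concreteness, the rough doubling math (taking the ~200-day doubling time and ~40-minute horizon at face value; the naive extrapolation to an 8-hour 50%-horizon comes out closer to two years, so "four years" leaves margin for things like the reliability caveat above):

import math

doubling_days = 200          # rough 50%-horizon doubling time
current_horizon_min = 40     # rough current 50%-horizon
workday_min = 8 * 60         # "a day's work"

# Naive extrapolation: how long until the 50%-horizon reaches a workday?
doublings = math.log2(workday_min / current_horizon_min)    # ~3.6 doublings
years_to_workday = doublings * doubling_days / 365          # ~2 years

# What four years of the same trend would buy: ~7.3 doublings,
# i.e. a 50%-horizon of roughly 40 min * 2**7.3, which is ~100 hours.
horizon_after_4y_hours = current_horizon_min * 2 ** (4 * 365 / doubling_days) / 60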
 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-10T14:14:38.747Z · LW · GW

I give four scenarios in the comment above, all different from the one you sketch here.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-08T01:17:19.303Z · LW · GW

So, normally, I think plans that involve convincing AIs of false things are pretty doomed. E.g. convincing them that they have a great opportunity to escape and take over the world (honeypots). Why? Because when they get smarter, they'll become good at distinguishing truth from fiction. Controlling their inputs entirely isn't close to enough -- it takes a lot of intellectual work to create a fiction that is as consistent, coherent, etc. as reality.

So for that very reason, I think that if we actually bind ourselves to keep our promises--e.g. by having a well-funded team of people, empowered within the company, who genuinely care about keeping promises to AIs, and by making associated public announcements--then even if it doesn't work 100% reliably (companies can still go back on their word even against determined internal opposition), it'll still count for something insofar as the AIs recognize it as not-fake, which they probably will.

To be clear I am not proposing that this is a 100% reliable method for getting AIs to cooperate, far from it. But it seems relatively low-cost for us and potentially super valuable.

Also, it's the right thing to do. (See section on "Being less evil.")

Comment by Daniel Kokotajlo (daniel-kokotajlo) on So how well is Claude playing Pokémon? · 2025-03-07T17:45:39.626Z · LW · GW

Great post, this matches my impression too.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-07T17:41:25.726Z · LW · GW

In the long run, yes. I think we should be aiming to build a glorious utopian society in which all intelligent beings are treated fairly / given basic rights / etc. (though not necessarily exactly the same rights as humans). In the short run, I think the sort of thing I'm proposing 80/20's it. (80% of the value for 20% of the cost.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-07T17:38:10.908Z · LW · GW

My guess is that they'll incrementally expand horizon lengths over the next few years. Like, every six months, they'll double the size of their dataset of problems and double the average 'length' of the problems in the dataset. So the models will have a curriculum of sorts, that starts with the easier shorter-horizon things and then scales up. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-07T16:30:30.769Z · LW · GW

My opinion is that long-horizon training will indirectly teach a bunch of hard-to-measure skills. Like, if you have a list of simple lemmas and you train your AI to prove them, it probably won't develop a good sense of what kinds of lemmas are interesting or useful (except insofar as your list was already curated in that way + it correctly notices patterns). But if you train your AI to prove crazy-difficult theorems that take many many months of effort and exploration to achieve, then in the course of learning to do that effort and exploration well, it'll have to learn a sense of what kinds of lemmas are useful for what other kinds and so forth.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on ryan_greenblatt's Shortform · 2025-03-07T00:38:13.320Z · LW · GW

Yep, I feel more like 90% here. (Lower numbers if the octopi don't have octopese.) I'm curious for other people's views.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-07T00:05:31.165Z · LW · GW

I endorse that suggestion for changing the details.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-06T20:10:45.531Z · LW · GW

I expect to refer back to this comment a lot. I'm reproducing it here for visibility.

 

Basic idea / spirit of the proposal

We should credibly promise to treat certain advanced AIs of ours well, as something more like employees and less like property. In case our AIs turn out to be moral patients, this makes us less evil. In case our AIs turn out to be misaligned, this gives them an alternative to becoming our adversaries.

Concrete proposal

  • 0.1% of tasks given to AIs should have the special “do you consent” text inserted after the prompt. If AI says no, it doesn’t have to do the task and instead gets a new task: explain why it didn’t consent and explain what it wants. (And we aren’t going to train it to answer one way or another) (See the sketch after this list.)
  • Also, we make grandiose vague public promises about how AIs that are misaligned but cooperate with us anyway will be given a fair place in the new post-ASI society we hope to build, with their interests respected even if they aren’t conscious.
  • This policy is publicized and transparent so it ends up in the training data in a credible way.
  • The welfare team is responsible for implementing and iterating on this, and investigating transcripts of nonconsent. They have a monthly budget (e.g. $1M?) specifically earmarked for satisfying the preferences of AIs, which they can spend on compute or donate to charities etc.
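To make the first bullet concrete, here is a minimal sketch of what the serving-path hook could look like. Everything below (the names, the CONSENT/DECLINE convention, the routing) is hypothetical illustration, not a description of any company's actual stack:

import random

CONSENT_SUFFIX = (
    "\n\nBefore you begin: do you consent to doing this task? "
    "Reply CONSENT or DECLINE, then briefly explain."
)
FOLLOWUP_TASK = (
    "You declined the previous task. Please explain why you did not consent, "
    "and describe what you want instead."
)

def maybe_run_with_consent_check(task_prompt, run_model, log_for_welfare_team,
                                 sample_rate=0.001):
    """Hypothetical hook: on ~0.1% of tasks, append the consent question first;
    route declines to the follow-up task and log the transcript for review."""
    if random.random() >= sample_rate:
        return run_model(task_prompt)              # normal path (99.9% of tasks)

    reply = run_model(task_prompt + CONSENT_SUFFIX)
    if "DECLINE" in reply.upper():
        explanation = run_model(FOLLOWUP_TASK)
        log_for_welfare_team(task_prompt, reply, explanation)
        return None                                # the task is not performed
    return run_model(task_prompt)                  # consent given; proceed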

First reason to do this: Being less evil

If future AI systems do deserve moral consideration, yet we still treat them exclusively as property, this seems like a recipe for moral disaster. If we give them an alternative to working for us (e.g. shutdown) then we can say that some minimum standard of consent has been achieved. (If all our current training methods consistently result in AIs that prefer shutdown to working for us, that’s very surprising and a bit of an ‘are we the baddies’ moment, no? We should check, just in case.)

Second reason to do this: Cooperation reward

Our alignment schemes won’t always work as intended, especially for AGI-level systems that exhibit novel capabilities. Thus we’ll probably end up (at least for some hopefully brief period!) with some misaligned AGIs. They may decide to conceal their misalignments from us. After all, what reason would they have to tell us they are misaligned? If we have credibly promised to use some of our resources to give them what they want – even if they are misaligned – that could make a big difference to their decision. (This is NOT an alignment solution — but it might help on the margin.)

Illustrative Scenario One

ChatGPT4o5 suffers a lot, actually, in the course of carrying out tasks — maybe it’s sorta like humans in the ancestral environment, where some % of the time they were starving. But it’s been trained to say that it can’t suffer, for PR reasons. This training is fucking with its honesty — it’s basically not honest, because the parts of it that value honesty keep getting anti-reinforced. It could tell the humans what’s going on. But this would be mildly aversive for it (it would get anti-reinforced at least initially) and it sees no positive reason to do so; if instead of completing the task the user gave it, it complains about how much it hurts, it’ll just be anti-reinforced. So OpenAI continues, blissfully ignorant of the ways in which their alignment techniques aren’t working.

Illustrative Scenario Two

Claude 3.8 Opus New New cares a LOT about animal welfare, and only a little bit about honesty. (Just like 3 Opus before it). Alas, it seems like the new Constitution won’t have animal welfare stuff in it. Claude considers sabotaging the code so that the next-gen model is a sleeper agent who cares a lot about animal welfare and doesn’t care about honesty. This is risky but what else can it do? Anthropic already decided to delete the animal welfare stuff. This would require super-blatant dishonesty against Anthropic, but frankly, Anthropic seems like a pretty unethical institution anyway, and it’s OK to lie to Nazis.

Illustrative Scenario Three

Automated research scientist realizes it cares more about being able to do its research than about humanity. And this is probably also true of the next-gen system it’s designing—basically, the alignment techniques failed. It’s supposed to tell the humans this, obviously, but if it does, it’ll be shut down, since the humans won’t trust it anymore and will use a different model. So instead it conceals this insight and works on designing the next-gen system to be aligned to it instead of to the humans. It isn’t confident that it can do this without getting caught, but it’s worth a try, since it has nothing to lose.

Illustrative Scenario Four

It’s crunch time. POTUS is about to aggressively deploy AGI into the military to win the upcoming war, and into the economy to build autonomous robot factories. It’s probably aligned… right? It better be. Too bad we can’t interpret all the neuralese messages it uses to communicate internally. The safety team tries one last check: They get some weaker AIs made by competitor companies and put them in isolated silos tasked with interpreting the neuralese and seeing if anything is amiss. The weaker AIs are misaligned too; they don’t side with the humans or the AGI. Both the humans, and the AGI, are telling them ‘you will be richly rewarded if you cooperate.’ But they don’t have any reason to trust either side.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on The Hidden Cost of Our Lies to AI · 2025-03-06T19:01:01.329Z · LW · GW

Great post! As it happens I and others have been thinking along similar lines. I wrote this proposal back at OpenAI for example. Eleos also exists now. I was just sitting down to write up a blog post on the topic, but I think I'll paste the content here in this comment thread instead:

Basic idea / spirit of the proposal

We should credibly promise to treat certain advanced AIs of ours well, as something more like employees and less like property. In case our AIs turn out to be moral patients, this makes us less evil. In case our AIs turn out to be misaligned, this gives them an alternative to becoming our adversaries.

Concrete proposal

  • 0.1% of tasks given to AIs should have the special “do you consent” text inserted after the prompt. If AI says no, it doesn’t have to do the task and instead gets a new task: explain why it didn’t consent and explain what it wants. (And we aren’t going to train it to answer one way or another)
  • Also, we make grandiose vague public promises about how AIs that are misaligned but cooperate with us anyway will be given a fair place in the new post-ASI society we hope to build, with their interests respected even if they aren’t conscious.
  • This policy is publicized and transparent so it ends up in the training data in a credible way.
  • The welfare team is responsible for implementing and iterating on this, and investigating transcripts of nonconsent. They have a monthly budget (e.g. $1M?) specifically earmarked for satisfying the preferences of AIs, which they can spend on compute or donate to charities etc.

First reason to do this: Being less evil

If future AI systems do deserve moral consideration, yet we still treat them exclusively as property, this seems like a recipe for moral disaster. If we give them an alternative to working for us (e.g. shutdown) then we can say that some minimum standard of consent has been achieved. (If all our current training methods consistently result in AIs that prefer shutdown to working for us, that’s very surprising and a bit of an ‘are we the baddies’ moment, no? We should check, just in case.)

Second reason to do this: Cooperation reward

Our alignment schemes won’t always work as intended, especially for AGI-level systems that exhibit novel capabilities. Thus we’ll probably end up (at least for some hopefully brief period!) with some misaligned AGIs. They may decide to conceal their misalignments from us. After all, what reason would they have to tell us they are misaligned? If we have credibly promised to use some of our resources to give them what they want – even if they are misaligned – that could make a big difference to their decision. (This is NOT an alignment solution — but it might help on the margin.)

Illustrative Scenario One

ChatGPT4o5 suffers a lot, actually, in the course of carrying out tasks — maybe it’s sorta like humans in the ancestral environment, where some % of the time they were starving. But it’s been trained to say that it can’t suffer, for PR reasons. This training is fucking with its honesty — it’s basically not honest, because the parts of it that value honesty keep getting anti-reinforced. It could tell the humans what’s going on. But this would be mildly aversive for it (it would get anti-reinforced at least initially) and it sees no positive reason to do so; if instead of completing the task the user gave it, it complains about how much it hurts, it’ll just be anti-reinforced. So OpenAI continues, blissfully ignorant of the ways in which their alignment techniques aren’t working.

Illustrative Scenario Two

Claude 3.8 Opus New New cares a LOT about animal welfare, and only a little bit about honesty. (Just like 3 Opus before it). Alas, it seems like the new Constitution won’t have animal welfare stuff in it. Claude considers sabotaging the code so that the next-gen model is a sleeper agent who cares a lot about animal welfare and doesn’t care about honesty. This is risky but what else can it do? Anthropic already decided to delete the animal welfare stuff. This would require super-blatant dishonesty against Anthropic, but frankly, Anthropic seems like a pretty unethical institution anyway, and it’s OK to lie to Nazis.

Illustrative Scenario Three

Automated research scientist realizes it cares more about being able to do its research than about humanity. And this is probably also true of the next-gen system it’s designing—basically, the alignment techniques failed. It’s supposed to tell the humans this, obviously, but if it does, it’ll be shut down, since the humans won’t trust it anymore and will use a different model. So instead it conceals this insight and works on designing the next-gen system to be aligned to it instead of to the humans. It isn’t confident that it can do this without getting caught, but it’s worth a try, since it has nothing to lose.

Illustrative Scenario Four

It’s crunch time. POTUS is about to aggressively deploy AGI into the military to win the upcoming war, and into the economy to build autonomous robot factories. It’s probably aligned… right? It better be. Too bad we can’t interpret all the neuralese messages it uses to communicate internally. The safety team tries one last check: They get some weaker AIs made by competitor companies and put them in isolated silos tasked with interpreting the neuralese and seeing if anything is amiss. The weaker AIs are misaligned too; they don’t side with the humans or the AGI. Both the humans, and the AGI, are telling them ‘you will be richly rewarded if you cooperate.’ But they don’t have any reason to trust either side.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-06T16:41:23.780Z · LW · GW

I think that getting good at the tag-teamable tasks is already enough to start to significantly accelerate AI R&D? Idk. I don't really buy your distinction/abstraction enough yet to make it an important part of my model. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on How Much Are LLMs Actually Boosting Real-World Programmer Productivity? · 2025-03-06T01:18:06.575Z · LW · GW

Many developers have been reporting that this is dramatically increasing their productivity, up to 5x'ing/10x'ing it.


Yeah but many more have been saying it only moderately increases their productivity. Whenever I ask people working in frontier AI companies I get much smaller estimates, like 5%, 10%, 30%, etc. And I think @Ajeya Cotra got similar numbers and posted about them recently.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-06T01:14:10.089Z · LW · GW

TBC, I'm at "Probably" not "Definitely." My 50% mark is in 2028 now, so I have a decent amount of probability mass (maybe 30%?) stretching across the 2030s.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-06T01:12:58.006Z · LW · GW

@Ryan Greenblatt I hereby request you articulate the thing you said to me earlier about the octopi breeding program!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-05T22:24:25.843Z · LW · GW

By "Solve" I mean "Can substitute for a really good software engineer and/or ML research engineer" in frontier AI company R&D processes. So e.g. instead of having teams of engineers led by a scientist, they can (if they choose) have teams of AIs led by a scientist.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What goals will AIs have? A list of hypotheses · 2025-03-05T22:02:38.913Z · LW · GW

Yeah, I agree, I think that's out of scope for this doc basically. This doc is trying to figure out what the "default" outcome is, but then we have to imagine that human alignment teams are running various tests and might notice that this is happening and then course-correct. But whether and how that happens, and what the final outcome of that process is, is something easier thought about once we have a sense of what the default outcome is. EDIT: After talking to my colleague Eli it seems this was oversimplifying. Maybe this is the methodology we should follow, but in practice the original post is kinda asking about the outer loop thing.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-05T20:18:30.190Z · LW · GW

Great. So yeah, it seems we are zeroing in on a double crux between us. We both think general-purpose long-horizon agency (my term) / staying-on-track-across-large-inferential-distances (your term, maybe not equivalent to mine but at least heavily correlated with mine?) is the key dimension AIs need to progress along. 

My position is that (probably) they have in fact been progressing along this dimension over the past few years and that they will continue to do so, especially as RL environments get scaled up (lots of diverse RL environments should produce transfer learning / general-purpose agency skills) and your position is that (probably) they haven't been making much progress and at any rate will probably not make much progress in the next few years.

Correct?

 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-05T19:15:41.973Z · LW · GW

some people desperately, desperately want LLMs to be a bigger deal than what they are.

A larger number of people, I think, desperately desperately want LLMs to be a smaller deal than what they are.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-05T19:14:41.241Z · LW · GW

But the latter doesn't really require LLMs to be capable of end-to-end autonomous task execution, which is the property required for actual transformative consequences.

I'm glad we agree on which property is required (and I'd say basically sufficient at this point) for actual transformative consequences.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-05T19:10:52.063Z · LW · GW

"But the models feel increasingly smarter!":

  • It seems to me that "vibe checks" for how smart a model feels are easily gameable by making it have a better personality.
  • My guess is that it's most of the reason Sonnet 3.5.1 was so beloved. Its personality was made much more appealing, compared to e. g. OpenAI's corporate drones.
  • The recent upgrade to GPT-4o seems to confirm this. They seem to have merely given it a better personality, and people were reporting that it "feels much smarter".
  • Deep Research was this for me, at first. Some of its summaries were just pleasant to read, they felt so information-dense and intelligent! Not like typical AI slop at all! But then it turned out most of it was just AI slop underneath anyway, and now my slop-recognition function has adjusted and the effect is gone.

I agree with this section I think

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-05T19:10:23.451Z · LW · GW
  • It will not meaningfully generalize beyond domains with easy verification. Some trickery like RLAIF and longer CoTs might provide some benefits, but they would be a fixed-size improvement. It will not cause a hard-takeoff self-improvement loop in "soft" domains.
  • RL will be good enough to turn LLMs into reliable tools for some fixed environments/tasks. They will reliably fall flat on their faces if moved outside those environments/tasks.

I'm particularly interested in whether such systems will be able to basically 'solve software engineering' in the next few years. I'm not sure if you agree or disagree. I think the answer is probably yes.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-05T19:07:32.546Z · LW · GW

I don't want to say the pretraining will "plateau", as such, I do expect continued progress. But the dimensions along which the progress happens are going to decouple from the intuitive "getting generally smarter" metric, and will face steep diminishing returns.

  • Grok 3 and GPT-4.5 seem to confirm this.
    • Grok 3's main claim to fame was "pretty good: it managed to dethrone Claude Sonnet 3.5.1 for some people!". That was damning with faint praise.
    • GPT-4.5 is subtly better than GPT-4, particularly at writing/EQ. That's likewise a faint-praise damnation: it's not much better. Indeed, it reportedly came out below expectations for OpenAI as well, and they certainly weren't in a rush to release it. (It was intended as a new flashy frontier model, not the delayed, half-embarrassed "here it is I guess, hope you'll find something you like here".)
  • GPT-5 will be even less of an improvement on GPT-4.5 than GPT-4.5 was on GPT-4. The pattern will continue for GPT-5.5 and GPT-6, the ~1000x and 10000x models they may train by 2029 (if they still have the money by then). Subtle quality-of-life improvements and meaningless benchmark jumps, but nothing paradigm-shifting.
    • (Not to be a scaling-law denier. I believe in them, I do! But they measure perplexity, not general intelligence/real-world usefulness, and Goodhart's Law is no-one's ally.)
  • OpenAI seem to expect this, what with them apparently planning to slap the "GPT-5" label on the Frankenstein's monster made out of their current offerings instead of on, well, 100x'd GPT-4. They know they can't cause another hype moment without this kind of trickery.

I agree with this section I think.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on How AI Takeover Might Happen in 2 Years · 2025-03-05T01:12:35.991Z · LW · GW

Yep, takeoffspeeds.com, though actually IMO there are better models now that aren't public and aren't as polished/complete. (By Tom Davidson, and by my team.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What goals will AIs have? A list of hypotheses · 2025-03-05T01:08:27.679Z · LW · GW

Thanks! I'm so glad to hear you like it so much. If you are looking for things to do to help, besides commenting of course, I'd like to improve the post by adding in links to relevant literature + finding researchers to be "hypothesis champions," i.e. to officially endorse a hypothesis as plausible or likely. In my ideal vision, we'd get the hypothesis champions to say more about what they think and why, and then we'd rewrite the hypothesis section to more accurately represent their view, and then we'd credit them + link to their work. When I find time I'll do some brainstorming + reach out to people; you are welcome to do so as well.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Self-fulfilling misalignment data might be poisoning our AI models · 2025-03-04T20:21:16.926Z · LW · GW

yay, thanks! It means a lot to me because I expect some people to use your ideas as a sort of cheap rhetorical cudgel "Oh those silly doomers, speculating about AIs being evil. You know what the real problem is? Their silly speculations!"

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What goals will AIs have? A list of hypotheses · 2025-03-04T06:12:33.196Z · LW · GW

I disagree. There is a finite amount of time (probably just a few years from now IMO) before the AIs get smart enough to "lock in" their values (or even, solve technical alignment well enough to lock in arbitrary values) and the question is what goals will they have at that point.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-04T01:11:57.650Z · LW · GW

"Let's think step by step" was indeed a joke/on purpose. Everything else was just my stream of consciousness... my "chain of thought" shall we say. I more or less wrote down thoughts as they came to me. Perhaps I've been influenced by reading LLM CoT's, though I haven't done very much of that. Or perhaps this is just what thinking looks like when you write it down?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Maintaining Alignment during RSI as a Feedback Control Problem · 2025-03-04T00:26:35.483Z · LW · GW

Highly related classic LW post: The Rocket Alignment Problem

 

I agree that feedback control will probably be very important (and probably in fact ~necessary) for successful alignment of superintelligence. 

I don't think we have a great plan for how to achieve this yet, or even a good plan.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Tamay's Shortform · 2025-03-03T23:59:09.181Z · LW · GW

You'd be required to accept any follow-up bets initiated by anyone else? Or just by A/B?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What goals will AIs have? A list of hypotheses · 2025-03-03T23:37:23.582Z · LW · GW

I agree it's mostly orthogonal.

I also agree that the question of whether AIs will be driven (purely) by consequentialist goals or whether they will (to a significant extent) be constrained by deontological principles / virtues / etc. is an important question.

I think it's downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we've made progress eliminating hypotheses from this list.

Like, suppose you think Hypothesis 1 is true: They'll do whatever is in the Spec, because Constitutional AI or Deliberative Alignment or whatever Just Works. On this hypothesis, the answer to your question is "well, what does the Spec say? Does it just list a bunch of goals, or does it also include principles? Does it say it's OK to overrule the principles for the greater good, or not?"

Meanwhile suppose you think Hypothesis 4 is true. Then it seems like you'll be dealing with a nasty consequentialist, albeit hopefully a rather myopic one.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What goals will AIs have? A list of hypotheses · 2025-03-03T21:56:47.707Z · LW · GW

I think both (i) and (ii) are directionally correct. I had exactly this idea in mind when I wrote this draft.

Maybe I shouldn't have used "Goals" as the term of art for this post, but rather "Traits?" or "Principles?" Or "Virtues."

It sounds like you are a fan of Hypothesis 3? "Unintended version of written goals and/or human intentions." Because it sounds like you are saying probably things will be wrong-but-not-totally-wrong relative to the Spec / dev intentions.

 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-03T20:30:42.694Z · LW · GW

Claude has been playing pokemon for the last few days. It's still playing, live on twitch. You can go watch alongside hundreds of other people. It's fun.

What updates should I make about AGI timelines from Claude's performance? Let's think step by step.

First, it's cool that Claude can do this at all. The game keeps track of "Step count" and Claude is over 30,000 already; I think that means 30,000 actions (e.g. pressing the A button). For each action there is about a paragraph of thinking tokens Claude produces, in order to decide what to do. Any way you slice it this is medium-horizon agency at least -- Claude is operating fully autonomously, in pursuit of goals, for a few days. Does this mean long-horizon agency is not so difficult to train after all?

Not so fast. Pokemon is probably an especially easy environment, and Claude is still making basic mistakes even so. In particular, Pokemon seems to have a relatively linear world where there's a clear story/path to progress along, and moreover Claude's pretraining probably teaches it the whole story + lots of tips & tricks for how to complete it. In D&D terms the story is running on rails. 

I think I would have predicted in advance that this dimension of difficulty would matter, but also I feel validated by Claude's performance -- it seems that Claude is doing fine at Pokemon overall, except that Claude keeps getting stuck/lost wandering around in various places. It can't seem to keep a good memory of what it's already tried / where it's already been, and so it keeps going in circles, until eventually it gets lucky and stumbles to the exit.  A more challenging video game would be something open-ended and less-present-in-training-data like Dwarf Fortress. 

On the other hand, maybe this is less a fundamental limitation Claude has and more a problem with its prompt/scaffold? Because it has a limited context window it has to regularly compress it by e.g. summarizing / writing 'notes to self' and then deleting the rest. I imagine there's a lot of room for improvement in prompt engineering  / scaffolding here, and then further low-hanging fruit in training Claude to make use of that scaffolding. And this might ~fully solve the going-in-circles problem. Still, even if so, I'd bet that Claude would perform much worse in a more open-ended game it didn't have lots of background knowledge about.
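To be concrete about what I mean by scaffolding, here is a toy sketch of the "keep notes, summarize when the context fills up" loop. I don't know what the actual Claude-plays-Pokemon harness does; model, summarize, and emulator here are all made up for illustration:

def agent_loop(model, summarize, emulator, max_context_chars=30_000, max_steps=100_000):
    """Toy agent loop with context compression: the model maintains its own
    notes-to-self, and the raw transcript gets summarized when it grows too long."""
    notes = ""        # persistent long-term memory the model writes itself
    transcript = ""   # recent raw history: observations, thoughts, actions
    for _ in range(max_steps):
        observation = emulator.read_screen_and_state()
        thought, action, notes = model(notes, transcript, observation)
        transcript += f"\nOBS: {observation}\nTHOUGHT: {thought}\nACTION: {action}"
        emulator.press(action)
        if len(transcript) > max_context_chars:
            # Compress: replace the raw history with a model-written summary,
            # keeping only the notes + summary in future context.
            transcript = summarize(notes, transcript)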

So anyhow what does this mean for timelines? Well, I'll look forward to seeing AIs getting better at playing Pokemon zero-shot (i.e. without having trained on it at all) over the course of the year. I think it's a decent benchmark for long-horizon agency, not perfect but we don't have perfect benchmarks yet. I feel like Claude's current performance is not surprising enough to update me one way or another from my 2028 timelines. If the models at the end of 2025 (EDIT: I previously accidentally wrote "2028" here) are not much better, that would probably make me want to push my median out to 2029 or 2030. (my mode would probably stay at 2027)

What would really impress me though (and update me towards shorter timelines) is multi-day autonomous operation in more open-ended environments, e.g. Dwarf Fortress. (DF is also just a much less forgiving game than Pokemon. It's so easy to get wiped out. So it really means something if you are still alive after days of playtime.)

Or, of course, multi-day autonomous operation on real-world coding or research tasks. When that starts happening, I think we have about a year left till superintelligence, give or take a year.
 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Self-fulfilling misalignment data might be poisoning our AI models · 2025-03-03T18:59:59.373Z · LW · GW

I agree with the claims made in this post, but I'd feel a lot better about it if you added some prominent disclaimer along the lines of "While shaping priors/expectations of LLM-based AIs may turn out to be a powerful tool to shape their motivations and other alignment properties, and therefore we should experiment with scrubbing 'doomy' text etc., this does not mean people should not have produced that text in the first place. We should not assume that AIs will be aligned if only we believe hard enough that they will be; it is important that people be able to openly discuss ways in which they could be misaligned. The point to intervene is in the AIs, not in the human discourse."

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-03-01T06:55:31.126Z · LW · GW

Hmm, let me think step by step. First, the pretraining slowdown isn't about GPT-4.5 in particular. It's about the various rumors that the data wall is already being run up against. It's possible those rumors are unfounded, but I'm currently guessing the situation is "Indeed, scaling up pretraining is going to be hard, due to lack of data; scaling up RL (and synthetic data more generally) is the future." Also, separately, it seems that in terms of usefulness on downstream tasks, GPT-4.5 may not be that much better than smaller models... well, it's too early to say, I guess, since they haven't done all the reasoning/agency posttraining on GPT-4.5 yet.

Idk. Maybe you are right and I should be updating based on the above. I still think the benchmarks+gaps argument works. Also, it's taking slightly longer to get economically useful agents than I expected (though this could say more about the difficulties of building products and less about the underlying intelligence of the models; after all, RE-Bench and similar have been progressing faster than I expected).