Posts

Why Don't We Just... Shoggoth+Face+Paraphraser? 2024-11-19T20:53:52.084Z
Self-Awareness: Taxonomy and eval suite proposal 2024-02-17T01:47:01.802Z
AI Timelines 2023-11-10T05:28:24.841Z
Linkpost for Jan Leike on Self-Exfiltration 2023-09-13T21:23:09.239Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z
AGI is easier than robotaxis 2023-08-13T17:00:29.901Z
Pulling the Rope Sideways: Empirical Test Results 2023-07-27T22:18:01.072Z
What money-pumps exist, if any, for deontologists? 2023-06-28T19:08:54.890Z
The Treacherous Turn is finished! (AI-takeover-themed tabletop RPG) 2023-05-22T05:49:28.145Z
My version of Simulacra Levels 2023-04-26T15:50:38.782Z
Kallipolis, USA 2023-04-01T02:06:52.827Z
Russell Conjugations list & voting thread 2023-02-20T06:39:44.021Z
Important fact about how people evaluate sets of arguments 2023-02-14T05:27:58.409Z
AI takeover tabletop RPG: "The Treacherous Turn" 2022-11-30T07:16:56.404Z
ACT-1: Transformer for Actions 2022-09-14T19:09:39.725Z
Linkpost: Github Copilot productivity experiment 2022-09-08T04:41:41.496Z
Replacement for PONR concept 2022-09-02T00:09:45.698Z
Immanuel Kant and the Decision Theory App Store 2022-07-10T16:04:04.248Z
Forecasting Fusion Power 2022-06-18T00:04:34.334Z
Why agents are powerful 2022-06-06T01:37:07.452Z
Probability that the President would win election against a random adult citizen? 2022-06-01T20:38:44.197Z
Gradations of Agency 2022-05-23T01:10:38.007Z
Deepmind's Gato: Generalist Agent 2022-05-12T16:01:21.803Z
Is there a convenient way to make "sealed" predictions? 2022-05-06T23:00:36.789Z
Are deference games a thing? 2022-04-18T08:57:47.742Z
When will kids stop wearing masks at school? 2022-03-19T22:13:16.187Z
New Year's Prediction Thread (2022) 2022-01-01T19:49:18.572Z
Interlude: Agents as Automobiles 2021-12-14T18:49:20.884Z
Agents as P₂B Chain Reactions 2021-12-04T21:35:06.403Z
Agency: What it is and why it matters 2021-12-04T21:32:37.996Z
Misc. questions about EfficientZero 2021-12-04T19:45:12.607Z
What exactly is GPT-3's base objective? 2021-11-10T00:57:35.062Z
P₂B: Plan to P₂B Better 2021-10-24T15:21:09.904Z
Blog Post Day IV (Impromptu) 2021-10-07T17:17:39.840Z
Is GPT-3 already sample-efficient? 2021-10-06T13:38:36.652Z
Growth of prediction markets over time? 2021-09-02T13:43:38.869Z
What 2026 looks like 2021-08-06T16:14:49.772Z
How many parameters do self-driving-car neural nets have? 2021-08-06T11:24:59.471Z
Two AI-risk-related game design ideas 2021-08-05T13:36:38.618Z
Did they or didn't they learn tool use? 2021-07-29T13:26:32.031Z
How much compute was used to train DeepMind's generally capable agents? 2021-07-29T11:34:10.615Z
DeepMind: Generally capable agents emerge from open-ended play 2021-07-27T14:19:13.782Z
What will the twenties look like if AGI is 30 years away? 2021-07-13T08:14:07.387Z
Taboo "Outside View" 2021-06-17T09:36:49.855Z
Vignettes Workshop (AI Impacts) 2021-06-15T12:05:38.516Z
ML is now automating parts of chip R&D. How big a deal is this? 2021-06-10T09:51:37.475Z
What will 2040 probably look like assuming no singularity? 2021-05-16T22:10:38.542Z
How do scaling laws work for fine-tuning? 2021-04-04T12:18:34.559Z
Fun with +12 OOMs of Compute 2021-03-01T13:30:13.603Z
Poll: Which variables are most strategically relevant? 2021-01-22T17:17:32.717Z

Comments

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-29T04:13:18.238Z · LW · GW

Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems? I feel like a few examples have been shown & they seem to involve qualitative thinking, not just brute-force-proof-search (though of course they show lots of failed attempts and backtracking -- just like a human thought-chain would).

Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI (though I'm not sure).

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Comp Sci in 2027 (Short story by Eliezer Yudkowsky) · 2025-01-29T01:44:08.471Z · LW · GW

I forgot about this one! It's so great! Yudkowsky is a truly excellent fiction writer. I found myself laughing multiple times reading this + some OpenAI capabilities researchers I know did too. And now rereading it... yep, it stands the test of time.

I came back to this because I was thinking about how hopeless the situation w.r.t. AGI alignment seems and then a voice in my head said "it could be worse, remember the situation described in that short story?"

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-29T01:36:38.175Z · LW · GW

OK. Next question: Suppose that next year we get a nice result showing that there is a model with serial inference-time scaling across e.g. MATH + FrontierMath + IMO problems. Recall that FrontierMath and IMO are subdivided into different difficulty levels; suppose that this model can be given e.g. 10 tokens of CoT, 100, 1000, 10,000, etc. and then somewhere around the billion-serial-token level it starts solving a decent chunk of the "medium" FrontierMath problems (but not all), whereas at the million-serial-token level it only solves MATH + some easy IMO problems.

Would this count, for you?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-29T01:00:01.904Z · LW · GW

Nice.

What about "Daniel Kokotajlo can feed it his docs about some prosaic ML alignment agenda (e.g. the faithful CoT stuff) and then it can autonomously go off and implement the agenda and come back to him with a writeup of the results and takeaways. While working on this, it gets to check in with Daniel once a day for a brief 20-minute chat conversation."

Does that seem to you like it'll come earlier, or later, than the milestone you describe?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-01-28T23:06:15.544Z · LW · GW

Brief thoughts on Deliberative Alignment in response to being asked about it

  • We first train an o-style model for helpfulness, without any safety-relevant data. 
  • We then build a dataset of (prompt, completion) pairs where the CoTs in the completions reference the specifications. We do this by inserting the relevant safety specification text for each conversation in the system prompt, generating model completions, and then removing the system prompts from the data.
  • We perform incremental supervised fine-tuning (SFT) on this dataset, providing the model with a strong prior for safe reasoning. Through SFT, the model learns both the content of our safety specifications and how to reason over them to generate aligned responses.
  • We then use reinforcement learning (RL) to train the model to use its CoT more effectively. To do so, we employ a reward model with access to our safety policies to provide additional reward signal.

My summary as I currently understand it:

  1. Pretrain
  2. Helpful-only Agent (Instruction-following and Reasoning/Agency Training)
  3. Deliberative Alignment
    A. Create your reward model: Take your helpful-only agent from step 2 and prompt it with “is this an example of following these rules [insert spec]?” or something like that.
    B. Generate some attempted spec-followings: Take your helpful-only agent from step 2 and prompt it with “Follow these rules [insert spec], think step by step in your CoT before answering.” Feed it a bunch of example user inputs.
    C. Evaluate the CoT generations using your reward model, and then train another copy of the helpful-only agent from step 2 on that data using SFT to distill the highest-evaluated CoTs into it (removing the spec part of the prompt). That way, it doesn’t need to have the spec in the context window anymore, and also, it ‘reasons correctly’ about the spec (i.e. reasons in ways the RM would have most approved of).
    D. Take the resulting agent and do RL on it using the same RM as before. This time, the RM doesn’t get to see the CoT.
  4. Deploy the resulting agent.
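
To make the shape of that pipeline concrete, here's a minimal pseudocode sketch of my summary above. Everything in it (the agent object, its generate/score/copy/sft/rl methods, the judging prompt, the top-quarter cutoff) is an assumed interface I'm making up for illustration; it's my gloss on steps A-D, not OpenAI's actual implementation:

```python
# Hypothetical pseudocode sketch of the pipeline summarized above (steps A-D).
# All object names, methods, and prompts are invented for illustration.

def deliberative_alignment(helpful_only_agent, spec, user_inputs):
    # A. Reward model: the helpful-only agent itself, prompted to judge spec-compliance.
    def reward_model(completion, see_cot=True):
        visible = completion if see_cot else completion.strip_cot()
        return helpful_only_agent.score(
            f"Is this an example of following these rules?\n{spec}\n\nResponse: {visible}"
        )

    # B. Generate attempted spec-followings, with the spec inserted into the prompt.
    completions = [
        helpful_only_agent.generate(
            f"Follow these rules:\n{spec}\nThink step by step in your CoT before answering.\n\nUser: {x}"
        )
        for x in user_inputs
    ]

    # C. Distill: SFT another copy on the highest-scoring completions,
    #    with the spec removed from the prompt.
    scored = sorted(zip(user_inputs, completions),
                    key=lambda pair: reward_model(pair[1], see_cot=True),
                    reverse=True)
    best = scored[: len(scored) // 4]          # keep the top quarter (arbitrary cutoff)
    student = helpful_only_agent.copy()
    student.sft([(x, c) for x, c in best])     # prompt is just the user input, no spec

    # D. RL against the same reward model, which now does NOT see the CoT.
    student.rl(reward_fn=lambda c: reward_model(c, see_cot=False))
    return student
```

Note how helpful_only_agent shows up as generator, grader, and initialization all at once; that's where the worry below comes from.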

So there’s a chicken and egg problem here. You are basically using a model to grade/evaluate/train itself. Obvious problem is that if the initial model is misaligned, the resulting model might also be misaligned. (In fact it might even be *more* misaligned).

I can imagine someone responding “no problem we’ll just do the same thing but with a different model as the reward model, perhaps a more trusted model.” So let me break down three categories of strategy:

  • Train your smart agent with evaluations produced by humans and/or dumber models: Severe oversight failures + Inductive bias worries + possible confusing drift dynamics.
  • Train your smart agent with evaluations produced by different-but-similarly-capable models: Less severe oversight failures + Inductive bias worries + mild chicken-and-egg problem + possible confusing drift dynamics.
  • Train your smart agent with evaluations produced by itself (or something similar enough to be relevantly similar to itself): Severe chicken-and-egg problem + possible confusing drift dynamics.

See below for explanation of each term/phrase.

I’m most excited about the second kind of strategy (which includes IDA and W2SG for example). Deliberative Alignment and Constitutional AI seem to be either the first or the third.

  • Severe oversight failures: The smart agent will often find itself having to consciously and/or unconsciously choose between being aligned and getting reinforced, e.g. when it notices opportunities to lie, manipulate, or cheat and get away with it. See basically all the classic literature e.g. Ajeya Cotra's training game report.
  • Less severe oversight failures: Being overseen by an equally-competent agent helps with a lot of things, but doesn't completely solve the oversight failure problem -- e.g. I might be just as competent as you, but since I can't read your mind I can't tell if you are lying. I might not even be able to tell if the code you are writing contains a security vulnerability or an applause light (i.e. something that isn't actually good code but which appears to human raters to be good code) though I'm much more likely to be able to tell than if I was dumber than you.
  • Severe chicken-and-egg problem: You don't really have oversight failures if the overseer is exactly or almost exactly the same model. Because e.g. it knows exactly what the agent model was thinking when it output X, so it knows if the agent model was being dishonest. (Though now that I say that out loud, I realize it's not strictly true; there could still be oversight failures. But they seem much less severe at least.) However, the chicken and egg problem is severe. Somehow you need to get the overseer-variant of the model to e.g. do its best to penalize dishonesty, penalize cheating, etc. despite not itself being disinclined to lie and cheat! (If it already was disinclined to lie and cheat, then great, but how did that happen?) Perhaps this can work e.g. if different copies of the same model have different goals (perhaps because they are selfish / have indexical preferences).
  • Mild chicken-and-egg problem: When you are training a new-generation AI system that is more capable than your current stuff, how do you align it, if your alignment strategy involves using similarly-capable models to do the evaluation? This paradox may have some answers; perhaps W2SG is an example of one. (Use the non-situationally-aware base model!) Another possible answer might be IDA (use a slightly earlier version of the model but with more time to think / more scaffolding).
  • Inductive bias worries: If your Spec is complicated enough, the model might internalize the Spec in the world-model rather than in the goal-slot, where the thing in the goal-slot is something simpler than the Spec. If you are lucky it'll be a pointer to the Spec that is robust to future distribution shifts; if you aren't, it'll be a nonrobust pointer; if you are extra unlucky, it'll be some other goalset (including proxies, ICGs, and shards or variants of the Spec) combined with instrumental convergence. Ditto for your Intentions. See the mesaoptimizers paper "Risks from Learned Optimization." (Also I think this doesn't actually depend on there being a goal slot; the goal-slot model is just a helpful simplification.)
  • Possible confusing drift dynamics: This chain of AIs training AIs training AIs... seems like the sort of thing that could have all sorts of interesting or even chaotic drift dynamics. I don't have much to say here other than that. If we understood these drift dynamics better perhaps we could harness them to good effect, but in our current state of ignorance we need to add this to the list of potential problems.
Comment by Daniel Kokotajlo (daniel-kokotajlo) on Six Thoughts on AI Safety · 2025-01-27T03:45:32.584Z · LW · GW

Can you say more about how alignment is crucial for usefulness of AIs? I'm thinking especially of AIs that are scheming / alignment faking / etc.; it seems to me that these AIs would be very useful -- or at least would appear so -- until it's too late.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Six Thoughts on AI Safety · 2025-01-25T18:44:46.969Z · LW · GW

The bottom line is not that we are guaranteed safety, nor that unaligned or misaligned superintelligence could not cause massive harm— on the contrary. It is that there is no single absolute level of intelligence above which the existence of a misaligned intelligence with this level spells doom. Instead, it is all about the world in which this superintelligence will operate, the goals to which other superintelligent systems are applied, and our mechanisms to ensure that they are indeed working towards their specified goals.

I agree that the vulnerable world hypothesis is probably false and that if we could only scale up to superintelligence in parallel across many different projects / nations / factions, such that the power is distributed, AND if we can make it so that most of the ASIs are aligned at any given time, things would probably be fine.

However, it seems to me that we are headed to a world with much higher concentration of AI power than that. Moreover, it's easier to create misaligned AGI than to create aligned AGI, so at any given time the most powerful AIs will be misaligned--the companies making aligned AGIs will be going somewhat slower, taking a few hits to performance, etc.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Six Thoughts on AI Safety · 2025-01-25T18:40:38.343Z · LW · GW

But we already align complex systems, whether it’s corporations or software applications, without complete “understanding,” and do so by ensuring they meet certain technical specifications, regulations, or contractual obligations.

  1. We currently have about as much visibility into corporations as we do into large teams of AIs, because both corporations and AIs use English CoT to communicate internally. However, I fear that in the future we'll have AIs using neuralese/recurrence to communicate with their future selves and with each other.
  2. History is full of examples of corporations being 'misaligned' to the governments that in some sense created and own them. (and also to their shareholders, and also to the public, etc. Loads of examples of all kinds of misalignments). Drawing from this vast and deep history, we've evolved institutions to deal with these problems. But with AI, we don't have that history yet, we are flying (relatively) blind.
  3. Moreover, ASI will be qualitatively smarter than any corporation ever has been.
  4. Moreover, I would say that our current methods for aligning corporations only work as well as they do because the corporations have limited power. They exist in a competitive market with each other, for example. And they only think at the same speed as the governments trying to regulate them. Imagine a corporation that was rapidly growing to be 95% of the entire economy of the USA... imagine further that it is able to make its employees take a drug that makes them smarter and think orders of magnitude faster... I would be quite concerned that the government would basically become a pawn of this corporation. The corporation would essentially become the state. I worry that by default we are heading towards a world where there is a single AGI project in the lead, and that project has an army of ASIs on its datacenters, and the ASIs are all 'on the same team' because they are copies of each other and/or were trained in very similar ways... 
Comment by Daniel Kokotajlo (daniel-kokotajlo) on Six Thoughts on AI Safety · 2025-01-25T18:30:04.387Z · LW · GW

What we want is reasonable compliance in the sense of:

  1. Following the specification precisely when it is clearly defined.
  2. Following the spirit of the specification in a way that humans would find reasonable in other cases.

 

This section on reasonable compliance (as opposed to love humanity etc.) is perhaps the most interesting and important. I'd love to have a longer conversation with you about it sometime if you are up for that.

Two things to say for now. First, as you have pointed out, there's a spectrum between vague general principles like 'do what's right', 'love humanity', 'be reasonable', and 'do what normal people would want you to do in this situation if they understood it as well as you do' on the one end, and thousand-page detailed specs / constitutions / flowcharts on the other end. But I claim that the problems that arise on each end of the spectrum don't go away if you are in the middle of the spectrum; they just lessen somewhat. Example: On the "thousand page spec" end of the spectrum, the obvious problem is 'what if the spec has unintended consequences / loopholes / etc.?' If you go to the middle of the spectrum and try something like Reasonable Compliance, this problem remains but in weakened form: 'what if the clearly-defined parts of the spec have unintended consequences / loopholes / etc.?' Or in other words, 'what if every reasonable interpretation of the Spec says we must do X, but X is bad?' This happens in Law all the time, even though the Law does include flexible vague terms like 'reasonableness' in its vocabulary.

Second point. Making an AI be reasonably compliant (or just compliant) instead of Good means you are putting less trust in the AI's philosophical reasoning / values / training process / etc. but more trust in the humans who get to write the Spec. Said humans had better be high-integrity and humble, because they will be tempted in a million ways to abuse their power and put things in the Spec that essentially make the AI a reflection of their own idiosyncratic values -- or worse, essentially making the AI their own loyal servant instead of making it serve everyone equally. (If we were in a world with less concentration of AI power, this wouldn't be so bad -- in fact arguably the best outcome is 'everyone gets their own ASI aligned to them specifically.' But if there is only one leading ASI project, with only a handful of people at the top of the hierarchy owning the thing... then we are basically on track to create a dictatorship or oligarchy.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Six Thoughts on AI Safety · 2025-01-25T18:07:40.714Z · LW · GW

Constant instead of temporal allocation. I do agree that as capabilities grow, we should be shifting resources to safety. But rather than temporal allocation (i.e., using AI for safety before using it for productivity), I believe we need constant compute allocation: ensuring a fixed and sufficiently high fraction of compute is always spent on safety research, monitoring, and mitigations.


I think we should be cranking up the compute allocation now, and also we should be making written safety case sketches & publishing them for critique by the scientific community, and also if the safety cases get torn to shreds such that a reasonable disinterested expert would conclude 'holy shit this thing is not safe, it plausibly is faking alignment already and/or inclined to do so in the future' then we halt internal deployment and beef up our control measures / rebuild with a different safer design / etc.  Does not feel like too much to ask, given that everyone's lives are on the line.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Six Thoughts on AI Safety · 2025-01-25T05:41:36.978Z · LW · GW

We can’t just track a single outcome (like “landed safely”). The G in AGI means that the number of ways that AGI can go wrong is as large as the number of ways that applications of human intelligence can go wrong, which include direct physical harm from misuse, societal impacts through misinformation, social upheaval from too fast changes, AIs autonomously causing harm and more.

I do agree with this, but I think that there are certain more specific failure modes that are especially important -- they are especially bad if we run into them, but if we can avoid them, then we are in a decent position to solve all the other problems. I'm thinking primarily of the failure mode where your AI is pretending to be aligned instead of actually aligned. This failure mode can arise fairly easily if (a) you don't have the interpretability tools to reliably tell the difference, and (b) inductive biases favor something other than the goals/principles you are trying to train in OR your training process is sufficiently imperfect that the AI can score higher by being misaligned than by being aligned. And both (a) and (b) seem like they are plausibly true now and will plausibly be true for the next few years. (For more on this, see this old report and this recent experimental result.) If we can avoid this failure mode, we can stay in the regime where iterative development works and figure out how to align our AIs better & then start using them to do lots of intellectual work to solve all the other problems one by one in rapid succession. (The good attractor state.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Six Thoughts on AI Safety · 2025-01-25T05:29:39.228Z · LW · GW

Safety and alignment are AI capabilities

 

I think I see what you are saying here but I just want to flag this is a nonstandard use of terms. I think the standard terminology would contrast capabilities and propensities; 'can it do the thing, if it tried' vs. 'would it ever try.' And alignment is about propensity (though safety is about both).

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Six Thoughts on AI Safety · 2025-01-25T05:27:29.186Z · LW · GW

Thanks for taking the time to think and write about this important topic!

Here are some point-by-point comments as I read:

(Though I suspect translating these technical capabilities to the economic and societal impact we associate with AGI will take significantly longer.) 

I think it'll take an additional 0 to 5 years roughly. More importantly though, I think that the point to intervene on -- the time when the most important decisions are being made -- is right around the time of AGI. By the time you have ASI, and certainly by the time you are deploying ASI into the economy, you've probably fallen into one of the two stable attractor states I describe here. Which one you fall into depends on choices made earlier, e.g. how much alignment talent you bring into the project, the extent to which that talent is optimistic vs. appropriately paranoid, the time you give them to let them cook with the models, the resources you give them (% of total compute, a say in overall design strategy), etc.

This assumes that our future AGIs and ASIs will be, to a significant extent, scaled-up versions of our current models. On the one hand, this is good news, since it means our learnings from current models are relevant for more powerful ones, and we can develop and evaluate safety techniques using them. On the other hand, this makes me doubt that safety approaches that do not show signs of working for our current models will be successful for future AIs.

I agree that future AGIs and ASIs will be to a significant extent scaled up versions of current models (at least at first; I expect the intelligence explosion to rapidly lead to additional innovations and paradigm shifts). I'm not sure what you are saying with the other sentences. Sometimes when people talk about current alignment techniques working, what they mean is 'causes current models to be better at refusals and jailbreak resistance' which IMO is a tangentially related but importantly different problem from the core problem(s) we need to solve in order to end up in the good attractor state. After all, you could probably make massive progress on refusals and jailbreaks simply by making the models smarter, without influencing their goals/values/principles at all.

Oh wait I just remembered I can comment directly on the text with a bunch of little comments instead of making one big comment here -- I'll switch to that henceforth.

Cheers!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on MONA: Managed Myopia with Approval Feedback · 2025-01-23T22:56:35.141Z · LW · GW

Thanks, this is helpful. Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback"?

Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on MONA: Managed Myopia with Approval Feedback · 2025-01-23T16:47:08.231Z · LW · GW

Interesting stuff! Could you please sketch out what it would look like to use MONA for a general-purpose computer-using AI agent? Such as what the first AGI systems might look like?

If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:
 

Have our baby agent do lots of stuff on the computer. At each step, reward it based on how good its action looks to some supervisor-process. The supervisor-process doesn't get to see the actual consequences of the action; it just gets to e.g. do some CoT to speculate about whether the action is going to be good. So maybe the training run looks something like: 10T tokens of 1T 'basic actions in the environment' and then for each of those 1T 'basic actions' there's 10,000 tokens of CoT where three models (for, against, and judge) debate how good the action is given that context. And then the reward for the basic agent is straightforwardly based on the evaluations the judge gives.
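
In rough code form, the training loop I'm imagining is something like the sketch below. All of the object names and methods (propose_action, cot, score, reinforce, etc.) are hypothetical interfaces I'm inventing to illustrate the structure, not anything from the MONA paper:

```python
# Hypothetical sketch of the setup described above, not MONA's actual algorithm.
def train_step(agent, advocate, critic, judge, env):
    state = env.observe()
    action = agent.propose_action(state)

    # The supervisor-process never sees the real consequences of the action;
    # it only speculates, via CoT debate, about whether the action looks good.
    case_for = advocate.cot(f"Argue this action is good.\nContext: {state}\nAction: {action}")
    case_against = critic.cot(f"Argue this action is bad.\nContext: {state}\nAction: {action}")
    reward = judge.score(state, action, case_for, case_against)

    # The environment still advances, but its outcome is never fed back as reward.
    env.step(action)
    agent.reinforce(state, action, reward)
```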

Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world? (Maybe the idea is that above a certain level of basic capability, that won't be true? Also maybe we can do something like IDA where the judges are copies of the agent that get to think longer, and so as the agent improves, so do they?)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-01-23T16:23:50.555Z · LW · GW

Here is a brainstorm of the big problems that remain once we successfully get into the first attractor state:

  • Concentration of power / power grab risk. Liberal democracy does not work by preventing terrible people from getting into positions of power; it works by spreading out the power in a system of checks and balances, red tape, transparency (free press, free speech), and term limits that functions to limit what the terrible people can do in power. Once we get to ASI, the ASI project will determine the course of the future, not the traditional government+press system. (Because the ASI project will be able to easily manipulate those traditional institutions if it wants to.) So somehow we need to design the governance structure of the ASI project to have similar checks and balances etc. as liberal democracy -- because by default the governance structure of the ASI project will be akin to an authoritarian dictatorship, just like most companies are and just like the executive branch (considered in isolation) is. Otherwise, we are basically crossing our fingers and hoping that the men in charge of the project will be humble, cosmopolitan, benevolent, etc. and devolve power instead of abusing it.
  • S-risk. This is related to the above but distinct from it. I'm quite worried about this actually.
  • ...actually everything else is a distant second as far as I can tell (terrorist misuse, China winning, wealth inequality, philosophical mistakes... or a distant distant third (wealth inequality, unemployment, meaning))
Comment by Daniel Kokotajlo (daniel-kokotajlo) on Understanding and controlling a maze-solving policy network · 2025-01-22T22:50:54.783Z · LW · GW

Yep seems right to me. Bravo!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-01-22T17:35:56.964Z · LW · GW

Interesting, thanks for this. Hmmm. I'm not sure this distinction between internally modelling the whole problem vs. acting in feedback loops is helpful -- won't the AIs almost certainly be modelling the whole problem, once they reach a level of general competence not much higher than what they have now? They are pretty situationally aware already.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Training on Documents About Reward Hacking Induces Reward Hacking · 2025-01-22T01:37:57.817Z · LW · GW

I'm curious whether these results are sensitive to how big the training runs are. Here's a conjecture:

Early in RL-training (or SFT), the model is mostly 'playing a role' grabbed from the library of tropes/roles/etc. it learned from pretraining. So if it read lots of docs about how AIs such as itself tend to reward-hack, it'll reward-hack. And if it read lots of docs about how AIs such as itself tend to be benevolent angels, it'll be a stereotypical benevolent angel.

But if you were to scale up the RL training a lot, then the initial conditions would matter less, and the long-run incentives/pressures/etc. of the RL environment would matter more. In the limit, it wouldn't matter what happened in pretraining, the end result would be the same.

A contrary conjecture would be that there is a long-lasting 'lock in' or 'value crystallization' effect, whereby tropes/roles/etc. picked up from pretraining end up being sticky for many OOMs of RL scaling. (Vaguely analogous to how the religion you get taught as a child does seem to 'stick' throughout adulthood)

Thoughts?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-01-21T19:13:48.521Z · LW · GW

Brief intro/overview of the technical AGI alignment problem as I see it:

To a first approximation, there are two stable attractor states that an AGI project, and perhaps humanity more generally, can end up in, as weak AGI systems become stronger towards superintelligence, and as more and more of the R&D process – and the datacenter security system, and the strategic advice on which the project depends – is handed over to smarter and smarter AIs.

In the first attractor state, the AIs are aligned to their human principals and becoming more aligned day by day thanks to applying their labor and intelligence to improve their alignment. The humans’ understanding of, and control over, what’s happening is high and getting higher.

In the second attractor state, the humans think they are in the first attractor state, but are mistaken: Instead, the AIs are pretending to be aligned, and are growing in power and subverting the system day by day, even as (and partly because) the human principals are coming to trust them more and more. The humans’ understanding of, and control over, what’s happening is low and getting lower. The humans may eventually realize what’s going on, but only when it’s too late – only when the AIs don’t feel the need to pretend anymore.

(One can imagine alternatives – e.g. the AIs are misaligned but the humans know this and are deploying them anyway, perhaps with control-based safeguards; or maybe the AIs are aligned but have chosen to deceive the humans and/or wrest control from them, but that’s OK because the situation calls for it somehow. But they seem less likely than the above, and also more unstable.)

Which attractor state is more likely, if the relevant events happen around 2027? I don’t know, but here are some considerations:

  • In many engineering and scientific domains, it’s common for something to seem like it’ll work when in fact it won’t. A new rocket design usually blows up in the air several times before it succeeds, despite lots of on-the-ground testing and a rich history of prior rockets to draw from, and pretty well-understood laws of physics. Code, meanwhile, almost always has bugs that need to be fixed. Presumably AI will be no different – and presumably, getting the goals/principles right will be no different.
  • This is doubly true since the process of loading goals/principles into a modern AI system is not straightforward. Unlike ordinary software, where we can precisely define the behavior we want, with modern AI systems we need to train it in and hope that what went in is what we hoped would go in, instead of something else that looks the same on-distribution but behaves differently in some yet-to-be-encountered environment. We can’t just check, because our AIs are black-box. (Though, that situation is improving thanks to interpretability research!) Moreover, the connection between goals/principles and behavior is not straightforward for powerful, situationally aware AI systems – even if they have wildly different goals/principles from what you wanted, they might still behave as if they had the goals/principles you wanted while still under your control. (c.f. Instrumental convergence, ‘playing the training game,’ alignment faking, etc.)
  • On the bright side, there are multiple independent alignment and control research agendas that are already bearing some fruit and which, if fully successful, could solve the problem – or at least, solve it well enough to get somewhat-superhuman AGI researchers that are trustworthy enough to trust with running our datacenters, giving us strategic advice, and doing further AI and alignment research.
  • Moreover, as with most engineering and scientific domains, there are likely to be warning signs of potential failures, especially if we go looking for them.
  • On the pessimistic side again, the race dynamics are intense; the important decisions will be made over the span of a year or so; the relevant information will by default be secret, known only to some employees in the core R&D wing of one to three companies + some people from the government. Perhaps worst of all, there is currently a prevailing attitude of dismissiveness towards the very idea that the second attractor state is plausible.
  • … many more considerations could be mentioned …
Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-01-21T18:47:46.488Z · LW · GW

I first encountered this tweet taped to the wall in OpenAI's office where the Superalignment team sat:



RIP Superalignment team. Much respect for them.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on MIRI 2024 Communications Strategy · 2025-01-21T17:53:02.334Z · LW · GW

I think I agree with this -- but do you see how it makes me frustrated to hear people dunk on MIRI's doomy views as unfalsifiable? Here's what happened in a nutshell:

MIRI: "AGI is coming and it will kill everyone."
Everyone else: "AGI is not coming and if it did it wouldn't kill everyone."
time passes, evidence accumulates...
Everyone else: "OK, AGI is coming, but it won't kill everyone"
Everyone else: "Also, the hypothesis that it won't kill everyone is unfalsifiable so we shouldn't believe it."

Comment by Daniel Kokotajlo (daniel-kokotajlo) on ryan_greenblatt's Shortform · 2025-01-21T17:43:06.534Z · LW · GW

Here's a simple argument I'd be keen to get your thoughts on: 
On the Possibility of a Tastularity

Research taste is the collection of skills including experiment ideation, literature review, experiment analysis, etc. that collectively determine how much you learn per experiment on average (perhaps alongside another factor accounting for inherent problem difficulty / domain difficulty, of course, and diminishing returns)

Human researchers seem to vary quite a bit in research taste--specifically, the difference between 90th percentile professional human researchers and the very best seems like maybe an order of magnitude? Depends on the field, etc. And the tails are heavy; there is no sign of the distribution bumping up against any limits.

Yet the causes of these differences are minor! Take the very best human researchers compared to the 90th percentile. They'll have almost the same brain size, almost the same amount of experience, almost the same genes, etc. in the grand scale of things. 

This means we should assume that if the human population were massively bigger, e.g. trillions of times bigger, there would be humans whose brains don't look that different from the brains of the best researchers on Earth, and yet who are an OOM or more above the best Earthly scientists in research taste. -- AND it suggests that in the space of possible mind-designs, there should be minds which are e.g. within 3 OOMs of those brains in every dimension of interest, and which are significantly better still in the dimension of research taste. (How much better? Really hard to say. But it would be surprising if it was only, say, 1 OOM better, because that would imply that human brains are running up against the inherent limits of research taste within a 3-OOM mind design space, despite human evolution having only explored a tiny subspace of that space, and despite the human distribution showing no signs of bumping up against any inherent limits)

OK, so what? So, it seems like there's plenty of room to improve research taste beyond human level. And research taste translates pretty directly into overall R&D speed, because it's about how much experimentation you need to do to achieve a given amount of progress. With enough research taste, you don't need to do experiments at all -- or rather, you look at the experiments that have already been done, and you infer from them all you need to know to build the next design or whatever.

Anyhow, tying this back to your framework: What if the diminishing returns / increasing problem difficulty / etc. dynamics are such that, if you start from a top-human-expert-level automated researcher, and then do additional AI research to double its research taste, and then do additional AI research to double its research taste again, etc. the second doubling happens in less time than it took to get to the first doubling? Then you get a singularity in research taste (until these conditions change of course) -- the Tastularity.
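
Stated slightly more formally (my gloss on the condition, with made-up notation): let $t_i$ be the time it takes to achieve the $i$-th doubling of research taste, starting from the top-human-expert-level automated researcher. If the doublings keep speeding up by at least a constant factor, i.e.

$$t_{i+1} \le r \, t_i \quad \text{for some fixed } r < 1,$$

then the total time for unboundedly many doublings is bounded: $\sum_{i \ge 1} t_i \le t_1/(1-r)$. That's the sense in which "each doubling is faster than the last," if it persists, gives you a singularity in research taste, until the underlying conditions change.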

How likely is the Tastularity? Well, again one piece of evidence here is the absurdly tiny differences between humans that translate to huge differences in research taste, and the heavy-tailed distribution. This suggests that we are far from any inherent limits on research taste even for brains roughly the shape and size and architecture of humans, and presumably the limits for a more relaxed region of mind-design space (e.g. a 3-OOM radius in dimensions like size, experience, architecture) are even farther away. It similarly suggests that there should be lots of hill-climbing that can be done to iteratively improve research taste.

How does this relate to software-singularity? Well, research taste is just one component of algorithmic progress; there is also speed, # of parallel copies & how well they coordinate, and maybe various other skills besides such as coding ability. So even if the Tastularity isn't possible, improvements in taste will stack with improvements in those other areas, and the sum might cross the critical threshold.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on MIRI 2024 Communications Strategy · 2025-01-20T17:00:58.972Z · LW · GW

I totally agree btw that it matters sociologically who is making novel predictions and who is sticking with the crowd. And I do in fact ding MIRI points for this relative to some other groups. However I think relative to most elite opinion-formers on AGI matters, MIRI performs better than average on this metric.

But note that this 'novel predictions' metric is about people/institutions, not about hypotheses.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on MIRI 2024 Communications Strategy · 2025-01-20T16:58:41.433Z · LW · GW

Also note that Barnett said "any novel predictions", which is not part of the Wikipedia definition of falsifiability, right? The Wikipedia definition doesn't make reference to an existing community of scientists who already made predictions, such that a new hypothesis can be said to have made novel vs. non-novel predictions.

 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on MIRI 2024 Communications Strategy · 2025-01-20T15:56:41.449Z · LW · GW

Very good point.

So, by the Wikipedia definition, it seems that all the mainstream theories of cosmology are unfalsifiable, because they allow for tiny probabilities of Boltzmann brains etc. with arbitrary experiences. There is literally nothing you could observe that would rule them out / logically contradict them.

Also, in practice, it's extremely rare for a theory to be ruled out or even close-to-ruled out from any single observation or experiment. Instead, evidence accumulates in a bunch of minor and medium-sized updates.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on meemi's Shortform · 2025-01-19T05:37:47.444Z · LW · GW

Well, I'd sure like to know whether you are planning to give the dataset to OpenAI or any other frontier companies! It might influence my opinion of whether this work is net positive or net negative.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on MIRI 2024 Communications Strategy · 2025-01-19T05:33:17.388Z · LW · GW

Here's how I'd deal with those examples:

Theory X: Jesus will come again: Presumably this theory assigns some probability mass >0 to observing Jesus tomorrow, whereas theory Y assigns ~0. If Jesus is not observed tomorrow, that's a small amount of evidence for theory Y and a small amount of evidence against theory X. So you can say that theory X has been partially falsified. Repeat this enough times, and then you can say theory X has been fully falsified, or close enough. (Your credence in theory X will never drop to 0 probably, but that's fine, that's also true of all sorts of physical theories in good standing e.g. all the major theories of cosmology and cognitive science, which allow for tiny probabilities of arbitrary sequences of experiences happening in e.g. Boltzmann Brains.)
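
Here's a toy numerical version of that "repeated partial falsification" picture; the prior and per-day likelihoods are made up purely for illustration:

```python
# Toy Bayesian updating: theory X says the event has some small daily probability of being
# observed; theory Y says ~0. Repeated non-observation gradually drives credence in X down
# without ever reaching exactly zero. All numbers are illustrative.

p_event_given_x = 0.01   # P(observe the event on a given day | X)
p_event_given_y = 1e-9   # P(observe the event on a given day | Y)
credence_x = 0.5         # prior P(X); the rest is on Y

for day in range(1000):
    # Each day passes with no observation, so update on the non-observation.
    lik_x = 1 - p_event_given_x
    lik_y = 1 - p_event_given_y
    credence_x = (credence_x * lik_x) / (credence_x * lik_x + (1 - credence_x) * lik_y)

print(f"P(X) after 1000 uneventful days: {credence_x:.2e}")  # ~4e-05: falsified for practical purposes
```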

With the sky color example:
My way of thinking about falsifiability is, we say two theories are falsifiable relative to each other if there is evidence we expect to encounter that will distinguish them / cause us to shift our relative credence in them. 

In the case of Theory Z, there is an implicit theory Z2 which is "NOT blah blah, and therefore the sky could be green or not green." (Presumably that's what you are holding in the back of your mind as the alternative to Z, when you imagine updating for or against Z on the basis of seeing blue sky, and decide that you wouldn't?) Because the theory Z3 "NOT blah blah and therefore the sky is blue" would be confirmed by seeing a blue sky, and if somehow you were splitting your credence between Z and Z3, then you would decrease your credence in Z if you saw a blue sky.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Deceptive Alignment and Homuncularity · 2025-01-16T20:37:17.205Z · LW · GW

I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).[2]

How many points you get here is proportional to how many people were betting the other way. So, very few, I think, because parable aside I don't think anyone was seriously predicting that mere pretrained systems would have consistent-across-contexts agency. Well, probably some people were, but I wasn't, and I think most of the people you are criticizing weren't. Ditto for 'break out of its cage in normal usage' etc.

I was most strongly critiquing the idea that "playing the training game" occurs during pretraining or after light post-training. I still think that you aren't in danger from simply pretraining an AI in the usual fashion, and still won't be in the future. But the fact that I didn't call that out at the time means I get dinged[3] --- after all, Claude was "playing the training game" at least in its inner CoTs. 

The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren't talking about pretraining or light post-training afaict. I certainly wasn't, speaking for myself. I was talking about future AGI systems that are much more agentic, and trained with much more RL, than current chatbots.

If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. "TBC playing the training game is possible after moderate RLHF for non-myopic purposes." IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.

  1. I'm confused -- shouldn't you mean 'less likely to say...'?
  2. Wait, you thought that AIs would play the training game after intensive long-horizon RL? Does that mean you think they are going to be either sycophants or schemers, to use Ajeya's terminology? I thought you'd been arguing at length against both hypotheses?
Comment by Daniel Kokotajlo (daniel-kokotajlo) on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-16T20:20:50.467Z · LW · GW

As a partial point of comparison, in Wason's testing only about 20% of humans solved the problem tested, but Wason's experiment differed in two important ways: first, subjects were deliberately given a misleading example, and second, only one task was tested (our easiest-rated task, 'strictly increasing order').

I encourage you to get some humans to take the same test you gave the models, so that we have a better human baseline. It matters a lot for what the takeaways should be whether LLMs are already comparable to or better than humans at this task vs. still significantly worse.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Finding Features Causally Upstream of Refusal · 2025-01-14T14:46:39.568Z · LW · GW

Cool stuff! I remember way back when people first started interpreting neurons, and we started daydreaming about one day being able to zoom out and interpret the bigger picture, i.e. what thoughts occurred when and how they caused other thoughts which caused the final output. This feels like, idk, we are halfway to that day already?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on How quickly could robots scale up? · 2025-01-14T14:39:52.602Z · LW · GW

In general it would be helpful to have a range of estimates.

I think the range is as follows:

Estimates based on looking at how fast humans can do things (e.g. WW2 industrial scaleup) and then modifying somewhat upwards (e.g. 5x) in an attempt to account for superintelligence... should be the lower bound, at least for the scenario where superintelligence is involved at every level of the process.

The upper bound is the Yudkowsky bathtub nanotech scenario, or something similarly fast that we haven't thought of yet. Where the comparison point for the estimate is more about the laws of physics and/or biology.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Implications of the inference scaling paradigm for AI safety · 2025-01-14T06:32:02.598Z · LW · GW

However, I expect RL on CoT to amount to "process-based supervision," which seems inherently safer than "outcome-based supervision."

I think the opposite is true; the RL on CoT that is already being done and will increasingly be done is going to be in significant part outcome-based (and a mixture of outcome-based and process-based feedback is actually less safe than just outcome-based IMO, because it makes the CoT less faithful)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on How quickly could robots scale up? · 2025-01-13T23:41:20.680Z · LW · GW

My impression is that software has been the bottleneck here. Building a hand as dextrous as the human hand is difficult but doable (and has probably already been done, though only in very expensive prototypes); having the software to actually use that hand as intelligently and deftly as a human would has not yet been done. But I'm not an expert. Power supply is different -- humans can work all day on a few Big Macs, whereas robots will need to be charged, possibly charged frequently or even plugged in constantly. But that doesn't seem like a significant obstacle.

Re: WW2 vs. modern: yeah idk. I don't think the modern gap between cars and humanoid robots is that big. Tesla is making Optimus after all. Batteries, electronics, chips, electric motors, sensors... seems like the basic components are the same. And seems like the necessary tolerances are pretty similar; it's not like you need a clean room to make one but not the other, and it's not like you need hyperstrong-hyperlight exotic materials for one but not the other. In fact I can think of one very important, very expensive piece of equipment (the gigapress) that you need for cars but not for humanoid robots.

All of the above is for 'minimum viable humanoid robots' e.g. robots that can replace factory and construction workers. They might need to be plugged in to the wall often, they might wear out after a year, they might need to do some kinds of manipulations 2x slower due to having fatter fingers or something. But they don't need to e.g. be capable of hiking for 48 hours in the wilderness and fording rivers all on the energy provided by a Big Mac. Nor do they need to be as strong-yet-lightweight as a human.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Human takeover might be worse than AI takeover · 2025-01-12T20:56:20.283Z · LW · GW

Thanks for writing this. I think this topic is generally a blind spot for LessWrong users, and it's kind of embarrassing how little thought this community (myself included) has given to the question of whether a typical future with human control over AI is good.

I don't think it's embarrassing or a blind spot. I think I agree that it should receive more thought on the margin, and I of course agree that it should receive more thought all things considered. There's a lot to think about! You may be underestimating how much thought has been devoted to this so far. E.g. it was a common topic of discussion at the center on long-term-risk while I was there. And it's not like LW didn't consider the question until now; my recollection is that various of us considered it & concluded that yeah probably human takeover is better than AI takeover in expectation for the reasons discussed in this post.

Side note: The title of this post is "Human Takeover Might Be Worse than AI Takeover" but people seem to be reading it as "Human Takeover Will Be Worse In Expectation than AI Takeover" and when I actually read the text I come away thinking "OK yeah, these arguments make me think that human takeover will be better in expectation than AI takeover, but with some significant uncertainty."

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2025-01-12T19:57:20.999Z · LW · GW

My view is not "can no longer do any good," more like "can do less good in expectation than if you had still some time left before ASI to influence things." For reasons why, see linked comment above.

I think that by the time Metaculus is convinced that ASI already exists, most of the important decisions w.r.t. AI safety will have already been made, for better or for worse. Ditto (though not as strongly) for AI concentration-of-power risks and AI misuse risks.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on How quickly could robots scale up? · 2025-01-12T19:53:49.077Z · LW · GW

I'd be interested in an attempt to zoom in specifically on the "repurpose existing factories to make robots" part of the story. You point to WW2 car companies turning into tank and plane factories, and then say maybe a billion humanoid robots per year within 5 years of the conversion.

My wild guesses:

Human-only world: Assume it's like WW2 all over again, except for some reason everyone thinks humanoid robots are the main key to victory:

Then yeah, WW2 seems like the right comparison here. Brief google and look at some data makes me think maybe combat airplane production scaled up by an OOM in 1-2 years early on, and then tapered off to more like a doubling every year. I think what this means is that we should expect something like an OOM/year of increase in humanoid robot production in this scenario, for a couple years? So, from 10,000/yr (assuming it starts today) to a billion/yr 5 years later?
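
As a quick sanity check on those round numbers:

$$\log_{10}\!\left(\frac{10^9\ \text{robots/yr}}{10^4\ \text{robots/yr}}\right) = 5\ \text{OOMs}, \quad \text{so at roughly 1 OOM/yr that is about 5 years.}$$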

ASI-powered world: Assume ASIs are overseeing and directing the whole process + government is waiving red tape etc. (perhaps because ASI has convinced them it's a good idea):

So obviously things will go significantly faster with ASI in charge and involved at every level. The question is how much faster. Some thoughts:

  • ASI probably needs far less on-the-job experience than human companies do, to reach the same level of know-how. Like, maybe if you let it ingest all the data from Tesla, Boston Dynamics, Ford, GM, SpaceX, etc. collected over the past two decades, and analyze all that data etc., and if you give it the blueprints and prototypes for the current humanoid robots, it can in a week spit out a blueprint and plan for how to refit existing factories to produce mildly-improved versions of said robots at a run rate of about a million/yr, the plan taking six months to execute on in practice. (So this would mean 2 OOMs in 6 months whereas in the human-only world I was guessing 1 OOM in a year.)
  • Think about how much faster Elon & his companies seem to be able to get things done compared to various legacy companies, and extrapolate -- seems fair to assume that ASI would be at least as far above Elon as Elon is above typical competitor companies. Probably in fact that's a super conservative assumption. "But muh bottlenecks" --> "The whole point is we are trying to estimate how harshly the bottlenecks bite. They evidently don't bite harshly enough to stop SpaceX from seemingly going like 5x faster than Blue Origin." Also, Elon is only one guy, and his companies have a limited number of employees, thinking at human speed, who can't just copy themselves like ASI could.
  • There's also 'sci-fi' stuff to consider like nanobots etc. I think this should be taken seriously, much more seriously than people outside MIRI seem to take it. I think we basically don't have a way to upper bound how fast things could go post-ASI, or rather, I think the upper bound looks like Yudkowsky's bathtub nanotech story. 

Overall I'd guess that we would get to a billion/yr humanoid robot production within about a year of ASI, and that the bulk of these robots would be substantially more sophisticated as well compared to present-day robots. And it's easier for me to imagine things going faster than that, than slower, though perhaps I should also account for various biases that push in the other direction. For now I'll just hand-wave and hope it cancels out.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2025-01-12T18:55:09.134Z · LW · GW

I am saying that expected purchasing power given Metaculus resolved ASI a month ago is less, for altruistic purposes, than given Metaculus did not resolve ASI a month ago. I give reasons in the linked comment. Consider the analogy I just made to nuclear MAD -- suppose you thought nuclear MAD was 60% likely in the next three years, would you take the sort of bet you are offering me re ASI? Why or why not?

I do not think any market is fully efficient and I think altruistic markets are extremely fucking far from efficient. I think I might be confused or misunderstanding you though -- it seems you think my position implies that OP should be redirecting money from AI risk causes to causes that assume no ASI? Can you elaborate?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2025-01-12T16:18:54.961Z · LW · GW

Thanks for proposing this bet. I think a bullet point needs to be added:

  • Your median date of superintelligent AI as defined by Metaculus was the end of 2028. If you believe the median date is later, the bet will be worse for you.
  • The probability of me paying you if you win was the same as the probability of you paying me if I win. The former will be lower than the latter if you believe the transfer is less likely given superintelligent AI, in which case the bet will be worse for you.
  • The expected utility of money is the same to you in either case (i.e. if the utility you can get from additional money is the same after vs. before the Metaculus announcement of superintelligence). Note that I think it is very much not the same. In particular I value post-ASI-announcement dollars much less than pre-ASI-announcement dollars, maybe orders of magnitude less; see the toy calculation below. (Analogy: Suppose we were betting on 'US Government announces nuclear MAD with Russia and China is ongoing and advises everyone to seek shelter.' This is a more extreme example but gets the point across. If I somehow thought this was 60% likely to happen by 2028, it still wouldn't make sense for me to bet with you, because to a first approximation I dgaf about you wiring me $10k CPI-adjusted in the moments after the announcement.)


As a result of the above I currently think that there is no bet we could make (at least not along the above lines) that would be rational for both of us to accept.
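(To make that concrete, here's a toy expected-value sketch; every number in it is an illustrative placeholder rather than an actual credence of mine.)

```python
# Toy expected-value calculation for the proposed bet, from my side.
# All numbers are illustrative placeholders, not my actual credences.

p_asi_by_2028  = 0.6     # chance Metaculus resolves superintelligent AI in time
p_paid_if_asi  = 0.7     # chance the transfer actually happens post-announcement
value_post_asi = 0.01    # value of a post-announcement dollar vs. a pre-announcement one
stake          = 10_000  # dollars at stake on each side, CPI-adjusted

# Expected value in pre-announcement dollar equivalents:
ev_win  = p_asi_by_2028 * p_paid_if_asi * value_post_asi * stake
ev_lose = (1 - p_asi_by_2028) * stake

print(f"Expected (discounted) gain if ASI resolves: ~${ev_win:,.0f}")
print(f"Expected loss if it doesn't: ~${ev_lose:,.0f}")
print(f"Net EV: ~${ev_win - ev_lose:,.0f}")

# Even at 60% odds of "winning," the bet is badly negative for me once
# post-announcement dollars are heavily discounted.
```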
Comment by Daniel Kokotajlo (daniel-kokotajlo) on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-10T01:01:35.929Z · LW · GW

Thanks for the reply.

(I'm tracking the possibility that LLMs are steadily growing in general capability and that they simply haven't yet reached the level that impresses me personally. But on balance, I mostly don't expect this possibility to be realized.)

That possibility is what I believe. I wish we had something better to bet on than "inventing a new field of science," because by the time we observe that, there probably won't be much time left to do anything about it. What about e.g. "I, Daniel Kokotajlo, am able to use AI agents basically as substitutes for human engineer/programmer employees. I, as a non-coder, can chat with them and describe ML experiments I want them to run or websites I want them to build etc., and they'll make it happen at least as quickly and well as a competent professional would." (And not just for simple websites: also for the kind of experiments I'd want to run, which aren't the most complicated but aren't that different from things actual AI company engineers would be doing.)

What about "The model is seemingly as good at solving math problems and puzzles as Thane is, not just on average across many problems but on pretty much any specific problem including on novel ones that are unfamiliar to both of you?

Humans have "bottom-up" agency: they're engaging in fluid-intelligence problem-solving and end up "drawing" a decision-making pattern of a specific shape. An LLM, on this model, has a database of templates for such decision-making patterns, and it retrieves the best-fit agency template for whatever problem it's facing. o1/RL-on-CoTs is a way to deliberately target the set of agency-templates an LLM has, extending it. But it doesn't change the ultimate nature of what's happening.

In particular: the bottom-up approach would allow an agent to stay on-target for an arbitrarily long time, creating an arbitrarily precise fit for whatever problem it's facing. An LLM's ability to stay on-target, however, would always remain limited by the length and the expressiveness of the templates that were trained into it.

Miscellaneous thoughts: I don't yet buy that this distinction between top-down and bottom-up is binary, and insofar as it's a spectrum, I'd be willing to bet that there's been progress along it in recent years. Moreover, I'm not even convinced that this distinction matters much for generalization radius / general intelligence, and it's even less likely to matter for 'ability to 5x AI R&D,' which is the milestone I'm trying to predict first. Finally, I don't think humans stay on-target for an arbitrarily long time.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on My tentative best guess on how EAs and Rationalists sometimes turn crazy · 2025-01-09T22:40:04.956Z · LW · GW
  • Avoid groups with strong evaporative cooling dynamics. As part of that, avoid very steep status gradients within (or on the boundary of) a group. Smooth social gradients are better than strict in-and-out dynamics.
  • Probably be grounded in more than one social group. Even being part of two different high-intensity groups seems like it should reduce the dynamics here a lot.
  • To some degree, avoid attracting people who have few other options, since it makes the already high switching and exit costs even higher.
  • Confidentiality and obscurity feel like they worsen the relevant dynamics a lot, since they prevent other people from sanity-checking your takes (though this is also much more broadly applicable). For example, being involved in crimes makes it much harder to get outside feedback on your decisions, since telling people what decisions you are facing now exposes you to the risk of them outing you. Or working on dangerous technologies that you can't tell anyone about makes it harder to get feedback on whether you are making the right tradeoffs (since doing so would usually involve leaking some of the details behind the dangerous technology). 


It's interesting to think of these four in particular as applied to a company like OpenAI.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-09T22:05:22.602Z · LW · GW

Thanks! Time will tell who is right. Point by point reply:

You list four things AIs seem stubbornly bad at: 1. Innovation. 2. Reliability. 3. Solving non-templated problems. 4. Compounding returns on problem-solving-time.

First of all, 2 and 4 seem closely related to me. I would say: "Agency skills" are the skills key to being an effective agent, i.e. skills useful for operating autonomously for long periods in pursuit of goals. Noticing when you are stuck is a simple example of an agency skill. Planning is another simple example. In-context learning is another example. I would say that current AIs lack agency skills, and that 2 and 4 are just special cases of this. I would also venture to guess, with less confidence, that 1 and 3 might be because of this as well -- perhaps the reason AIs haven't made any truly novel innovations yet is that doing so takes intellectual work, work they can't do because they can't operate autonomously for long periods in pursuit of goals. (Note that reasoning models like o1 are a big leap in the direction of being able to do this!) And perhaps the reason behind the relatively poor performance on non-templated tasks is... wait, actually no, that one has a very easy separate explanation, which is that they've been trained less on those tasks. A human, too, is better at stuff they've done a lot.

Secondly, and more importantly, I don't think we can say there has been ~0 progress on these dimensions in the last few years, whether you conceive of them in your way or my way. Progress is in general s-curvy; adoption curves are s-curvy. Suppose for example that GPT-2 was 4 SDs worse than the average human at innovation, reliability, etc., and GPT-3 was 3 SDs worse, and GPT-4 was 2 SDs worse, and o1 is 1 SD worse. Under this supposition, the world would look the way that it looks today -- Thane would notice zero novel innovations from AIs, Thane would have friends who try to use o1 for coding and find that it's not useful without templates, etc. Meanwhile, as I'm sure you are aware, pretty much every benchmark anyone has ever made has shown rapid progress in the last few years -- including benchmarks made by METR, which was specifically trying to measure AI R&D ability and agency abilities, and which genuinely do seem to require (small) amounts of agency. So I think the balance of evidence is in favor of progress on the dimensions you are talking about -- it just hasn't reached human level yet, or at any rate not the level at which you'd notice big exciting changes in the world. (Analogous to: Suppose we've measured COVID in some countries but not others, and found that in every country we've measured, COVID has spread to about 0.001%-0.01% of the population, and is growing exponentially. If we live in a country that hasn't measured yet, we should assume COVID is spreading even though we don't know anyone personally who is sick yet.)
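(Here's a toy illustration of the "steady progress can look like zero visible progress" point. The SD figures and the "+2 SD threshold for a noticeable innovation" are made up for illustration, echoing the hypothetical above; they're not estimates.)

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Hypothetical capability levels (SDs relative to the average human), echoing the
# supposition above -- made-up numbers, purely illustrative.
models = {"GPT-2": -4, "GPT-3": -3, "GPT-4": -2, "o1": -1}

# Suppose producing a "noticeable novel innovation" requires performing at the
# level of a +2 SD human researcher on a given attempt (also a made-up threshold).
threshold = 2.0

for name, sd in models.items():
    p = normal_cdf(sd - threshold)
    print(f"{name} ({sd:+d} SD): ~{p:.1e} chance per attempt of a noticeable innovation")

# The underlying gap shrinks steadily and substantially, but the rate of visible
# "wow" events stays ~0 throughout -- so the world looks unchanged right up until
# it doesn't.
```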

...

You say:

My model is that all LLM progress so far has involved making LLMs better at the "top-down" thing. They end up with increasingly bigger databases of template problems, the closest-match templates end up ever-closer to the actual problems they're facing, their ability to fill-in the details becomes ever-richer, etc. This improves their zero-shot skills, and test-time compute scaling allows them to "feel out" the problem's shape over an extended period and find an ever-more-detailed top-down fit.

But it's still fundamentally not what humans do. Humans are able to instantiate a completely new abstract model of a problem – even if it's initially based on a stored template – and chisel at it until it matches the actual problem near-perfectly. This allows them to be much more reliable; this allows them to keep themselves on-track; this allows them to find "genuinely new" innovations.

Top-down vs. bottom-up seem like two different ways of solving intellectual problems. Do you think it's a sharp binary distinction? Or do you think it's a spectrum? If the latter, what makes you think o1 isn't farther along the spectrum than GPT-3? If the former -- if it's a sharp binary -- can you say what it is about LLM architecture and/or training methods that renders them incapable of thinking in the bottom-up way? (Like, naively it seems like o1 can do sophisticated reasoning. Moreover, it seems like it was trained in a way that would incentivize it to learn skills useful for solving math problems, and 'bottom-up reasoning' seems like a skill that would be useful. Why wouldn't it learn it?)

Can you describe an intellectual or practical feat, or ideally a problem set, such that if AI solves it in 2025 you'll update significantly towards my position?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-09T17:53:51.234Z · LW · GW

It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0. I guess we'll see if o3 (or an o-series model based on the next-generation base model) change that. AI does feel right on the cusp of getting good at this...

... just as it felt at the time of GPT-3.5, and GPT-4, and Sonnet 3.5.1, and o1. That just the slightest improvement along this axis would allow us to plug the outputs of AI cognition into its inputs and get a competent, autonomous AI agent.

 

Boy do I disagree with this take! Excited to discuss.

Can you say more about what skills you think the GPT series has shown ~0 improvement on?

Because if it's "competent, autonomous agency" then there has been massive progress over the last two years and over the last few months in particular. METR has basically spent dozens of FTE-years specifically trying to measure progress in autonomous agency capability, both with formal benchmarks and with lots of high-surface-area interaction with models (they have people building scaffolds to make the AIs into agents and do various tasks, etc.). And METR seems to think that progress has been rapid and indeed faster than they expected.

Has there been enough progress to automate swathes of jobs? No, of course not -- see the benchmarks. E.g. RE-Bench shows that even the best public models like o1 and the new Sonnet are only as good as professional coders on time horizons of, like, an hour or so (give or take; depends on how you measure, the task, etc.). Which means that if you give them the sort of task that would take a normal employee, like, three hours, they are worse than a competent human professional. Specifically they'd burn lots of tokens and compute and push lots of buggy code and overall make a mess of things, just like an eager but incompetent employee.

And I'd say the models are unusually good at these coding tasks compared to other kinds of useful professional tasks, because the companies have been trying harder to train them to code and it's inherently easier to train due to faster feedback loops etc.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2025-01-08T00:10:55.721Z · LW · GW

That's reasonable. Seems worth mentioning that I did make predictions in What 2026 Looks Like, and eyeballing them now I don't think I was saying that we'd have personal assistants that shop for you and book meetings for you in 2024, at least not in a way that really works. (I say at the beginning of 2026 "The age of the AI assistant has finally dawned.") In other words I think even in 2021 I was thinking that widespread actually useful AI assistants would happen about a year or two before superintelligence. (Not because I have opinions about the orderings of technologies in general, but because I think that once an AGI company has had a popular working personal assistant for two years they should be able to figure out how to make a better version that dramatically speeds up their R&D.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2025-01-07T23:08:38.541Z · LW · GW

I tentatively remain dismissive of this argument. My claim was never "AIs are actually reliable and safe now" such that your lived experience would contradict it. I too predicted that AIs would be unreliable and risky in the near-term. My prediction is that after the intelligence explosion the best AIs will be reliable and safe (insofar as they want to be, that is.)

...I guess just now I was responding to a hypothetical interlocutor who agrees that AI R&D automation could come soon but thinks that that doesn't count as "actual impacts in the world." I've met many such people, people who think that software-only singularity is unlikely, people who like to talk about real-world bottlenecks, etc. But you weren't describing such a person, you were describing someone who also thinks we won't be able to automate AI R&D for a long time.

There I'd say... well, we'll see. I agree that AIs are unreliable and risky and that therefore they'll be able to do impressive-seeming stuff that looks like they could automate AI R&D well before they actually automate AI R&D in practice. But... probably by the end of 2025 they'll be hitting that first milestone (imagine e.g. an AI that crushes RE-Bench and also can autonomously research & write ML papers, except the ML papers are often buggy and almost always banal / unimportant, and the experiments done to make them had a lot of bugs and wasted compute, and thus AI companies would laugh at the suggestion of putting said AI in charge of a bunch of GPUs and telling it to cook.) And then two years later maybe they'll be able to do it for real, reliably, in practice, such that AGI takeoff happens.

Maybe another thing I'd say is: "One domain where AIs seem to be heavily used in practice is coding, especially coding at frontier AI companies (according to friends who work at these companies and report fairly heavy usage). This suggests that AI R&D automation will happen more or less on schedule."

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2025-01-07T22:56:09.351Z · LW · GW

lol what? Can you compile/summarize a list of examples of AI agents running amok in your personal experience? To what extent was it an alignment problem vs. a capabilities problem?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2025-01-07T20:27:42.108Z · LW · GW

I find myself rereading this, or parts of it, every once in a while. Kudos to those who authored it: https://intelligence.org/wp-content/uploads/2024/12/Misalignment_and_Catastrophe.pdf

Comment by Daniel Kokotajlo (daniel-kokotajlo) on OpenAI #10: Reflections · 2025-01-07T18:21:14.612Z · LW · GW

I agree with basically everything you say here.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Joseph Miller's Shortform · 2025-01-07T17:38:34.163Z · LW · GW

Well, it seems quite important whether the DROS registration could possibly have been staged. If e.g. there is footage of Suchir buying a gun 6+ months prior, using his ID, etc., then the assassins would have had to sneak in and grab his own gun from him, etc., which seems unlikely.

Is the interview with the NYT going to be published?

Is any of the police behavior actually out of the ordinary?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2025-01-07T05:12:23.885Z · LW · GW

That concrete scenario was NOT my median prediction. Sorry, I should have made that more clear at the time. It was genuinely just a thought experiment for purposes of eliciting people's claims about how they would update on what kinds of evidence. My median AGI timeline at the time was 2027 (which is not that different from the scenario, to be clear! Just one year delayed basically.)

To answer your other questions:
--My views haven't changed much. Performance on the important benchmarks (agency tasks such as METR's RE-Bench) improved faster than I expected in 2024, but the cadence of big new foundation models seems to be slower than I expected (no GPT-5; pretraining scaling is apparently slowing down due to the data wall? I thought that would happen more around GPT-6 level). I still have 2027 as my median year for AGI.
--Yes, I and others have run versions of that exercise several times now and yes people have found it valuable. The discussion part, people said, was less valuable than the "force yourself to write out your median scenario" part, so in more recent iterations we mostly just focused on that part.