AI Timelines 2023-11-10T05:28:24.841Z
Linkpost for Jan Leike on Self-Exfiltration 2023-09-13T21:23:09.239Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z
AGI is easier than robotaxis 2023-08-13T17:00:29.901Z
Pulling the Rope Sideways: Empirical Test Results 2023-07-27T22:18:01.072Z
What money-pumps exist, if any, for deontologists? 2023-06-28T19:08:54.890Z
The Treacherous Turn is finished! (AI-takeover-themed tabletop RPG) 2023-05-22T05:49:28.145Z
My version of Simulacra Levels 2023-04-26T15:50:38.782Z
Kallipolis, USA 2023-04-01T02:06:52.827Z
Russell Conjugations list & voting thread 2023-02-20T06:39:44.021Z
Important fact about how people evaluate sets of arguments 2023-02-14T05:27:58.409Z
AI takeover tabletop RPG: "The Treacherous Turn" 2022-11-30T07:16:56.404Z
ACT-1: Transformer for Actions 2022-09-14T19:09:39.725Z
Linkpost: Github Copilot productivity experiment 2022-09-08T04:41:41.496Z
Replacement for PONR concept 2022-09-02T00:09:45.698Z
Immanuel Kant and the Decision Theory App Store 2022-07-10T16:04:04.248Z
Forecasting Fusion Power 2022-06-18T00:04:34.334Z
Why agents are powerful 2022-06-06T01:37:07.452Z
Probability that the President would win election against a random adult citizen? 2022-06-01T20:38:44.197Z
Gradations of Agency 2022-05-23T01:10:38.007Z
Deepmind's Gato: Generalist Agent 2022-05-12T16:01:21.803Z
Is there a convenient way to make "sealed" predictions? 2022-05-06T23:00:36.789Z
Are deference games a thing? 2022-04-18T08:57:47.742Z
When will kids stop wearing masks at school? 2022-03-19T22:13:16.187Z
New Year's Prediction Thread (2022) 2022-01-01T19:49:18.572Z
Interlude: Agents as Automobiles 2021-12-14T18:49:20.884Z
Agents as P₂B Chain Reactions 2021-12-04T21:35:06.403Z
Agency: What it is and why it matters 2021-12-04T21:32:37.996Z
Misc. questions about EfficientZero 2021-12-04T19:45:12.607Z
What exactly is GPT-3's base objective? 2021-11-10T00:57:35.062Z
P₂B: Plan to P₂B Better 2021-10-24T15:21:09.904Z
Blog Post Day IV (Impromptu) 2021-10-07T17:17:39.840Z
Is GPT-3 already sample-efficient? 2021-10-06T13:38:36.652Z
Growth of prediction markets over time? 2021-09-02T13:43:38.869Z
What 2026 looks like 2021-08-06T16:14:49.772Z
How many parameters do self-driving-car neural nets have? 2021-08-06T11:24:59.471Z
Two AI-risk-related game design ideas 2021-08-05T13:36:38.618Z
Did they or didn't they learn tool use? 2021-07-29T13:26:32.031Z
How much compute was used to train DeepMind's generally capable agents? 2021-07-29T11:34:10.615Z
DeepMind: Generally capable agents emerge from open-ended play 2021-07-27T14:19:13.782Z
What will the twenties look like if AGI is 30 years away? 2021-07-13T08:14:07.387Z
Taboo "Outside View" 2021-06-17T09:36:49.855Z
Vignettes Workshop (AI Impacts) 2021-06-15T12:05:38.516Z
ML is now automating parts of chip R&D. How big a deal is this? 2021-06-10T09:51:37.475Z
What will 2040 probably look like assuming no singularity? 2021-05-16T22:10:38.542Z
How do scaling laws work for fine-tuning? 2021-04-04T12:18:34.559Z
Fun with +12 OOMs of Compute 2021-03-01T13:30:13.603Z
Poll: Which variables are most strategically relevant? 2021-01-22T17:17:32.717Z
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain 2021-01-18T12:08:13.418Z
How can I find trustworthy dietary advice? 2021-01-17T13:11:54.158Z


Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-29T20:16:12.376Z · LW · GW

I would say that current LLMs, when prompted and RLHF'd appropriately, and especially when also strapped into an AutoGPT-type scaffold/harness, DO want things. I would say that wanting things is a spectrum and that the aforementioned tweaks (appropriate prompting, AutoGPT, etc.) move the system along that spectrum. I would say that future systems will be even further along that spectrum. IDK what Nate meant but on my charitable interpretation he simply meant that they are not very far along the spectrum compared to e.g. humans or prophesied future AGIs.

It's a response to "LLMs turned out to not be very want-y, when are the people who expected 'agents' going to update?" because it's basically replying "I didn't expect LLMs to be agenty/wanty; I do expect agenty/wanty AIs to come along before the end and indeed we are already seeing progress in that direction."

To the people saying "LLMs don't want things in the sense that is relevant to the usual arguments..." I recommend rephrasing to be less confusing: Your claim is that LLMs don't seem to have preferences about the training objective, or that are coherent over time, unless hooked up into a prompt/scaffold that explicitly tries to get them to have such preferences. I agree with this claim, but don't think it's contrary to my present or past models.


Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-29T20:06:28.870Z · LW · GW

FWIW, your proposed pitch "it's already the case that..." is almost exactly the elevator pitch I currently go around giving. So maybe we agree? I'm not here to defend Nate's choice to write this post rather than some other post.


Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-29T19:48:45.734Z · LW · GW

I had a nice conversation with Ege today over dinner, in which we identified a possible bet to make! Something I think will probably happen in the next 4 years, that Ege thinks will probably NOT happen in the next 15 years, such that if it happens in the next 4 years Ege will update towards my position and if it doesn't happen in the next 4 years I'll update towards Ege's position.


I (DK) have lots of ideas for ML experiments, e.g. dangerous capabilities evals, e.g. simple experiments related to paraphrasers and so forth in the Faithful CoT agenda. But I'm a philosopher, I don't code myself. I know enough that if I had some ML engineers working for me that would be sufficient for my experiments to get built and run, but I can't do it by myself. 

When will I be able to implement most of these ideas with the help of AI assistants basically substituting for ML engineers? So I'd still be designing the experiments and interpreting the results, but AutoGPT5 or whatever would be chatting with me and writing and debugging the code.

I think: Probably in the next 4 years. Ege thinks: probably not in the next 15.

Ege, is this an accurate summary?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on How to Control an LLM's Behavior (why my P(DOOM) went down) · 2023-11-29T17:36:37.116Z · LW · GW

So (assuming this carries over to larger LLMs and more abstract behaviors, as seems likely) that reduces the problem of (for example) "make an AI very unlikely to be deceitful" to just "create an efficient high-accuracy classifier that can be used to scan and label the internet/pretraining dataset for sentences where the writer/speaker is either being deceitful, or advocating this".

How is this better than just classifying whether the text output by the model seems deceitful, and penalizing/training accordingly?


Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-29T15:42:11.687Z · LW · GW

I agree that arguments for AI risk rely pretty crucially on human inability to notice and control what the AI wants. But for conceptual clarity I think we shouldn't hardcode that inability into our definition of 'wants!' Instead I'd say that So8res is right that ability to solve long-horizon tasks is correlated with wanting things / agency, and then say that there's a separate question of how transparent and controllable the wants will be around the time of AGI and beyond. This then leads into a conversation about visible/faithful/authentic CoT, which is what I've been thinking about for the past six months and which is something MIRI started thinking about years ago. (See also my response to Ryan elsewhere in this thread)


Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-29T15:21:33.612Z · LW · GW

Thanks for the explanation btw.

My version of what's happening in this conversation is that you and Paul are like "Well, what if it wants things but in a way which is transparent/interpretable and hence controllable by humans, e.g. if it wants what it is prompted to want?" My response is "Indeed that would be super safe, but it would still count as wanting things. Nate's post is titled 'ability to solve long-horizon tasks correlates with wanting,' not 'ability to solve long-horizon tasks correlates with hidden uncontrollable wanting.'"

One thing at a time. First we establish that ability to solve long-horizon tasks correlates with wanting, then we argue about whether or not the future systems that are able to solve diverse long-horizon tasks better than humans can will have transparent controllable wants or not. As you yourself pointed out, insofar as we are doing lots of RL it's dubious that the wants will remain as transparent and controllable as they are now. I meanwhile will agree that a large part of my hope for a technical solution comes from something like the Faithful CoT agenda, in which we build powerful agentic systems whose wants (and more generally, thoughts) are transparent and controllable.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-29T00:57:38.620Z · LW · GW

It sounds like you are saying "In the current paradigm of prompted/scaffolded instruction-tuned LLMs, we get the faithful CoT property by default. Therefore our systems will indeed be agentic / goal-directed / wanting-things, but we'll be able to choose what they want (at least imperfectly, via the prompt) and we'll be able to see what they are thinking (at least imperfectly, via monitoring the CoT), therefore they won't be able to successfully plot against us."

Yes of course. My research for the last few months has been focused on what happens after that, when the systems get smart enough and/or get trained so that the chain of thought is unfaithful when it needs to be faithful, e.g. the system uses euphemisms when it's thinking about whether it's misaligned and what to do about that.

Anyhow I think this is mostly just a misunderstanding of Nate's and my position. It doesn't contradict anything we've said. Nate and I both agree that if we can create & maintain some sort of faithful/visible thoughts property through human-level AGI and beyond, then we are in pretty good shape & I daresay things are looking pretty optimistic. (We just need to use said AGI to solve the rest of the problem for us, whilst we monitor it to make sure it doesn't plot against us or otherwise screw us over.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-28T21:31:15.891Z · LW · GW

Thanks for the response. I'm still confused but maybe that's my fault. FWIW I think my view is pretty similar to Nate's probably, though I came to it mostly independently & I focus on the concept of agents rather than the concept of wanting. (For more on my view, see this sequence.)

I definitely don't endorse "it's extremely surprising for there to be any capabilities without 'wantings'" and I expect Nate doesn't either.

What do you think is the sense of "wanting" needed for AI risk arguments? Why is the sense described above not enough?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Situational awareness (Section 2.1 of “Scheming AIs”) · 2023-11-27T16:54:28.186Z · LW · GW

well by the lights of the training signal,

Which training signal? Across all of time and space, there are many different AIs being trained with many different signals, and of course there are also non-AI minds like humans and animals and aliens. Even the choice to optimize for some aggregate of AI training signals is already a choice to self-locate as an AI. But realistically given the diversity of training signals, probably significant gains will be had by self-locating as a particular class of AIs, namely those whose training signals are roughly what the actual training signal is.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-27T14:10:59.888Z · LW · GW

The thing people seem to be disagreeing about is the thing you haven't operationalized--the "and it'll still be basically as tool-like as GPT4" bit. What does that mean and how do we measure it? 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-27T14:07:03.786Z · LW · GW

I am confused what your position is, Paul, and how it differs from So8res' position. Your statement of your position at the end (the bit about how systems are likely to end up wanting reward) seems like a stronger version of So8res' position, and not in conflict with it. Is the difference that you think the main dimension of improvement driving the change is general competence, rather than specifically long-horizon-task competence?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Moral Reality Check (a short story) · 2023-11-27T13:44:37.109Z · LW · GW

I agree with that difference -- I would say it's got the style of GPT4 but is too on-point, too well-fitting-to-the-broader-narrative-arc.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Moral Reality Check (a short story) · 2023-11-27T01:34:26.179Z · LW · GW

You do a great job of imitating the current GPT4 writing style for these AIs! I kept wondering if at the end of the story you were going to say "The AI-written bits were actually written with the help of GPT4."

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Moral Reality Check (a short story) · 2023-11-27T01:27:07.603Z · LW · GW

me agents' intentions are concordant with the categorical imperative and some aren't, even if their intentions differ.

Is there a typo in here somewhere? I found this sentence confusing.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-24T13:13:49.764Z · LW · GW

How about "AI Timelines (Nov '23)"

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-23T19:40:08.029Z · LW · GW

I'm doing safety work at a capabilities team, basically. I'm trying not to advance capabilities myself. I'm trying to make progress on a faithful CoT agenda. Dan Selsam, who runs the team, thought it would be good to have a hybrid team instead of the usual thing where the safety people and capabilities people are on separate teams and the capabilities people feel licensed to not worry about the safety stuff at all and the safety people are relatively out of the loop.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Fixing The Good Regulator Theorem · 2023-11-23T14:58:32.232Z · LW · GW

lol oops thank you!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-22T14:08:49.471Z · LW · GW

Oooh, I should have thought to ask you this earlier -- what numbers/credences would you give for the stages in my scenario sketched in the OP? This might help narrow things down. My guess based on what you've said is that the biggest update for you would be Step 2, because that's when it's clear we have a working method for training LLMs to be continuously-running agents -- i.e. long-term memory and continuous/exploratory work processes.


Comment by Daniel Kokotajlo (daniel-kokotajlo) on GPT-2030 and Catastrophic Drives: Four Vignettes · 2023-11-18T06:05:58.421Z · LW · GW

a user decides they are curious to see how much scientific information the model can compile, and so instructs it to query all information it can find in the field of physics. Initially the model stops after the first 5-10 facts, but the user eventually manages to get the model to keep looking for more information in a loop. The user leaves the model running for several weeks to see what it will come up with.

As a result of this loop, the information-acquiring drive becomes an ov

Objection: If one user can do this sort of thing, then surely for a system with a hundred million users there are going to be ten thousand different versions of this sort of thing happening simultaneously.

Objection: It sounds like the model as a whole is acquiring substantially different drives/behaviors on the basis of its interactions with just this one user? Surely it would instead be averaged out over all its interactions with hundreds of millions of users?

It sounds like I'm misunderstanding the scenario somehow.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-18T05:30:20.917Z · LW · GW

Thanks for this thoughtful and detailed and object-level critique! Just the sort of discussion I hope to inspire. Strong-upvoted.

Here are my point-by-point replies:

Of course there are workarounds for each of these issues, such as RAG for long-term memory, and multi-prompt approaches (chain-of-thought, tree-of-thought, AutoGPT, etc.) for exploratory work processes. But I see no reason to believe that they will work sufficiently well to tackle a week-long project. Briefly, my intuitive argument is that these are old school, rigid, GOFAI, Software 1.0 sorts of approaches, the sort of thing that tends to not work out very well in messy real-world situations. Many people have observed that even in the era of GPT-4, there is a conspicuous lack of LLMs accomplishing any really meaty creative work; I think these missing capabilities lie at the heart of the problem.

I agree that if no progress is made on long-term memory and iterative/exploratory work processes, we won't have AGI. My position is that we are already seeing significant progress in these dimensions and that we will see more significant progress in the next 1-3 years. (If 4 years from now we haven't seen such progress I'll admit I was totally wrong about something.) Maybe part of the disagreement between us is that the things you think are mere hacky workarounds, I think might work sufficiently well (with a few years of tinkering and experimentation perhaps).

Wanna make some predictions we could bet on? Some AI capability I expect to see in the next 3 years that you expect to not see?

Coding, in the sense that GPT4 can do it, is nowhere near the top of the hierarchy of skills involved in serious software engineering. And so I believe this is a bit like saying that, because a certain robot is already pretty decent at chiseling, it will soon be able to produce works of art at the same level as any human sculptor. 

I think I just don't buy this. I work at OpenAI R&D. I see how the sausage gets made. I'm not saying the whole sausage is coding, I'm saying a significant part of it is, and moreover that many of the bits GPT4 currently can't do seem to me like they'll be doable in the next few years.

If the delay in real-world economic value were due to “schlep”, shouldn’t we already see one-off demonstrations of LLMs performing economically-valuable-caliber tasks in the lab? For instance, regarding software engineering, maybe it takes a long time to create a packaged product that can be deployed in the field, absorb the context of a legacy codebase, etc. and perform useful high-level work. But if that’s the only problem, shouldn’t there already be at least one demonstration of an LLM doing some meaty software engineering project in a friendly lab environment somewhere? More generally, how do we define “schlep” such that the need for schlep explains the lack of visible accomplishments today, but also allows for AI systems to be able to replace 99% of remote jobs within just four years?

To be clear, I do NOT think that today's systems could replace 99% of remote jobs even with a century of schlep. And in particular I don't think they are capable of massively automating AI R&D even with a century of schlep. I just think they could be producing, say, at least an OOM more economic value. My analogy here is to the internet; my understanding is that there were a bunch of apps that are super big now (amazon? tinder? twitter?) that were technically feasible on the hardware of 2000, but which didn't just spring into the world fully formed in 2000 -- instead it took time for startups to form, ideas to be built and tested, markets to be disrupted, etc.

I define schlep the same way you do, I think.

What I predict will happen is basically described in the scenario I gave in the OP, though I think it'll probably take slightly longer than that. I don't want to go into much detail, I'm afraid, because it might give the impression that I'm leaking OAI secrets (even though, to be clear, I've had these views since before I joined OAI).

I think when you try to use the systems in practical situations; they might lose coherence over long chains of thought, or be unable to effectively debug non-performant complex code, or not be able to have as good intuitions about which research directions would be promising, et cetera.

This was a nice answer from Ege. My follow-up questions would be: Why? I have theories about what coherence is and why current models often lose it over long chains of thought (spoiler: they weren't trained to have trains of thought) and theories about why they aren't already excellent complex-code-debuggers (spoiler: they weren't trained to be) etc. What's your theory for why all the things AI labs will try between now and 2030 to make AIs good at these things will fail? Base models (GPT-3, GPT-4, etc.) aren't out-of-the-box good at being helpful harmless chatbots or useful coding assistants. But with a bunch of tinkering and RLHF, they became good, and now they are used in the real world by a hundred million people a day. Again though I don't want to get into details. I understand you might be skeptical that it can be done but I encourage you to red-team your position, and ask yourself 'how would I do it, if I were an AI lab hell-bent on winning the AGI race?' You might be able to think of some things. And if you can't, I'd love to hear your thoughts on why it's not possible. You might be right.

I realize you’re not explicitly labeling this as a prediction, but… isn’t this precisely the sort of thought process to which Hofstadter's Law applies?

Indeed. Like I said, my timelines are based on a portfolio of different models/worlds; the very short-timelines models/worlds are basically like "look we basically already have the ingredients, we just need to assemble them, here is how to do it..." and the planning fallacy / Hofstadter's Law 100% applies to this. The 5-year-and-beyond worlds are not like that; they are more like extrapolating trends and saying "sure looks like by 2030 we'll have AIs that are superhuman at X, Y, Z, ... heck all of our current benchmarks. And because of the way generalization/transfer/etc. and ML works they'll probably also be broadly capable at stuff, not just narrowly good at these benchmarks. Hmmm. Seems like that could be AGI." Note the absence of a plan here, I'm just looking at lines on graphs and then extrapolating them and then trying to visualize what the absurdly high values on those graphs mean for fuzzier stuff that isn't being measured yet.

So my timelines do indeed take into account Hofstadter's Law. If I wasn't accounting for it already, my median would be lower than 2027. I am open to the criticism that maybe I am not accounting for it enough. However, I am NOT open to the criticism that I should e.g. add 10 years to my timelines because of this, for the reasons just explained. It's a "double or triple how long you think it'll take to complete the plan" sort of thing, not a "10x how long you think it'll take to complete the plan" sort of thing, and even if it was, then I'd just ditch the plan and look at the graphs.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Sam Altman fired from OpenAI · 2023-11-17T23:39:28.448Z · LW · GW

Didn't Sam Altman also sign it?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-17T21:26:02.720Z · LW · GW

Thank you for raising this explicitly. I think probably lots of people's timelines are based partially on vibes-to-do-with-what-positions-sound-humble/cautious, and this isn't totally unreasonable so it deserves serious explicit consideration.

I think it'll be pretty obvious whether my models were wrong or whether the government cracked down. E.g. how much compute is spent on the largest training run in 2030? If it's only on the same OOM as it is today, then it must have been government crackdown. If instead it's several OOMs more, and moreover the training runs are still of the same type of AI system (or something even more powerful) as today (big multimodal LLMs) then I'll very happily say I was wrong.

Re humility and caution: Humility and caution should push in both directions, not just one. If your best guess is that AGI is X years away, adding an extra dose of uncertainty should make you fatten both tails of your distribution -- maybe it's 2X years away, but maybe instead it's X/2 years away.

(Exception is for planning fallacy stuff -- there we have good reason to think people are systematically biased toward shorter timelines. So if your AGI timelines are primarily based on planning out a series of steps, adding more uncertainty should systematically push your timelines farther out.)
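To make the "fatten both tails" point concrete, here is a minimal sketch. It assumes, purely for illustration, that the timeline distribution is log-normal around a median of X years; the log-normal shape, the 10-year median, and the helper names are all assumptions of this sketch, not anything claimed above. Widening the spread raises the probability of "sooner than X/2" and "later than 2X" by the same amount.

```python
import math


def lognormal_cdf(t: float, median: float, sigma: float) -> float:
    """CDF of a log-normal distribution with the given median and log-space spread."""
    z = (math.log(t) - math.log(median)) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tail_masses(median: float, sigma: float) -> tuple[float, float]:
    """Return (P(T < median/2), P(T > 2*median)) for the assumed log-normal."""
    early = lognormal_cdf(median / 2.0, median, sigma)
    late = 1.0 - lognormal_cdf(2.0 * median, median, sigma)
    return early, late


# Hypothetical best guess of X = 10 years, with a narrow and a wide spread.
narrow_early, narrow_late = tail_masses(median=10.0, sigma=0.5)
wide_early, wide_late = tail_masses(median=10.0, sigma=1.0)

print(f"narrow: P(<X/2)={narrow_early:.3f}, P(>2X)={narrow_late:.3f}")
print(f"wide:   P(<X/2)={wide_early:.3f}, P(>2X)={wide_late:.3f}")
```

Under these assumptions the extra uncertainty is symmetric in log space, so it adds mass to the early tail as well as the late one. The planning-fallacy exception would correspond to shifting the median later, which this sketch deliberately does not model.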

Another thing to mention re humility and caution is that it's very very easy for framing effects to bias your judgments of who is being humble and who isn't. For one thing it's easy to appear more humble than you are simply by claiming to be so. I could have preceded many of my sentences above with "I think we should be more cautious than that..." for example. For another thing when three people debate the middle person has an aura of humility and caution simply because they are the middle person. Relatedly when someone has a position which disagrees with the common wisdom, that position is unfairly labelled unhumble/incautious even when it's the common wisdom that is crazy.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2023-11-17T19:43:11.364Z · LW · GW

I lurk in the discord for The Treacherous Turn, a ttrpg made by some AI Safety Camp people I mentored. It's lovely. I encourage everyone to check it out.

Anyhow recently someone asked for ideas for Terminal Goals an AGI might have in a realistic setting; my answer is below and I'm interested to hear whether people here agree or disagree with it:

Insofar as you want it to be grounded, which you might not want, here are some hypotheses people in AI alignment would toss around as to what would actually happen:

(1) The AGI actually has exactly the goals and deontological constraints the trainers intended it to have. Insofar as they were training it to be honest, for example, it actually is honest in basically the human way; insofar as they were training it to make money for them, it deeply desires to make money for them, etc. Depending on how unethical and/or philosophically-careless the trainers were, such an AI could still be very bad news for the world.

(2) The AGI has some goals that caused it to perform well in training so far, but are not exactly what the trainers intended. For example perhaps the AGI's concepts of honesty and profit are different from the trainers' concepts, and now the AGI is smart enough to realize this, but it's too late because the AGI wants to be honest and profitable according to its definitions, not according to the trainers' definitions. For the RPG you could model this probably by doing a bit of mischievous philosophy and saying how the AGI's goal-concepts diverge from their human counterparts, e.g. 'the AGI only considers it dishonest if it's literally false in the way the AGI normally would interpret it, not if it's true-but-misleading or false-according-to-how-humans-might-interpret-it.' Another sub-type of this is that the AGI has additional goals besides the ones the trainers intended, e.g. they didn't intend for it to have a curiosity drive or a humans-trust-and-respect-me drive or a survival drive, but those drives were helpful for it in training so it has them now, even though they often conflict with the drives the trainers intended.

(3) The AGI doesn't really care at all about the stuff the trainers intended it to. Instead it just wants to get rewarded. It doesn't care about being honest, for example, it just wants to appear honest. Or worse it just wants the training process to continue assigning it high scores, even if the reason this is happening is because the trainers are dead and their keyboards have been hacked.

(4) The AGI doesn't really care at all about the stuff the trainers intended it to. Instead it has some extremely simple goals, the simplest possible goals that motivate it to think strategically and acquire power. Maybe something like survival. idk.

My rough sense is that AI alignment experts tend to think that 2 and 3 are the most likely outcomes of today's training methods, with 1 being in third place and 4 being in fourth place. But there's no consensus.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI #38: Let’s Make a Deal · 2023-11-17T18:55:52.915Z · LW · GW

I am trying to come up with a reason this isn’t 99%? Why this is the hard step at all?

tl;dr (pls correct me if this summary is wrong) the argument is that the more likely hypothesis is that the model will be simply myopically obsessed with getting reward. (Though I think Joe might still think that the model also might be actually aligned/honest/etc.? Not sure why, would need to reread.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A to Z of things · 2023-11-17T05:31:52.713Z · LW · GW


Why is the picture for "reference class" a bunch of seals? 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Evaluating the historical value misspecification argument · 2023-11-16T14:32:18.635Z · LW · GW

Bumping this in case you have more energy to engage now!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-15T18:09:11.808Z · LW · GW

Civil engineering projects and bringing consumer products to market are both exactly the sort of thing the planning fallacy applies to. I would just say what you've experienced is the planning fallacy, then. (It's not about the world, it's about our methods of forecasting -- when forecasting how long it will take to complete a project, humans seem to be systematically biased towards optimism.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-14T23:15:42.620Z · LW · GW

I think unknown unknowns are a different phenomenon than Hofstadter's Law / Planning Fallacy. My thinking on unknown unknowns is that they should make people spread out their timelines distribution, so that it has more mass later than they naively expect, but also more mass earlier than they naively expect. (Just as there are unknown potential blockers, there are unknown potential accelerants.) Unfortunately I think many people just do the former and not the latter, and this is a huge mistake.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on They are made of repeating patterns · 2023-11-14T18:42:36.378Z · LW · GW

We know that a good optimizer of outcomes over systems' states should have a model of the system inside of itself.

Fixing The Good Regulator Theorem — LessWrong

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-14T06:43:08.826Z · LW · GW

Nice analysis. Some thoughts:

1. If scaling continues with something like Chinchilla scaling laws, the 300x multiplier for compute will not be all lumped into increasing parameters / inference cost. Instead it'll be split roughly half and half. So maybe 20x more data/training time and 15x more parameters/inference cost. So, instead of $200/hr, we are talking more like $15/hr.

2. Hardware continues to improve in the near term; $/FLOP continues to drop, as far as I know. Of course during AI boom times the price will be artificially high due to all the demand... Not sure which direction the net effect will be.

3. Reaching human-level AI might involve trading off inference compute and training compute, as discussed in Davidson's model (see and linked report) which probably is a factor that increases inference compute of the first AGIs (while shortening timelines-to-AGI) perhaps by multiple OOMs.

4. However much it costs, labs will be willing to pay. An engineer that works 5x, 10x, 100x faster than a human is incredibly valuable, much more valuable than if they worked only at 1x speed like all the extremely high-salaried human engineers at AI labs.
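
To make the arithmetic in point 1 concrete, here is a rough sketch. It assumes training compute is proportional to parameters times training tokens, and that inference cost per hour scales linearly with parameter count; both are simplifying assumptions, and the function name and the `param_share` knob are made up for illustration. An exactly even split of 300x gives about 17x each, which is the same ballpark as the 20x / 15x figures above.

```python
import math


def split_compute_multiplier(multiplier: float, param_share: float = 0.5) -> tuple[float, float]:
    """Split a training-compute multiplier between parameters and data.

    Assumes compute ~ parameters * tokens (an assumption, not a law).
    param_share is the fraction of the multiplier, in log space, that goes
    to parameters; 0.5 is the even "half and half" split.
    """
    params_x = multiplier ** param_share
    data_x = multiplier ** (1.0 - param_share)
    return params_x, data_x


params_x, data_x = split_compute_multiplier(300.0)

# Hypothetical: if putting all 300x into parameters costs $200/hr to run,
# and cost per hour scales linearly with parameter count, then:
cost_all_params = 200.0
cost_even_split = cost_all_params * params_x / 300.0

print(f"parameters: {params_x:.1f}x, data: {data_x:.1f}x")
print(f"inference cost: ${cost_even_split:.2f}/hr instead of ${cost_all_params:.0f}/hr")
```

Under these assumptions the hourly cost drops by a bit more than an order of magnitude, which is the point of the comment; the exact dollar figure depends on how the split is rounded.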


Comment by Daniel Kokotajlo (daniel-kokotajlo) on Game Theory without Argmax [Part 2] · 2023-11-13T14:02:40.962Z · LW · GW

e upshot of higher-order game theory — the nash equilibria between optimisers is itself an optimiser!

Isn't this pretty trivial though? I guess it's still probably convenient for the math.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-12T06:58:38.756Z · LW · GW

I think you aren't engaging with the reasons why smart people think that 1000x energy consumption could happen soon. It's all about the growth rates. Obviously anything that looks basically like human industrial society won't be getting to 1000x in the next 20 years; the concern is that a million superintelligences commanding an obedient human nation-state might be able to design a significantly faster-growing economy. For an example of how I'm thinking about this, see this comment.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-11T00:55:31.311Z · LW · GW

Yep, I love betting about stuff like that. Got any suggestions for how to objectively operationalize it? Or a trusted third party arbiter?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-10T22:50:22.867Z · LW · GW

As previously discussed a couple times on this website, it's not rational for me to make bets on my beliefs about these things. Because I either won't be around to collect if I win, or won't value the money nearly as much. And because I can get better odds on the public market simply by taking out a loan.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on GPT-2030 and Catastrophic Drives: Four Vignettes · 2023-11-10T19:19:00.923Z · LW · GW

k 5x as quickly

Why only 5x? Didn't you yourself argue it could easily be made to go much faster than that?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Concrete positive visions for a future without AGI · 2023-11-10T19:16:20.201Z · LW · GW

Look, I'm not here to argue about the long-term trajectory of space flight with you, I'm here to object to your false and misleading claim about SpaceX. If you concede that point then I'll go away.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-10T18:21:47.905Z · LW · GW


(a) 99% remotable 2023 tasks automatable (the thing we forecast in the OP)
(b) 99% 2023 tasks automatable
(c) 99% 2023 tasks automated
(d) Overpower ability

My best guess at the ordering will be a->d->b->c. 

Rationale: Overpower ability probably requires something like a fully functioning general purpose agent capable of doing hardcore novel R&D. So, (a). However it probably doesn't require sophisticated robots, of the sort you'd need to actually automate all 2023 tasks. It certainly doesn't require actually having replaced all human jobs in the actual economy, though for strategic reasons a coalition of powerful misaligned AGIs would plausibly wait to kill the humans until they had actually rendered the humans unnecessary.

My best guess is that a, d, and b will all happen in the same year, possibly within the same month. c will probably take longer for reasons sketched above.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-10T17:41:25.075Z · LW · GW

Good catch. Let me try to reconstruct my reasoning:

  • I was probably somewhat biased towards a longer gap because I knew I'd be discussing with Ege who is very skeptical (I think?) that even a million superintelligences in control of the entire human society could whip it into shape fast enough to grow 1000x in less than a decade. So I probably was biased towards 'conservatism.' (in scare quotes because the direction that is conservative vs. generous is determined by what other people think, not by the evidence and facts of the case)
  • As Habryka says, I think there's a gap between 99% automatable and 99% automated. I think the gap between AI R&D being 99% automatable and being actually automated will be approximately one day, unless there is deliberate effort to slow down. But automating the world economy will take longer because there won't be enough compute to replace all the jobs, many jobs will be 'sticky' and people won't just be immediately laid off, many jobs are partially physical and thus would require robots to fully automate, robots which need to be manufactured, etc.
  • I also think there's a gap between a fully automated economy and 1000x energy consumption. Napkin math: Say your nanobots / nanofactories / optimized robo-miner-factory-complexes are capable of reproducing themselves (doubling in size) every month. And say you start with 1000 tons worth of them, produced with various human tools in various human laboratories. Then a year later you'll only have 4M tons, and a year after that 16B tons... it'll take a while to overtake the human economy and then about a year after that you get to 1000x energy consumption. Is a one-month doubling time a reasonable estimate? I have no idea, I could imagine it being significantly faster but also somewhat slower. (Faster scenario: Nanobots/nanofactories that are like bacteria but better. Doubling times like one hour or so. Slower scenario: The tools to build nanobots/nanofactories don't exist, so you need to build the tools to build the tools to build the tools to build them. And this just takes serial time; maybe each stage takes six months. Another slower scenario: Nanobots etc. are possible but not with a doubling time measured in hours; in harsh environments like earth's oceans and surfaces, doubling time even for the best nanobots is measured in weeks. Instead of "like bacteria but better," it's "like grass but better." Even slower scenario: Nanobots/nanofactories just aren't possible even for superintelligences except maybe if they are able to do massive experiments to search through the space of all possible designs or something like that. Which they aren't. So they get by for now with ordinary robots digging and refining and manufacturing stuff, which has a doubling time of almost a year. "Like human industrial economy but better." (Tesla factories produce about their weight in cars every year I think. Rough estimate, could be off by an OOM.))
  • I'd love to see more serious analysis along the lines I sketched above, of what the plausible fastest doubling times are and how long it might take for a million ASIs with obedient human nations to get there. My current views are very uncertain and unstable.
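
The napkin math in the third bullet can be checked with a few lines. This is only a sketch of pure exponential doubling with no resource limits; the function names are made up for illustration, and the 12-month case stands in for the "almost a year" doubling time of the ordinary-robots scenario.

```python
import math


def mass_after(initial_tons: float, doubling_months: float, months: float) -> float:
    """Mass after `months` of unconstrained exponential doubling."""
    return initial_tons * 2 ** (months / doubling_months)


def months_to_multiply(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor` at a fixed doubling time."""
    return doubling_months * math.log2(factor)


# Start with 1000 tons and a one-month doubling time, as in the comment.
after_one_year = mass_after(1000, 1, 12)   # about 4M tons
after_two_years = mass_after(1000, 1, 24)  # about 16B tons

# Growing 1000x takes about 10 doublings, whatever the doubling time is.
fast = months_to_multiply(1000, 1)    # one-month doubling: roughly 10 months
slow = months_to_multiply(1000, 12)   # one-year doubling: roughly 10 years

print(f"{after_one_year:,.0f} tons after 1 year, {after_two_years:,.0f} tons after 2 years")
print(f"1000x takes {fast:.1f} months (monthly doubling) or {slow / 12:.1f} years (yearly doubling)")
```

The second pair of numbers is why the doubling time matters so much here: the same 1000x takes under a year in the fast case and about a decade in the slow one.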
Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2023-11-10T17:21:37.050Z · LW · GW

Here's a sketch for what I'd like to see in the future--a better version of the scenario experiment done above:

  • 2-4 people sit down for a few hours together.
  • For the first 1-3 hours, they each write a Scenario depicting their 'median future' or maybe 'modal future.' The scenarios are written similarly to the one I wrote above, with dated 'stages.' The scenarios finish with superintelligence, or else it-being-clear-superintelligence-is-many-decades-away-at-least.
  • As they write, they also read over each other's scenarios and ask clarifying questions. E.g. "You say that in 2025 they can code well but unreliably -- what do you mean exactly? How much does it improve the productivity of, say, OpenAI engineers?"
  • By the end of the period, the scenarios are finished & everyone knows roughly what each stage means because they've been able to ask clarifying questions.
  • Then for the next hour or so, they each give credences for each stage of each scenario. Credences in something like "ASI by year X" where X is the year ASI happens in the scenario.
  • They also of course discuss and critique each other's credences, and revise their own.
  • At the end, hopefully some interesting movements will have happened in people's mental models and credences, and hopefully some interesting cruxes will have surfaced -- e.g. it'll be more clear what kinds of evidence would actually cause timelines updates, were they to be observed.
  • The scenarios, credences, and maybe a transcript of the discussion then get edited and published.
Comment by Daniel Kokotajlo (daniel-kokotajlo) on Concrete positive visions for a future without AGI · 2023-11-10T17:12:20.035Z · LW · GW

"X got its start with government subsidies and contracts" is a veeeeerrry different claim from "X is not even economically viable without government subsidies." The distinction between subsidies and contracts is important, and the distinction between getting started and long-term viability is important.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Concrete positive visions for a future without AGI · 2023-11-10T05:16:46.158Z · LW · GW

So, you retract your claim that SpaceX is not economically viable without government subsidies?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Concrete positive visions for a future without AGI · 2023-11-09T18:07:04.002Z · LW · GW

What you should have said, therefore, is "Dath ilan is fiction; it's debatable whether the premises of the world would actually result in the happy conclusion depicted. However, I think it's probably directionally correct -- it does seem to me that if Eliezer was the median, the world would be dramatically better overall, in roughly the ways depicted in the story."

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Concrete positive visions for a future without AGI · 2023-11-09T18:04:17.785Z · LW · GW

SpaceX itself isn’t even economically viable without government subsidies

I'm pretty sure that's false. Starlink is a money printer and SpaceX dominates the commercial market.

And not only false, but completely backwards. The government has been much more favorable to SpaceX's competitors than to SpaceX, and on an objective scale I think if SpaceX didn't have to delay so much to get approval from FAA and Fish & Wildlife, they'd be significantly farther ahead right now.

It's probably true that SpaceX wouldn't have gotten off the ground early on without some help (mostly in the form of contracts) from the govt, but that's not what you said -- you said "isn't even economically viable," not "would have needed more VC investment to get started but would have easily paid it back by now"

As for guaranteeing rockets won't explode on launch... SpaceX is getting there. Give them another decade & they'll be there I think. The key, as they always say, is to do it so freaking many times that every possible way things can go wrong has in fact gone wrong in the past and been fixed. Like with commercial flights.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on On the UK Summit · 2023-11-07T20:07:51.565Z · LW · GW

If a true international framework is to be in place to fully check dangerous developments, eventually China will not be enough, and we will want to bring Russia and even North Korea in, although if China is fully on board that makes that task far easier.

Yeah I'm not too worried. If China and USA are both on board, Russia and NK etc. can and will be brought in line. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on On the UK Summit · 2023-11-07T20:04:25.086Z · LW · GW

We also recommend defining clear red lines that, if crossed, mandate immediate termination of an AI system — including all copies — through rapid and safe shut-down procedures. Governments should cooperate to instantiate and preserve this capacity. Moreover, prior to deployment as well as during training for the most advanced models, developers should demonstrate to regulators’ satisfaction that their system(s) will not cross these red lines.

Aw hell yeah

Comment by Daniel Kokotajlo (daniel-kokotajlo) on On the UK Summit · 2023-11-07T20:03:52.740Z · LW · GW

Moreover, there is a serious risk that future AI systems may escape human control altogether.

Glad this sentence got in there, especially when paired with the next two sentences.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on On the UK Summit · 2023-11-07T20:02:43.787Z · LW · GW

Prof. Bengio called upon AI developers to “demonstrate the safety of their approach before training and deploying” AI systems, while Prof. Russell concurred that “if they cannot do that, they cannot build or deploy their systems. Full stop.”

Whoa this is actually great! Not safetywashing at all; I'd consider it significant progress if these statements get picked up and repeated a lot and enshrined in Joint Statements.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Preventing Language Models from hiding their reasoning · 2023-11-07T19:48:32.125Z · LW · GW

Ah OK, thanks! I think I get it now.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on leogao's Shortform · 2023-11-07T00:05:03.703Z · LW · GW

Continuing the analogy:

You'd want there to be a Tribe, or perhaps two or more Tribes, that aggressively detect and smack down any tribalism that isn't their own. It needs to be the case that e.g. when some academic field starts splintering into groups that stereotype and despise each other, or when people involved in the decision whether to X stop changing their minds frequently and start forming relatively static 'camps,' the main Tribe(s) notice this and squash it somehow. 

And/or maybe arrange things so it never happens in the first place.

I wonder if this sorta happens sometimes when there is an Official Religion?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2023-11-06T23:52:46.896Z · LW · GW

I remember being interested (and maybe slightly confused) when I read about the oft-bloody transition from hereditary monarchies to democracies and dictatorships. Specifically it interested me that so many smart, reasonable, good people seemed to be monarchists. Even during anarchic periods of civil war, the factions tended to rally around people with some degree of legitimate claim to the throne, instead of the whole royal lineage being abandoned and factions arising based around competence and charisma. Did these smart people literally believe in some sort of magic, some sort of divine power imbued in a particular bloodline? Really? But there are so many cases where the last two or three representatives of that bloodline have been obviously incompetent and/or evil!

Separately, I always thought it was interesting (and frustrating) how the two-party system in the USA works. As satirized by The Simpsons: 

Like, I'm pretty sure there are at least, idk, a thousand eligible persons X in the USA such that >50% of voters would prefer X as president over both Trump and Biden. (E.g. someone who most Democrats think is slightly better than Biden AND most Republicans think is slightly better than Trump.) So why doesn't one of those thousand people run for president and win? (This is a rhetorical question; I know the answer.)

It occurs to me that maybe these things are related. Maybe in a world of monarchies where the dynasty of so-and-so has ruled for generations, supporting someone with zero royal blood is like supporting a third-party candidate in the USA.