What exactly is GPT-3's base objective? 2021-11-10T00:57:35.062Z
P₂B: Plan to P₂B Better 2021-10-24T15:21:09.904Z
Blog Post Day IV (Impromptu) 2021-10-07T17:17:39.840Z
Is GPT-3 already sample-efficient? 2021-10-06T13:38:36.652Z
Growth of prediction markets over time? 2021-09-02T13:43:38.869Z
What 2026 looks like 2021-08-06T16:14:49.772Z
How many parameters do self-driving-car neural nets have? 2021-08-06T11:24:59.471Z
Two AI-risk-related game design ideas 2021-08-05T13:36:38.618Z
Did they or didn't they learn tool use? 2021-07-29T13:26:32.031Z
How much compute was used to train DeepMind's generally capable agents? 2021-07-29T11:34:10.615Z
DeepMind: Generally capable agents emerge from open-ended play 2021-07-27T14:19:13.782Z
What will the twenties look like if AGI is 30 years away? 2021-07-13T08:14:07.387Z
Taboo "Outside View" 2021-06-17T09:36:49.855Z
Vignettes Workshop (AI Impacts) 2021-06-15T12:05:38.516Z
ML is now automating parts of chip R&D. How big a deal is this? 2021-06-10T09:51:37.475Z
What will 2040 probably look like assuming no singularity? 2021-05-16T22:10:38.542Z
How do scaling laws work for fine-tuning? 2021-04-04T12:18:34.559Z
Fun with +12 OOMs of Compute 2021-03-01T13:30:13.603Z
Poll: Which variables are most strategically relevant? 2021-01-22T17:17:32.717Z
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain 2021-01-18T12:08:13.418Z
How can I find trustworthy dietary advice? 2021-01-17T13:11:54.158Z
Review of Soft Takeoff Can Still Lead to DSA 2021-01-10T18:10:25.064Z
DALL-E by OpenAI 2021-01-05T20:05:46.718Z
Dario Amodei leaves OpenAI 2020-12-29T19:31:04.161Z
Against GDP as a metric for timelines and takeoff speeds 2020-12-29T17:42:24.788Z
How long till Inverse AlphaFold? 2020-12-17T19:56:14.474Z
Incentivizing forecasting via social media 2020-12-16T12:15:01.446Z
What are the best precedents for industries failing to invest in valuable AI research? 2020-12-14T23:57:08.631Z
What technologies could cause world GDP doubling times to be <8 years? 2020-12-10T15:34:14.214Z
The AI Safety Game (UPDATED) 2020-12-05T10:27:05.778Z
Is this a good way to bet on short timelines? 2020-11-28T12:51:07.516Z
Persuasion Tools: AI takeover without AGI or agency? 2020-11-20T16:54:01.306Z
How Roodman's GWP model translates to TAI timelines 2020-11-16T14:05:45.654Z
How can I bet on short timelines? 2020-11-07T12:44:20.360Z
What considerations influence whether I have more influence over short or long timelines? 2020-11-05T19:56:12.147Z
AI risk hub in Singapore? 2020-10-29T11:45:16.096Z
The date of AI Takeover is not the day the AI takes over 2020-10-22T10:41:09.242Z
If GPT-6 is human-level AGI but costs $200 per page of output, what would happen? 2020-10-09T12:00:36.814Z
Where is human level on text prediction? (GPTs task) 2020-09-20T09:00:28.693Z
Forecasting Thread: AI Timelines 2020-08-22T02:33:09.431Z
What if memes are common in highly capable minds? 2020-07-30T20:45:17.500Z
What a 20-year-lead in military tech might look like 2020-07-29T20:10:09.303Z
Does the lottery ticket hypothesis suggest the scaling hypothesis? 2020-07-28T19:52:51.825Z
Probability that other architectures will scale as well as Transformers? 2020-07-28T19:36:53.590Z
Lessons on AI Takeover from the conquistadors 2020-07-17T22:35:32.265Z
What are the risks of permanent injury from COVID? 2020-07-07T16:30:49.413Z
Relevant pre-AGI possibilities 2020-06-20T10:52:00.257Z
Image GPT 2020-06-18T11:41:21.198Z
List of public predictions of what GPT-X can or can't do? 2020-06-14T14:25:17.839Z
Preparing for "The Talk" with AI projects 2020-06-13T23:01:24.332Z


Comment by Daniel Kokotajlo (daniel-kokotajlo) on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-02T18:56:32.921Z · LW · GW

I feel like a big crux is whether Platt's Law is true:

Eliezer:  I mean, in fact, part of my actual sense of indignation at this whole affair, is the way that Platt's law of strong AI forecasts - which was in the 1980s generalizing "thirty years" as the time that ends up sounding "reasonable" to would-be forecasters - is still exactly in effect for what ends up sounding "reasonable" to would-be futurists, in fricking 2020 while the air is filling up with AI smoke in the silence of nonexistent fire alarms.

Didn't AI Impacts look into this a while back? See e.g. this dataset. Below is one of the graphs:


Comment by Daniel Kokotajlo (daniel-kokotajlo) on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-02T18:52:58.054Z · LW · GW

I think this does not do justice to Ajeya's bio anchors model. Carl already said the important bits, but here are some more points:

But, if you insist on the error of anchoring on biology, you could perhaps do better by seeing a spectrum between two bad anchors.  This lets you notice a changing reality, at all, which is why I regard it as a helpful thing to say to you and not a pure persuasive superweapon of unsound argument.  Instead of just fixating on one bad anchor, the hybrid of biological anchoring with whatever knowledge you currently have about optimization, you can notice how reality seems to be shifting between two biological bad anchors over time, and so have an eye on the changing reality at all.  Your new estimate in terms of gradient descent is stepping away from evolutionary computation and toward the individual-brain estimate by ten orders of magnitude, using the fact that you now know a little more about optimization than natural selection knew; and now that you can see the change in reality over time, in terms of the two anchors, you can wonder if there are more shifts ahead.

This is exactly what the bio anchors framework is already doing? It has the lifetime anchor on one end, and the evolution anchor on the other end, and almost all probability mass is in between, and then it has a parameter for how that mass shifts leftwards over time as new ideas come along. I do agree that the halving-of-compute-costs-every-2.5-years estimate seems too slow to me; it seems like that's the rate of "normal incremental progress" but that when you account for the sort of really important ideas (or accumulations of ideas, or shifts in research direction towards more fruitful paths) that happen about once a decade, the rate should be faster than that. I think this because when I imagine what the field of AI looks like in 2040, I have a hard time believing it looks anything like the sort of paradigm the medium-horizon or long-horizon anchors are built around, with big neural nets trained by gradient descent etc. I think that instead something significantly better/more capable/more efficient will have been found by then. (And I think, unfortunately, that we don't really have much room for further improvement before we get to AGI! If, say, current methods have a 50% chance of working, then significantly-better-than-current-methods should bring our credence up to well over 50%.)

Realistically, though, I would not recommend eyeballing how much more knowledge you'd think you'd need to get even larger shifts, as some function of time, before that line crosses the hardware line.  Some researchers may already know Thielian secrets you do not, that take those researchers further toward the individual-brain computational cost (if you insist on seeing it that way).  That's the direction that economics rewards innovators for moving in, and you don't know everything the innovators know in their labs.
When big inventions finally hit the world as newspaper headlines, the people two years before that happens are often declaring it to be fifty years away; and others, of course, are declaring it to be two years away, fifty years before headlines.  Timing things is quite hard even when you think you are being clever; and cleverly having two biological anchors and eyeballing Reality's movement between them, is not the sort of cleverness that gives you good timing information in real life.
In real life, Reality goes off and does something else instead, and the Future does not look in that much detail like the futurists predicted.  In real life, we come back again to the same wiser-but-sadder conclusion given at the start, that in fact the Future is quite hard to foresee - especially when you are not on literally the world's leading edge of technical knowledge about it, but really even then.  If you don't think you know any Thielian secrets about timing, you should just figure that you need a general policy which doesn't get more than two years of warning, or not even that much if you aren't closely non-dismissively analyzing warning signs.

This seems true but changing the subject. Insofar as the subject is "what should our probability distribution over date-of-AGI-creation look like" then Ajeya's framework (broadly construed) is the right way to think about it IMO. Separately, we should worry that this will never let us predict with confidence that it is happening in X years, and thus we should be trying to have a general policy that lets us react quickly to e.g. two years of warning.

OpenPhil:  I don't understand how some of your reasoning could be internally consistent even on its own terms.  ... You can either say that our forecasted pathway to AGI or something very much like it would probably work in principle without requiring very much more computation than our uncertain model components take into account, meaning that the probability distribution provides a soft upper bound on reasonably-estimable arrival times, but that paradigm shifts will predictably provide an even faster way to do it before then.  That is, you could say that our estimate is both a soft upper bound and also a directional overestimate.  Or, you could say that our ignorance of how to create AI will consume more than one order-of-magnitude of increased computation cost above biology -
Eliezer:  Indeed, much as your whole proposal would supposedly cost ten trillion times the equivalent computation of the single human brain that earlier biologically-inspired estimates anchored on.
OpenPhil:  - in which case our 2050-centered distribution is not a good soft upper bound, but also not predictably a directional overestimate.  Don't you have to pick one or the other as a critique, there?

I think OpenPhil is totally right here. My own stance is that the 2050-centered distribution is a directional overestimate because e.g. the long-horizon anchor is a soft upper bound (in fact I think the medium-horizon anchor is a soft upper bound too; see Fun with +12 OOMs).

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-02T17:55:16.447Z · LW · GW

I'm especially keen to hear responses to this point:

Eliezer:  Backtesting this viewpoint on the previous history of computer science, it seems to me to assert that it should be possible to:
Train a pre-Transformer RNN/CNN-based model, not using any other techniques invented after 2017, to GPT-2 levels of performance, using only around 2x as much compute as GPT-2;
Play pro-level Go using 8-16 times as much computing power as AlphaGo, but only 2006 levels of technology.
Your model apparently suggests that we have gotten around 50 times more efficient at turning computation into intelligence since that time; so, we should be able to replicate any modern feat of deep learning performed in 2021, using techniques from before deep learning and around fifty times as much computing power.
OpenPhil:  No, that's totally not what our viewpoint says when you backfit it to past reality.  Our model does a great job of retrodicting past reality.

My guess is that Ajeya / OpenPhil would say "The halving-in-costs every 2.5 years is on average, not for everything. Of course there are going to be plenty of things for which algorithmic progress has been much faster. There are also things for which algorithmic progress has been much slower. And we didn't pull 2.5 out of our ass, we got it from fitting to past data."

This seems to rebut the specific point EY made but also seems to support his more general skepticism about this method. What we care about is algorithmic progress relevant to AGI or APS-AI, and if that could be orders of magnitude faster or slower than halving every 2.5 years...
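To make the backtest arithmetic concrete, here is a minimal sketch, assuming a constant 2.5-year halving time for compute cost (the figure quoted above). The function name and the year gaps are illustrative assumptions of mine, not something taken from Ajeya's report.

```python
def efficiency_gain(years: float, halving_time: float = 2.5) -> float:
    """Factor by which compute cost falls after `years`, assuming a constant halving time."""
    return 2.0 ** (years / halving_time)


# Hypothetical gaps, chosen to roughly match the examples in the quote.
for label, years in [("2.5 years", 2.5), ("10 years", 10.0), ("15 years", 15.0)]:
    print(f"{label}: about {efficiency_gain(years):.0f}x cheaper")
```

Under this assumption a 10-year gap gives 16x and a 15-year gap gives 64x, which is the same ballpark as the "8-16 times" and "around 50 times" figures in the quote. The sketch only covers the average rate; it says nothing about how much individual domains deviate from it.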

Comment by Daniel Kokotajlo (daniel-kokotajlo) on The Finale of the Ultimate Meta Mega Crossover · 2021-12-02T04:00:45.502Z · LW · GW

Or maybe when the Blight fleet reaches the Countermeasure it can undo it and revive the Blight, so by delaying the fleet arrival they make it possible for the Tines to build up fleets of their own and defend the Countermeasure?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on The Finale of the Ultimate Meta Mega Crossover · 2021-12-02T03:37:05.039Z · LW · GW

It's been an honor.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2021-11-30T00:45:12.353Z · LW · GW

Right. So, what do you think about the AI-timelines-related claim then? Will we need medium or long-horizon training for a number of episodes within an OOM or three of parameter count to get something x-risky?

ETA: To put it more provocatively: If EfficientZero can beat humans at Atari using less game experience starting from a completely blank slate whereas humans have decades of pre-training, then shouldn't a human-brain-sized EfficientZero beat humans at any intellectual task given decades of experience at those tasks + decades of pre-training similar to human pre-training?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-29T21:17:56.443Z · LW · GW

It's not consensus. Ajeya, Richard, Paul, and Rohin are prominent examples of people widely considered to have expertise on this topic who think it's not true. (I think they'd say something more like 10% chance? IDK)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2021-11-29T14:57:25.004Z · LW · GW

I used to think that current AI methods just aren't nearly as sample/data-efficient as humans. For example, GPT-3 had to read 300B tokens of text whereas humans encounter 2-3 OOMs less, various game-playing AIs had to play hundreds of years' worth of games to get gud, etc.
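For scale, a quick sketch of what "2-3 OOMs less" than 300B tokens works out to. The 300B figure is the one quoted above; the function name is just for illustration.

```python
GPT3_TOKENS = 300e9  # training tokens, as quoted in the comment


def tokens_after_reduction(ooms_less: float, baseline: float = GPT3_TOKENS) -> float:
    """Token count that is `ooms_less` orders of magnitude below the baseline."""
    return baseline / 10 ** ooms_less


for ooms in (2, 3):
    print(f"{ooms} OOMs less: {tokens_after_reduction(ooms):,.0f} tokens")
```

So the rough human range implied by that claim is a few hundred million to a few billion tokens.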

Plus various people with 20-40 year AI timelines seem to think it's plausible -- in fact, probable -- that unless we get radically new and better architectures, this will continue for decades, meaning that we'll get AGI only when we can actually train AIs on medium or long-horizon tasks for a ridiculously large amount of data/episodes.

So EfficientZero came as a surprise to me, though it wouldn't have surprised me if I had been paying more attention to that part of the literature.

What gives?

Inspired by this comment:

in linguistic there is an argument called the poverty of stimulus. The claim is that children must figure out the rules of language using only a limited number of unlabeled examples. This is taken as evidence that the brain has some kind of hard-wired grammar framework, that serves as a canvas for further learning while growing up. 
Is it possible that tools like EfficientZero help find the fundamental limits for how much training data you need to figure out a set of rules? If an artificial neural network ever manages to reconstruct the rules of English by using only the stimulus that the average children is exposed too, that would be a strong counter-argument against poverty of stimulus.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Christiano, Cotra, and Yudkowsky on AI progress · 2021-11-27T20:18:56.968Z · LW · GW

Is that one dense or sparse/MoE? How many data points was it trained for? Does it set SOTA on anything? (I'm skeptical; I'm wondering if they only trained it for a tiny amount, for example.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on EfficientZero: How It Works · 2021-11-27T11:53:39.717Z · LW · GW

Thank you so much for writing this! Strong-upvoted.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-26T13:06:48.654Z · LW · GW

That's helpful, thanks!

To be clear, I think that if EY put more effort into it (and perhaps had some help from other people as RAs) he could write a book or sequence rebutting Paul & Katja much more thoroughly and convincingly than this post did. [ETA: I.e. I'm much more on Team Yud than Team Paul here.] The stuff said here felt like a rehashing of stuff from IEM and the Hanson-Yudkowsky AI foom debate to me. [ETA: Lots of these points were good! Just not surprising to me, and not presented as succinctly and compellingly (to an audience of me) as they could have been.]

Also, it's plausible that a lot of what's happening here is that I'm conflating my own cruxes and confusions for The Big Points EY Objectively Should Have Covered To Be More Convincing. :)

ETA: And the fact that people updated towards EY on average, and significantly so, definitely updates me more towards this hypothesis!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-26T01:48:30.148Z · LW · GW

I think I was expecting somewhat better from EY; I was expecting more solid, well-explained arguments/rebuttals to Paul's points from "Takeoff Speeds." Also EY seemed to be angry and uncharitable, as opposed to calm and rational. I was imagining an audience that mostly already agrees with Paul encountering this and being like "Yeah this confirms what we already thought."

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-25T17:09:14.710Z · LW · GW

My prediction was mainly about polarization rather than direction, but I would have expected the median or average to not move much probably, and to be slightly more likely to move towards Paul than towards Yudkowsky. I think. I don't think I was very surprised.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ngo and Yudkowsky on alignment difficulty · 2021-11-25T11:55:13.859Z · LW · GW

Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine.

I don't know the relevant history of science, but I wouldn't be surprised if something like the opposite was true: Our modern, very useful understanding of work is an abstraction that grew out of many people thinking concretely about various engines. Thinking about engines was like the homework exercises that helped people to reach and understand the concept of work.

Similarly, perhaps it is pedagogically (and conceptually) helpful to begin with the notion of a consequentialist and then generalize to outcome pumps.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-25T11:21:35.462Z · LW · GW

Wow, I did not expect those results!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Coordinating the Unequal Treaties · 2021-11-25T11:00:03.728Z · LW · GW

Huh, that's interesting & good to know. Seems that Most Favored Nation is very much still a thing today:

Does it perhaps have an advantage for the Japanese, namely that the four powers will be less motivated to demand concessions because said concessions would also go to their rivals?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-24T18:55:43.993Z · LW · GW

I don't think they'd even need to be raised to think that; they'd figure it out on their own. Unfortunately we don't have enough time.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-24T15:16:28.974Z · LW · GW

Hot damn, where can I see these preliminary results?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-24T14:07:51.362Z · LW · GW

Sorry! I'll go back and insert links + reference your comment.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What exactly is GPT-3's base objective? · 2021-11-24T11:13:55.152Z · LW · GW

Ahhh, OK. Then perhaps I just was using inappropriate words; it sounds like what I meant to refer to by 4 was the same as what you meant to refer to by 3.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-23T22:02:15.998Z · LW · GW

Fair enough! I too dislike premature meta, and feel bad that I engaged in it. However... I do still feel like my comment probably did more to prevent polarization than cause it? That's my independent impression at any rate. (For the reasons you mention).

I certainly don't want to give up! In light of your pushback I'll edit to add something at the top.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-23T17:44:35.933Z · LW · GW

Yes, though I'm much more comfortable explaining and arguing for my own position than EY's. It's just that my position turns out to be pretty similar. (Partly this is independent convergence, but of course partly this is causal influence since I've read a lot of his stuff.)

There's a lot to talk about, I'm not sure where to begin, and also a proper response would be a whole research project in itself. Fortunately I've already written a bunch of it; see these two sequences.

Here are some quick high-level thoughts:

1. Begin with timelines. The best way to forecast timelines IMO is Ajeya's model; it should be the starting point and everything else should be adjustments from it. The core part of Ajeya's model is a probability distribution over how many OOMs of compute we'd need with today's ideas to get to TAI / AGI / APS-AI / AI-PONR / etc. [Unfamiliar with these acronyms? See Robbo's helpful comment below] For reasons which I've explained in my sequence (and summarized in a gdoc) my distribution has significantly more mass on the 0-6 OOM range than Paul does, and less on the 13+ range. The single post that conveys this intuition most is Fun with +12 OOMs.

Now consider how takeoff speed views interact with timelines views. Paul-slow takeoff and <10 year timelines are in tension with each other. If <7 OOMs of compute would be enough to get something crazy powerful with today's ideas, then the AI industry is not an efficient market right now. If we get human-level AGI in 2030, then on Paul's view that means the world economy should be doubling in 2029 and should have doubled over the course of 2025-2028 and should already be accelerating now probably. It doesn't look like that's happening or about to happen. I think Paul agrees with this; in various conversations he's said things like "If AGI happens in 10 years or less then probably we get fast takeoff." [Paul please correct me if I'm mischaracterizing your view!]
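As a rough illustration of why this is in tension with what we see today, here is a minimal sketch converting doubling times into the constant annual growth rate they imply. The four-year and one-year windows come from the scenario above; the function name is my own.

```python
def annual_growth(doubling_years: float) -> float:
    """Constant annual growth rate that doubles output in `doubling_years`."""
    return 2.0 ** (1.0 / doubling_years) - 1.0


print(f"4-year doubling: {annual_growth(4):.1%} per year")
print(f"1-year doubling: {annual_growth(1):.1%} per year")
```

A four-year doubling needs sustained growth of nearly 19% per year, which is far above anything the world economy is doing now.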

Ajeya (and Paul) mostly update against <10 year timelines for this reason. I, by contrast, mostly update against slow takeoff. (Obviously we both do a bit of both, like good Bayesians.)

2. I feel like the debate between EY and Paul (and the broader debate about fast vs. slow takeoff) has been frustratingly much reference class tennis and frustratingly little gears-level modelling. This includes my own writing on the subject -- lots of historical analogies and whatnot. I've tentatively attempted some things sorta like gears-level modelling (arguably What 2026 Looks Like is an example of this) and so far it seems to be pushing my intuitions more towards "Yep, fast takeoff is more likely." But I feel like my thinking on this is super inadequate and I think we all should be doing better. Shame! Shame on all of us!

3. I think the focus on GDP (especially GWP) is really off, for reasons mentioned here. I think AI-PONR will probably come before GWP accelerates, and at any rate what we care about for timelines and takeoff speeds is AI-PONR and so our arguments should be about e.g. whether there will be warning shots and powerful AI tools of the sort that are relevant to solving alignment for APS-AI systems.

(Got to go now)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-23T11:46:28.410Z · LW · GW

[ETA: In light of pushback from Rob: I really don't want this to become a self-fulfilling prophecy. My hope in making this post was to make the prediction less likely to come true, not more! I'm glad that MIRI & Eliezer are publicly engaging with the rest of the community more again, I want that to continue, and I want to do my part to help everybody to understand each other.]

And I know, before anyone bothers to say, that all of this reply is not written in the calm way that is right and proper for such arguments. I am tired. I have lost a lot of hope. There are not obvious things I can do, let alone arguments I can make, which I expect to be actually useful in the sense that the world will not end once I do them. I don't have the energy left for calm arguments. What's left is despair that can be given voice.

I grimly predict that the effect of this dialogue on the community will be polarization: People who didn't like Yudkowsky and/or his views will like him / his views less, and the gap between them and Yud-fans will grow (more than it shrinks due to the effect of increased dialogue). I say this because IMO Yudkowsky comes across as angry and uncharitable in various parts of this dialogue, and also I think it was kinda a slog to get through & it doesn't seem like much intellectual progress was made here.

FWIW I continue to think that Yudkowsky's model of how the future will go is basically right, at least more right than Christiano's. This is a big source of sadness and stress for me too, because (for example) my beloved daughter probably won't live to adulthood.

The best part IMO was the mini-essay at the end about Thielian secrets and different kinds of tech progress -- a progression of scenarios adding up to Yudkowsky's understanding of Paul's model:

But we can imagine that doesn't happen either, because instead of needing to build a whole huge manufacturing plant, there's just lots and lots of little innovations adding up to every key AGI threshold, which lots of actors are investing $10 million in at a time, and everybody knows which direction to move in to get to more serious AGI and they're right in this shared forecast.

It does seem to me that the AI industry will move more in this direction than it currently is, over the next decade or so. However I still do expect that we won't get all the way there. I would love to hear from Paul whether he endorses the view Yudkowsky attributes to him in this final essay.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ngo and Yudkowsky on alignment difficulty · 2021-11-22T15:22:12.376Z · LW · GW

For (a): Deception is a convergent instrumental goal; you get it “for free” when you succeed in making an effective system, in the sense that the simplest, most-likely-to-be-randomly-generated effective systems are deceptive. Corrigibility by contrast is complex and involves making various nuanced decisions between good and bad sorts of influence on human behavior.

For (b): If you take an effective system and modify it to be corrigible, this will tend to make it less effective. By contrast, deceptiveness (insofar as it arises “naturally” as a byproduct of pursuing convergent instrumental goals effectively) does not “get in the way” of effectiveness, and even helps in some cases!

Ngo’s (and Shah’s) position (we think) is that the data we’ll be using to select our systems will be heavily entangled with human preferences - we’ll indeed be trying to use human preferences to guide and shape the systems - so there’s a strong bias towards actually learning them. You don’t have to get human preferences right in all their nuance and detail to know some basic things like that humans generally don’t want to die or be manipulated/deceived. I think they mostly bounce off the claim that “effectiveness” has some kind of “deep underlying principles” that will generalise better than any plausible amount of human preference data actually goes into building the effective system. We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”

It seems to us that Ngo, Shah, etc. draw intuitive support from analogy to humans, whereas Yudkowsky etc. draw intuitive support from the analogy to programs and expected utility equations.

If you are thinking about a piece of code that describes a Bayesian EU-maximizer, and then you try to edit the code to make the agent corrigible, it’s obvious that (a) you don’t know how to do that, and (b) if you did figure it out, the code you add would be many orders of magnitude longer than the code you started with.

If instead you are thinking about humans, it seems like you totally could be corrigible if you tried, and it seems like you might totally have tried if you had been raised in the right way (e.g. if your parents had lovingly but strictly trained you to be corrigible-in-way-X.)

We think Yudkowsky’s response to this apparent counterexample is that humans are stupid, basically; AIs might be similarly stupid at first, but as they get smarter we should expect crude corrigibility-training techniques to stop working.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-22T14:27:48.077Z · LW · GW

EY knows more neuroscience than me (I know very little) but here's a 5-min brainstorm of ideas:

--For a fixed compute budget, spend more of it on neurons associated with higher-level thought (the neocortex?) and less of it on neurons associated with e.g. motor control or vision.

--Assuming we are an upload of some sort rather than a physical brain, tinker with the rules a bit so that e.g. neuron waste products get magically deleted instead of having to be pumped out, neurons never run out of energy/oxygen and need to rest, etc. Study situations where you are in "peak performance" or "flow" and then explore ways to make your brain enter those states at will.

--Use ML pruning techniques to cut away neurons that aren't being useful, to get slightly crappier mini-Eliezers that cost 10% the compute. These can then automate away 90% of your cognition, saving you enough compute that you can either think a few times faster or have a few copies running in parallel.

--Build automated tools that search through your brain for circuits that are doing something pretty simple, like a giant OR gate or an oscillator, and then replace those circuits with small bits of code, thereby saving significant compute. If anything goes wrong, no worries, just revert to backup.
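A back-of-the-envelope sketch of the pruning idea, under one possible reading of it: 90% of cognition is handed to a pruned copy that costs 10% as much per unit of work, and the remaining 10% still runs at full cost. The function and parameter names are hypothetical, and the numbers are just the ones from the brainstorm.

```python
def relative_cost(automated_fraction: float, mini_cost: float) -> float:
    """Total compute cost relative to running everything on the full brain."""
    full_part = 1.0 - automated_fraction       # work still done at full cost
    cheap_part = automated_fraction * mini_cost  # work done by the pruned copy
    return full_part + cheap_part


cost = relative_cost(automated_fraction=0.9, mini_cost=0.1)
print(f"relative cost: {cost:.2f}, speedup at fixed budget: {1 / cost:.1f}x")
```

On this reading the total cost drops to about 19% of the original, so a fixed compute budget buys roughly five times as much thinking, consistent with "a few times faster or a few copies in parallel".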

This was a fun exercise!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on LCDT, A Myopic Decision Theory · 2021-11-22T12:11:18.855Z · LW · GW

Myopia is the property of a system to not plan ahead, to not think too far about the consequences of its actions, and to do the obvious best thing in the moment instead of biding its time.

This seems inconsistent with how you later use the term. Don't you nowadays say that we could have a myopic imitator of HCH, or even a myopic Evan-imitator? But such a system would need to think about the long-term consequences of its actions in order to imitate HCH or Evan, since HCH / Evan would be thinking about those things.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What exactly is GPT-3's base objective? · 2021-11-22T11:30:36.023Z · LW · GW

Why do you choose answer 3 instead of answer 4? In some sense answer 3 is the random weights that the developers intended, but answer 4 is what actually happened.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ngo and Yudkowsky on alignment difficulty · 2021-11-19T10:09:02.338Z · LW · GW

To be clear I think I agree with your overall position. I just don't think the argument you gave for it (about bureaucracies etc.) was compelling.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Ngo and Yudkowsky on alignment difficulty · 2021-11-17T11:25:57.897Z · LW · GW

[Notes mostly to myself, not important, feel free to skip]

My hot take overall is that Yudkowsky is basically right but doing a poor job of arguing for the position. Ngo is very patient and understanding.

"it doesn't seem implausible to me that we build AIs that are significantly more intelligent (in the sense of being able to understand the world) than humans, but significantly less agentic." --Ngo

"It is likely that, before the point where AGIs are strongly superhuman at seeking power, they will already be strongly superhuman at understanding the world, and at performing narrower pivotal acts like alignment research which don’t require as much agency (by which I roughly mean: large-scale motivations and the ability to pursue them over long timeframes)." --Ngo

"So it is legit harder to point out "the consequentialist parts of the cat" by looking for which sections of neurology are doing searches right there. That said, to the extent that the visual cortex does not get tweaked on failure to catch a mouse, it's not part of that consequentialist loop either." --Yudkowsky

"But the answer is that some problems are difficult in that they require solving lots of subproblems, and an easy way to solve all those subproblems is to use patterns which collectively have some coherence and overlap, and the coherence within them generalizes across all the subproblems. Lots of search orderings will stumble across something like that before they stumble across separate solutions for lots of different problems." --Yudkowsky

This is really making me want to keep working on my+Ramana's sequence on agency! :)

Okay, so one claim is that something like deontology is a fairly natural way for minds to operate.
("If that were true," he thought at once, "bureaucracies and books of regulations would be a lot more efficient than they are in real life.")

I think I disagree with Yudkowsky here? I almost want to say "the opposite is true; if people were all innately consequentialist then we wouldn't have so many blankfaces and bureaucracies would be a lot better because the rules would just be helpful guidelines." Or "Sure but books of regulations work surprisingly well, well enough that there's gotta be some innate deontology in humans." Or "Have you conversed with normal humans about ethics recently? If they are consequentialists they are terrible at it."

As such, on the Eliezer view as I understand it, we can see ourselves as asking for a very unnatural sort of object: a path-through-the-future that is robust enough to funnel history into a narrow band in a very wide array of circumstances, but somehow insensitive to specific breeds of human-initiated attempts to switch which narrow band it's pointed towards.

I think this is a great paragraph. It's a concise and reasonably accurate description of (an important part of) the problem.

I do think it, and this whole discussion, focuses too much on plans and not enough on agents. It's good for illustrating how the problem arises even in a context where we have some sort of oracle that gives us a plan and then we carry it out... but realistically our situation will be more dire than that because we'll be delegating to autonomous AGI agents. :(

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Attempted Gears Analysis of AGI Intervention Discussion With Eliezer · 2021-11-17T10:36:22.495Z · LW · GW

There are fates worse than 1. Fortunately they aren't particularly likely, but they are scary nonetheless.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised · 2021-11-16T11:03:33.360Z · LW · GW

Ah right, thanks!

How well do you think it would generalize? Like, say we made it 1000x bigger and trained it on 100x more training data, but instead of 1 game for 100x longer it was 100 games? Would it be able to do all the games? Would it be better or worse than models specialized to particular games, of similar size and architecture and training data length?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised · 2021-11-15T15:15:56.874Z · LW · GW
They train for 220k steps for each agent and mention that 100k steps takes 7 hours on 4 GPUs (no mention of which gpus, but maybe RTX3090 would be a good guess?)

Holy cow, am I reading that right? RTX3090 costs, like, $2000. So they were able to train this whole thing for about one day's worth of effort using equipment that cost less than $10K in total? That means there's loads of room to scale this up... It means that they could (say) train a version of this architecture with 1000x more parameters and 100x more training data for about $10M and 100 days. Right?
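To sanity-check the arithmetic above, here is a rough back-of-envelope sketch. It rests on loud assumptions that are not in the paper: the GPUs are RTX 3090s at about $2000 each, training compute scales linearly with (parameters × training data), and adding hardware gives a perfectly linear speedup. The function names are hypothetical, made up for illustration.

```python
# Back-of-envelope check of the training-cost estimate in the comment above.
# Assumptions (not from the paper): $2000 per GPU, compute scales linearly with
# parameters * data, and extra hardware parallelizes perfectly.

STEPS_PER_AGENT = 220_000      # quoted: 220k training steps per agent
HOURS_PER_100K_STEPS = 7       # quoted: 100k steps takes 7 hours on 4 GPUs
GPUS = 4
GPU_PRICE_USD = 2_000          # assumed price of one RTX 3090


def baseline_hours() -> float:
    """Wall-clock hours to train one agent on the 4-GPU setup."""
    return STEPS_PER_AGENT / 100_000 * HOURS_PER_100K_STEPS


def baseline_hardware_cost() -> int:
    """Purchase cost of the assumed 4-GPU setup, in USD."""
    return GPUS * GPU_PRICE_USD


def scaled_days(param_factor: float, data_factor: float,
                hardware_budget_usd: float) -> float:
    """Wall-clock days for a scaled-up run under the linear-scaling assumptions."""
    compute_factor = param_factor * data_factor
    hardware_factor = hardware_budget_usd / baseline_hardware_cost()
    return baseline_hours() * compute_factor / hardware_factor / 24


if __name__ == "__main__":
    print(f"baseline: {baseline_hours():.1f} hours on ${baseline_hardware_cost()} of GPUs")
    print(f"1000x params, 100x data, $10M of GPUs: "
          f"{scaled_days(1000, 100, 10_000_000):.0f} days")
```

Under these assumptions the baseline comes out to roughly 15 hours on about $8K of GPUs, and the scaled-up run to roughly 50 days on $10M of hardware, which is the same order of magnitude as the "$10M and 100 days" guess. Real scaling would almost certainly be less efficient than this.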

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What would we do if alignment were futile? · 2021-11-15T09:39:58.906Z · LW · GW

When I look at the world today, it really doesn't seem like a ship steered by evolution. (Instead it is a ship steered by no one, chaotically drifting.) Maybe if there is economic and technological stagnation for ten thousand years, then maybe evolution will get back in the driver's seat and continue the long slow process of aligning humans... but I think that's very much not the most probable outcome.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Comments on Carlsmith's “Is power-seeking AI an existential risk?” · 2021-11-14T11:25:23.945Z · LW · GW

Thanks for putting this stuff online!

FWIW I agree with Nate (and my opinions were largely independent, having read the report and written a response before seeing this). Happy to discuss with anyone interested.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why I'm excited about Redwood Research's current project · 2021-11-13T14:07:45.324Z · LW · GW

Nice. I'm tentatively excited about this... are there any backfire risks? My impression was that the AI governance people didn't know what to push for because of massive strategic uncertainty. But this seems like a good candidate for something they can do that is pretty likely to be non-negative? Maybe the idea is that if we think more we'll find even better interventions and political capital should be conserved until then?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why I'm excited about Redwood Research's current project · 2021-11-13T01:16:10.390Z · LW · GW

This is helpful, thanks!

In my ideal world those labs would have large “adversarial evaluation departments” that try extremely hard to find inputs (or random seeds, or “pseudo” inputs) where a powerful model attempts to deliberately cause harm, or do anything that even vaguely smells like causing harm or deliberately undermining safety measures, or trying to deceptively hide their capabilities, or etc. ... This won’t be enough on its own to be confident that models don’t do anything bad, and ideally this would be just one piece of a machine that created much more confidence, but it does make life meaningfully harder for a deceptively aligned model looking to cause trouble.

Our current world seems very far from this ideal world. As you know I have 10-year timelines. Do you think something like this ideal world may be realized by then? Do you think the EA community, perhaps the AI governance people, could bring about this world if we tried?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What exactly is GPT-3's base objective? · 2021-11-11T10:24:27.677Z · LW · GW

Yes. I have the intuition that training stories will make this problem worse. But I don't think my intuition on this matter is trustworthy (what experience do I have to base it on?) so don't worry about it. We'll try it and see what happens.

(to explain the intuition a little bit: With inner/outer alignment, any would-be AGI creator will have to face up to the fact that they haven't solved outer alignment, because it'll be easy for a philosopher to find differences between the base objective they've programmed and True Human Values. With training stories, I expect lots of people to be saying more sophisticated versions of "It just does what I meant it to do, no funny business.")

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What if memes are common in highly capable minds? · 2021-11-11T10:16:13.172Z · LW · GW

I don't understand, can you elaborate / unpack that?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What exactly is GPT-3's base objective? · 2021-11-10T12:00:29.241Z · LW · GW

I was wondering if that was the case, haha. Thanks!

This is unfortunate, no? The AI safety community had this whole thing going with mesa-optimization and whatnot... now you propose to abandon the terminology and shift to this new frame? But what about all the people using the old terminology? Is the old terminology unsalvageable?

I do like your new thing and it seems better to me in some ways, but worse in others. I feel like I expect a failure mode where people exploit ambiguity and norm-laden concepts to convince themselves of happy fairy tales. I should think more about this and write a comment.

ETA: Here's an attempt to salvage the original inner/outer alignment problem framing:

We admit up front that it's a bit ambiguous what the base objective is, and thus there will be cases where it's ambiguous whether a mesa-optimizer is aligned to the base objective.

However, we say this isn't a big deal. We give a handful of examples of "reasonable construals" of the base objective, like I did in the OP, and say that all the classic arguments are arguments for the plausibility of cases where a mesa-optimizer is misaligned with every reasonable construal of the base objective.

Moreover, we make lemonade out of lemons, and point out that the fact that there are multiple reasonable construals is itself reason to think inner alignment problems are serious and severe. I'm imagining an interlocutor who thinks "bah, it hasn't been established yet that inner-alignment problems are even a thing; it still seems like the default hypothesis is that you get what you train for, i.e. you get an agent that is trying to maximize predictive accuracy or whatever." And then we say "Oh? What exactly is it trying to maximize? Predictive accuracy full stop? Or predictive accuracy conditional on dataset D? Or is it instead trying to maximize reward, in which case it'd hack its reward channel if it could? Whichever one you think it is, would you not agree that it's plausible that it might instead end up trying to maximize one of the other ones?"

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Persuasion Tools: AI takeover without AGI or agency? · 2021-11-09T12:19:22.217Z · LW · GW

Thanks for this! Re: it's not really about AI, it's about memetics & ideologies: Yep, totally agree. (The OP puts the emphasis on the memetic ecosystem & thinks of persuasion tools as a change in the fitness landscape. Also, I wrote this story a while back.) What follows is a point-by-point response:

The most attractive values given a new technological/social situation are likely to be similar to those given the immediately preceding situation, so I'd generally expect the most attractive values to generally be endemic anyway or close enough to endemic values that they don't look like they are coming out of left field.

Maybe? I am not sure memetic evolution works this fast though. Think about how biological evolution doesn't adapt immediately to changes in environment, it takes thousands of years at least, arguably millions depending on what counts as "fully adapted" to the new environment. Replication times for memes are orders of magnitude faster, but that just means it should take a few orders of magnitude less time... and during e.g. a slow takeoff scenario there might just not be that much time. (Disclaimer: I'm ignorant of the math behind this sort of thing). Basically, as tech and economic progress speeds up but memetic evolution stays constant, we should expect there to be some point where the former outstrips the latter and the environment is changing faster than the attractive-memes-for-the-environment can appear and become endemic. Now of course memetic evolution is speeding up too, but the point is that until further argument I'm not 100% convinced that we aren't already out-of-equilibrium.

And of course for any given zero-sum conflict and any given human, one of the participants in that conflict would prefer to push the human towards more attractive values, so they would be introduced even if not initially endemic.

Not sure this argument works. First of all, very few conflicts are actually zero sum. Usually there are some world-states that are worse by both players' lights than some other world-states. Humans being in the most attractive memetic state may be like this.

I don't think you can get paperclips this way, because people trying to get humans to maximize paperclips would be at a big disadvantage in memetic competition compared with the most attractive values (or even compared to more normal human values, which are presumably more attractive than random stuff).


Then the usual hope is that we are happy with attractive values, e.g. because deliberation and intentional behavior by humans makes "smarter" forms of current values more attractive relative to random bad stuff. And your concern is basically that under distributional shift, why should we think that?

Agreed. I would add that even without distributional shift it is unclear why we should expect attractive values to be good. (Maybe the idea is that good = current values because moral antirealism, and current values are the attractive ones for the current environment via the argument above? I guess I'd want that argument spelled out more and the premises argued for.)

Or perhaps more clearly: if which values are "most attractive" depends on features of the technological landscape, then it's hard to see why we should be happy just to "take the hand we're dealt" and be happy with the values that are most attractive on some default technological trajectory. Instead, we would end up with preferences over the technological trajectory.


This is not really distinctive to persuasion, it applies just as well to any changes in the environment that would change the process of deliberation/discussion. The hypothesis seems to be that "how good humans are at persuasion" is just a particularly important/significant kind of shift.

Yes? I think it's particularly important for reasons discussed in the "speculation" section, and because it seems to be in our immediate future and indeed our present. Basically, persuasion tools make ideologies (:= a particular kind of memeplex) stronger and stickier, and they change the landscape so that the ideologies that control the tech platforms have a significant advantage.

But it seems like what really matters is some ratio between how good you are at persuasion and how good you are at other skills that shape the future (or else perhaps you should be much more concerned about other increases in human capability, like education, that make us better at arguing). And in this sense it's less clear whether AI is better or worse than the status quo. I guess the main thing is that it's a candidate for a sharp distributional change and so that's the kind of thing that you would want to be unusually cautious about.

Has education increased much recently? Not in a way that's made us significantly more rational as a group, as far as I can tell. Changes in the US education system over the last 20 years presumably made some difference, but they haven't exactly put us on a bright path towards rational discussion of important issues. My guess is that the effect size is swamped by larger effects from the Internet.

I mostly think the most robust thing is that it's reasonable to be very interested in the trajectory of values, to think about how much you like the process of deliberation and discourse and selection and so on that shapes those values, and to think of changes as potentially irreversible (since future people would have no interest in reversing them).
The usual response to this argument is that perhaps future values are basically unrelated to present values anyway (since they will also converge to whatever values are most attractive given future technological situations). But this seems relatively unpersuasive because eventually you might expect to have many agents who try to deliberately make the future good rather than letting what happens, happen, and that this could eventually drive the rate of drift to 0. This seems fairly likely to happen eventually, but you might think that it will take long enough that existing value changes will still wash out.
Then we end up with a complicated set of moral / decision-theoretic questions about which values we are happy enough with. It's not really clear to me how you should feel about variation across humans, or across cultures, or for humans in new technological situations, or for a particular kind of deep RL, or what. It seems quite clear that we should care some, and I think given realistic treatments of moral uncertainty you should not care too much more about preventing drift than about preventing extinction given drift (e.g. 10x seems very hard to justify to me). But it generally seems like one of the more pressing questions in moral philosophy, and even if you care equally about those two things (suggesting that you'd value some drifted future population's values 50% as much as some kind of hypothetical ideal realization) you could still get much more traction by trying to prevent forms of drift that we don't endorse.

I agree that way of thinking about it seems useful and worthwhile. Are you also implying that thinking specifically about the effects of persuasion tools is not so useful or worthwhile?

I should say btw that you've been talking about values but I meant to talk about beliefs as well as values. Memes, in general. Beliefs can get feedback from reality more easily and thus hopefully the attractive beliefs are more likely to be good than the attractive values. But even so, there is room to wonder whether the attractive beliefs for a given environment will all be true... so far, for example, plenty of false beliefs seem to be pretty attractive...

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Persuasion Tools: AI takeover without AGI or agency? · 2021-11-08T22:48:40.565Z · LW · GW

To elaborate on this idea a bit more:

If a very persuasive agent AGI were to take over the world by persuading humans to do its bidding (e.g. maximize paperclips), this would count as an AI takeover scenario. The boots on the ground, the "muscle," would be human. And the brains behind the steering wheels and control panels would be human. And even the brains behind the tech R&D, the financial management, etc. -- even they would be human! The world would look very human and it would look like it was just one group of humans conquering the others. Yet it would still be fair to say it was an AI takeover... because the humans are ultimately controlled by, and doing the bidding of, the AGI.

OK, now what if it isn't an agent AGI at all? What if it's just a persuasion tool, and the humans (stupidly) used it on themselves, e.g. as a joke they program the tool to persuade people to maximize paperclips, and they test it on themselves, and it works surprisingly well, and in a temporary fit of paperclip-maximization the humans decide to constantly use the tool on themselves & upgrade it, thus avoiding "value drift" away from paperclip-maximization... Then we have a scenario that looks very similar to the first scenario, with a growing group of paperclip-maximizing humans conquering the rest of the world, all under the control of an AI, except that whereas in the first scenario the muscle, steering, and R&D were done by humans rather than AI, in this scenario the "agenty bits" such as planning and strategic understanding are also done by humans! It still counts as an AI takeover, I say, because an AI is making a group of humans conquer the world and reshape it according to inhuman values.

Of course the second scenario is super unrealistic -- humans won't be so stupid as to use their persuasion tools on themselves, right? Well... they probably won't try to persuade themselves to maximize paperclips, and if they did it probably wouldn't work because persuasion tools won't be that effective (at least at first). But some (many?) humans probably WILL use their persuasion tools on themselves, to persuade themselves to be truer, more faithful, more radical believers in whatever ideology they already subscribe to. Persuasion tools don't have to be that powerful to have an effect here; even a single-digit-percentage-point effect size on various metrics would have a big impact, I think, on society.

Persuasion tools will take as input a payload-- some worldview, some set of statements, some set of goals/values -- and then work to create an expanding faction of people who are dogmatically committed to that payload. (The people who are using said tools with said input on themselves.)

I think it's an understatement to say that the vast majority of people who use persuasion tools on themselves in this manner will be imbibing payloads that aren't 100% true and good. Mistakes happen; in the past, even the great philosophers were wrong about some things, surely we are all wrong about some things today, even some things we feel very confident are true/good. I'd bet that it's not merely the vast majority, but literally everyone!

So this situation seems both realistic to me (unfortunately) and also fairly described as a case of AI takeover (though certainly a non-central case. And I don't care about the terminology we use here, I just think it's amusing.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Cortés, Pizarro, and Afonso as Precedents for Takeover · 2021-11-08T12:42:12.239Z · LW · GW

I forgot to give an update: Now I have read a handful of real history books on the subject, and I think the original post still stands strong.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Rob B's Shortform Feed · 2021-11-08T11:20:05.375Z · LW · GW

I don't think so? It's possible that it did and I forgot.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Speaking of Stag Hunts · 2021-11-06T19:56:29.809Z · LW · GW

It's only a yellow flag if you are spending the money. If you are uninvolved and e.g. the Lightcone team is running the show, then it's fine.

(But I have no problem with you doing it either)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Speaking of Stag Hunts · 2021-11-06T19:35:08.521Z · LW · GW
Hire a team of well-paid moderators for a three-month high-effort experiment of responding to every bad comment with a fixed version of what a good comment making the same point would have looked like.  Flood the site with training data.

What's so terrible about this idea? I imagine the main way it could go wrong is not being able to find enough people willing to do it / accidentally having too low a bar and being overwhelmed by moderators who don't know what they are doing and promote the wrong norms. But I feel like there are probably enough people on LW that if you put out a call for applications for a very lucrative position (maybe it would be a part-time position for three months, so people don't have to quit their jobs) and you had a handful of people you trusted (e.g. Lightcone?) running the show, it would probably work.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2021-11-05T14:45:48.977Z · LW · GW

Yeah, this is a map of how philosophy fits together, so it's about ideal agents/minds not actual ones. Though obviously there's some connection between the two.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Transcript: "You Should Read HPMOR" · 2021-11-03T20:24:11.272Z · LW · GW

How did the audience react? Did you get any feedback? Do you think many of them went and read HPMOR? Did they like it?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised · 2021-11-02T11:33:58.050Z · LW · GW

Some basic questions in case anyone knows and wants to help me out:

1. Is this a single neural net that can play all the Atari games well, or a different net for each game?

2. How much compute was spent on training?

3. How many parameters?

4. Would something like this work for e.g. controlling a robot using only a few hundred hours of training data? If not, why not?

5. What is the update / implication of this, in your opinion?

(I did skim the paper and use the search bar, but was unable to answer these questions myself, probably due to lack of expertise)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Fun with +12 OOMs of Compute · 2021-10-30T14:10:41.821Z · LW · GW

Sorry, somehow I missed this. Basically, the answer is that we definitely shouldn't just extrapolate out the AI and compute trend into the future, and Ajeya's and my predictions are not doing that. Instead we are assuming something more like the historic 2 OOMs a decade trend, combined with some amount of increased spending conditional on us being close to AGI/TAI/etc. Hence my conditional claim above:

Conditional on +6 OOMs being enough with 2020's ideas, it'll happen by 2030. Indeed, conditional on +8 OOMs being enough with 2020's ideas, I think it'll probably happen by 2030.
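To make the gap concrete, here is a rough sketch of how many OOMs would have to come from something other than the baseline trend (e.g. increased spending) for those conditional claims to hold. It assumes the 2-OOMs-a-decade trend simply continues from 2020 to 2030; the function name and the choice of start year are hypothetical, picked for illustration.

```python
# Rough arithmetic: OOMs of compute left to cover beyond the baseline trend.
# Assumption: the historic trend of ~2 OOMs per decade holds flat over the window.

def extra_ooms_needed(target_ooms: float,
                      start_year: int = 2020,
                      end_year: int = 2030,
                      trend_ooms_per_decade: float = 2.0) -> float:
    """OOMs that must come from sources other than the baseline trend."""
    baseline = trend_ooms_per_decade * (end_year - start_year) / 10
    return max(0.0, target_ooms - baseline)


if __name__ == "__main__":
    for target in (6, 8):
        gap = extra_ooms_needed(target)
        print(f"+{target} OOMs by 2030: trend covers 2, "
              f"leaving {gap:.0f} OOMs (a {10 ** gap:.0e}x factor) to cover")
```

So under this framing, +6 OOMs by 2030 would need about 4 OOMs beyond the trend, and +8 would need about 6; the conditional claim is a bet that being close to AGI/TAI would motivate enough extra investment to cover that.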

If you want to discuss this more with me, I'd love to, how bout we book a call?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on A very crude deception eval is already passed · 2021-10-30T03:00:52.062Z · LW · GW

Somewhat related thread (which I think was super valuable for me at least, independently): Experimentally evaluating whether honesty generalizes - LessWrong