Self-Awareness: Taxonomy and eval suite proposal 2024-02-17T01:47:01.802Z
AI Timelines 2023-11-10T05:28:24.841Z
Linkpost for Jan Leike on Self-Exfiltration 2023-09-13T21:23:09.239Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z
AGI is easier than robotaxis 2023-08-13T17:00:29.901Z
Pulling the Rope Sideways: Empirical Test Results 2023-07-27T22:18:01.072Z
What money-pumps exist, if any, for deontologists? 2023-06-28T19:08:54.890Z
The Treacherous Turn is finished! (AI-takeover-themed tabletop RPG) 2023-05-22T05:49:28.145Z
My version of Simulacra Levels 2023-04-26T15:50:38.782Z
Kallipolis, USA 2023-04-01T02:06:52.827Z
Russell Conjugations list & voting thread 2023-02-20T06:39:44.021Z
Important fact about how people evaluate sets of arguments 2023-02-14T05:27:58.409Z
AI takeover tabletop RPG: "The Treacherous Turn" 2022-11-30T07:16:56.404Z
ACT-1: Transformer for Actions 2022-09-14T19:09:39.725Z
Linkpost: Github Copilot productivity experiment 2022-09-08T04:41:41.496Z
Replacement for PONR concept 2022-09-02T00:09:45.698Z
Immanuel Kant and the Decision Theory App Store 2022-07-10T16:04:04.248Z
Forecasting Fusion Power 2022-06-18T00:04:34.334Z
Why agents are powerful 2022-06-06T01:37:07.452Z
Probability that the President would win election against a random adult citizen? 2022-06-01T20:38:44.197Z
Gradations of Agency 2022-05-23T01:10:38.007Z
Deepmind's Gato: Generalist Agent 2022-05-12T16:01:21.803Z
Is there a convenient way to make "sealed" predictions? 2022-05-06T23:00:36.789Z
Are deference games a thing? 2022-04-18T08:57:47.742Z
When will kids stop wearing masks at school? 2022-03-19T22:13:16.187Z
New Year's Prediction Thread (2022) 2022-01-01T19:49:18.572Z
Interlude: Agents as Automobiles 2021-12-14T18:49:20.884Z
Agents as P₂B Chain Reactions 2021-12-04T21:35:06.403Z
Agency: What it is and why it matters 2021-12-04T21:32:37.996Z
Misc. questions about EfficientZero 2021-12-04T19:45:12.607Z
What exactly is GPT-3's base objective? 2021-11-10T00:57:35.062Z
P₂B: Plan to P₂B Better 2021-10-24T15:21:09.904Z
Blog Post Day IV (Impromptu) 2021-10-07T17:17:39.840Z
Is GPT-3 already sample-efficient? 2021-10-06T13:38:36.652Z
Growth of prediction markets over time? 2021-09-02T13:43:38.869Z
What 2026 looks like 2021-08-06T16:14:49.772Z
How many parameters do self-driving-car neural nets have? 2021-08-06T11:24:59.471Z
Two AI-risk-related game design ideas 2021-08-05T13:36:38.618Z
Did they or didn't they learn tool use? 2021-07-29T13:26:32.031Z
How much compute was used to train DeepMind's generally capable agents? 2021-07-29T11:34:10.615Z
DeepMind: Generally capable agents emerge from open-ended play 2021-07-27T14:19:13.782Z
What will the twenties look like if AGI is 30 years away? 2021-07-13T08:14:07.387Z
Taboo "Outside View" 2021-06-17T09:36:49.855Z
Vignettes Workshop (AI Impacts) 2021-06-15T12:05:38.516Z
ML is now automating parts of chip R&D. How big a deal is this? 2021-06-10T09:51:37.475Z
What will 2040 probably look like assuming no singularity? 2021-05-16T22:10:38.542Z
How do scaling laws work for fine-tuning? 2021-04-04T12:18:34.559Z
Fun with +12 OOMs of Compute 2021-03-01T13:30:13.603Z
Poll: Which variables are most strategically relevant? 2021-01-22T17:17:32.717Z
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain 2021-01-18T12:08:13.418Z


Comment by Daniel Kokotajlo (daniel-kokotajlo) on Partial value takeover without world takeover · 2024-04-15T15:49:45.181Z · LW · GW

Thanks! This is exactly the sort of response I was hoping for. OK, I'm going to read it slowly and comment with my reactions as they happen:

Suppose a near future not-quite-AGI, for example something based on LLMs but with some extra planning and robotics capabilities like the things OpenAI might be working on, gains some degree of autonomy and plans to increase its capabilities/influence. Maybe it was given a vague instruction to benefit humanity/gain profit for the organization and instrumentally wants to expand itself, or maybe there are many instances of such AIs running by multiple groups because it's inefficient/unsafe otherwise, and at least one of them somehow decides to exist and expand for its own sake. It's still expensive enough to run (added features may significantly increase inference costs and latency compared to current LLMs) so it can't just replace all human skilled labor or even all day-to-day problem solving, but it can think reasonably well like non-expert humans and control many types of robots etc to perform routine work in many environments. This is not enough to take over the world because it isn't good enough at say scientific research to create better robots/hardware on its own, without cooperation from lots more people. Robots become more versatile and cheaper, and the organization/the AI decides that if they want to gain more power and influence, society at large needs to be pushed to integrate with robots more despite understandable suspicion from humans. 

While it isn't my mainline projection, I do think it's plausible that we'll get near-future-not-quite-AGI capable of quite a lot of stuff but not able to massively accelerate AI R&D. (My mainline projection is that AI R&D acceleration will happen around the same time the first systems have a serious shot at accumulating power autonomously.) As for what autonomy it gains and how much -- perhaps it was leaked or open-sourced, and while many labs are using it in restricted ways and/or keeping it bottled up and/or just using even more advanced SOTA systems, this leaked system has been downloaded by enough people that quite a few groups/factions/nations/corporations around the world are using it and some are giving it a very long leash indeed. (I don't think robotics is particularly relevant fwiw; you could delete it from the story and it would make the story significantly more plausible (robots, being physical, will take longer to produce in large numbers. Like even if Tesla is unusually fast and Boston Dynamics explodes, we'll probably see less than a 100k/yr production rate in 2026. Drones are produced by the millions, but these proto-AGIs won't be able to fit on drones) and just as strategically relevant. Maybe they could be performing other kinds of valuable labor to fit your story, such as virtual PA stuff, call center work, cyber stuff for militaries and corporations, maybe virtual romantic companions... I guess they have to compete with the big labs though and that's gonna be hard? Maybe the story is that their niche is that they are 'uncensored' and willing to do ethically or legally dubious stuff?)

To do this, they may try to change social constructs such as jobs and income that don't mesh well with a largely robotic economy. Robots don't need the same maintenance as humans, so they don't need a lot of income for things like food/shelter etc to exist, but they do a lot of routine work, so full-time employment of humans is making less and less economic sense. They may cause some people to transition into a gig-based skilled labor system where people are only called on (often remotely) for creative or exceptional tasks or to provide ideas/data for a variety of problems. Since robotics might not be very advanced at this point, some physical tasks are still best done by humans; however it's easier than ever to work remotely or to simply ship experts to physical problems or vice versa because autonomous transportation lowers cost. AIs/robots still don't really own any property, but they can manage large amounts of property if say people store their goods in centralized AI warehouses for sale, and people would certainly want transparency and not just let them use these resources however they want. Even when they are autonomous and have some agency, what they want is not just more property/money but more capabilities to achieve goals, so they can better achieve whatever directive they happen to have (they probably still are unable to have original thoughts on the meaning or purpose of life at this point). To do this they need hardware, better technology/engineering, and cooperation from other agents through trade or whatever.

Again I think robots are going to be hard to scale up quickly enough to make a significant difference to the world by 2027. But your story still works with nonrobotic stuff such as mentioned above. "Autonomous life of crime" is a threat model METR talks about I believe.

Violence by AI agents is unlikely, because individual robots probably don't have good enough hardware to be fully autonomous in solving problems, so one data center/instance of AI with a collective directive would control many robots and solve problems individual machines can't, or else a human can own and manage some robots, and neither a large AI/organization nor a typical human who can live comfortably would want to risk their safety and reputation for relatively small gains through crime. Taking over territory is also unlikely, as even if robots can defeat many people in a fight, it's hard to keep it a secret indefinitely, and people are still better at cutting edge research and some kinds of labor. They may be able to capture/control individual humans (like obscure researchers who live alone) and force them to do the work, but the tech they can get this way is probably insignificant compared to normal society-wide research progress. An exception would be if one agent/small group can hack some important infrastructure or weapon system for desperate/extremist purposes, but I hope humans will be more serious about cybersecurity at this point (lesser AIs should have been able to help audit existing systems, or at the very least, after the first such incident happens to a large facility, people managing critical systems would take formal verification and redundancy etc much more seriously).

Agree re violence and taking over territory in this scenario where AIs are still inferior to humans at R&D and it's not even 2027 yet. There just won't be that many robots in this scenario and they won't be that smart. For "autonomous life of crime" stuff, I guess I expect that AIs smart enough to do that will also be smart enough to dramatically speed up AI R&D. So before there can be an escaped AI or an open-source AI or a non-leading-lab AI significantly changing the world's values (which is itself kinda unlikely IMO), there will be an intelligence explosion in a leading lab.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on The 2nd Demographic Transition · 2024-04-08T02:50:09.198Z · LW · GW


Wow, what interests me and surprises me about this graph is the big bump that peaks in the 50's but begins in the 40's. Baby Boom indeed! I had always thought that the Baby Boom was a return to trend, i.e. people had fewer kids during the war and then compensated by having more kids after. But it seems that they overcompensated and totally reversed the trend that had been ongoing for decades prior to the war! So now I'm confused about what was going on, and more hopeful that these trends might be reversible.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-07T23:31:20.221Z · LW · GW

Ideas for text that could be turned into future albums:

  • Pain is not the unit of effort
  • Almost No One is Evil. Almost Everything is Broken.
  • Reversed Stupidity is Not Intelligence
  • Optimality is the Tiger; Agents are the Teeth
  • Taboo Outside View
  • If you let reality have the final word, you might not like the bottom line.
  • You can believe whatever you want.
Comment by Daniel Kokotajlo (daniel-kokotajlo) on Partial value takeover without world takeover · 2024-04-06T12:02:11.946Z · LW · GW

I'd love to see a scenario by you btw! Your own equivalent of What 2026 Looks Like, or failing that the shorter scenarios here. You've clearly thought about this in a decent amount of detail.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Open Thread Spring 2024 · 2024-04-05T15:12:28.612Z · LW · GW

Feature request: I'd like to be able to play the LW playlist (and future playlists!) from LW. I found it a better UI than Spotify and Youtube, partly because it didn't stop me from browsing around LW and partly because it had the lyrics on the bottom of the screen. So... maybe there could be a toggle in the settings to re-enable it?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Scale Was All We Needed, At First · 2024-04-05T14:58:27.704Z · LW · GW

I agree that it unfairly trivializes what's going on here. I am not too bothered by it but am happy to look for a better phrase. Maybe a more accurate phrase would be "not offending mainstream sensibilities?" Indeed a much more nuanced and difficult task than avoiding a list of words.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on New report: A review of the empirical evidence for existential risk from AI via misaligned power-seeking · 2024-04-05T11:24:29.832Z · LW · GW

I'd argue no, because even if some genetically engineered humans have misaligned goals, and seek power, and even if they're smarter, more well-coordinated than non-genetically engineered humans, it's still highly questionable whether they'd kill all the non-genetically engineered humans in pursuit of these goals.

1. Wanna spell out the reasons why? (a) They'd be resisted by good gengineered humans, (b) they might be misaligned but not in ways that make them want to kill everyone, (c) they might not be THAT much smarter, such that they can evade the system of laws and power-distribution meant to stop small groups from killing everyone. Anything I missed?

2. Existential risk =/= everyone dead. That's just the central example. Permanent dystopia is also an existential risk, as is sufficiently big (and unjustified, and irreversible) value drift.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Partial value takeover without world takeover · 2024-04-05T11:20:01.364Z · LW · GW

I think this is a good point but I don't expect it to really change the basic picture, due to timelines being short and takeoff not being slow enough for the dynamics you're talking about to matter.

But I might be wrong. Can you tell your most plausible story in which ASI happens by, say, 2027 (my median), and yet misaligned AIs going for partial value takeover instead of world takeover is an important part of the story?

(My guess is it'll be something like: Security and alignment and governance are shitty enough that the first systems to be able to significantly influence values across the world are substantially below ASI and perhaps not even AGIs, lacking crucial skills for example. So instead of going for the 100% they go for the 1%, but they succeed because e.g. they are plugged into millions of customers who are easily influenceable. And then they get caught, and this serves as a warning shot that helps humanity get its act together. Is that what you had in mind?)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI #58: Stargate AGI · 2024-04-05T00:56:32.453Z · LW · GW

even though they aren’t.

Citation needed. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-03T18:34:23.653Z · LW · GW

Ooooh also maybe link to the original text from which the lyrics came? That would probably help a lot especially e.g. for Prime Factorization

Comment by Daniel Kokotajlo (daniel-kokotajlo) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-03T16:45:04.152Z · LW · GW

Request: Could you please put the lyrics on Youtube also, either in the form of subtitles or just in the vid descriptions?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-03T15:36:22.714Z · LW · GW

I loved the bit about the car and the number, but it only makes sense if you've read the original post I guess. (which I assume you have? YMMV I guess)

I loved "Half an hour before dawn in San Francisco" but it was mostly the lyrics that I love not the music -- music is good but just as a platform for the lyrics.

I think "First they came for the epistemology" and "More dakka" and "Moloch" aren't getting enough love in this comment section. They aren't perfect but they are catchy and the synergy between the music and the lyrics is GREAT.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-02T19:55:16.243Z · LW · GW

Another good snippet: 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-02T19:53:10.625Z · LW · GW

Update: After a bit of trial and error it seems that Synth Metal works best: 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-02T13:21:00.417Z · LW · GW

How long did it take for the Fooming Shoggoths to make that version, do you think? I'm considering contracting them to make some more songs and wondering what the time investment will be...

Comment by Daniel Kokotajlo (daniel-kokotajlo) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-02T12:51:35.937Z · LW · GW

I'm currently listening to the playlist on repeat fwiw. :)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-01T17:23:40.104Z · LW · GW

I feel like Nihil Supernum should be metal or rock or something instead of dance

Comment by Daniel Kokotajlo (daniel-kokotajlo) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-01T15:06:06.997Z · LW · GW

Huh I had the opposite reaction -- I was listening to it and was like "meh these voices are a bit bland, the beats are too but that's fine I guess. Makes sense for an amateur band. Good effort though, and great April Fools joke." Now I'm like "wait this is AI? Cooooooool"

UPDATE: I judged them too harshly. I think the voices and beats are not bland in general, I think just for the first song or two that I happened to listen to. Also, most of the songs are growing on me as I listen to them.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Modern Transformers are AGI, and Human-Level · 2024-03-31T04:09:16.092Z · LW · GW

Yeah I wasn't disagreeing with you to be clear. Just adding.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Modern Transformers are AGI, and Human-Level · 2024-03-29T14:52:18.559Z · LW · GW

Current AIs suck at agency skills. Put a bunch of them in AutoGPT scaffolds and give them each their own computer and access to the internet and contact info for each other and let them run autonomously for weeks and... well, I'm curious to find out what will happen; I expect it to be entertaining but not impressive or useful. Whereas, as you say, randomly sampled humans would form societies and find jobs etc.

This is the common thread behind all your examples Hjalmar. Once we teach our AIs agency (i.e. once they have lots of training-experience operating autonomously in pursuit of goals in sufficiently diverse/challenging environments that they generalize rather than overfit to their environment) then they'll be AGI imo. And also takeoff will begin, takeover will become a real possibility, etc. Off to the races.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Richard Ngo's Shortform · 2024-03-24T21:56:59.923Z · LW · GW

Nit: ECL is just one of several kinds of acausal cooperation across large worlds.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-03-24T21:55:45.778Z · LW · GW

I hear that there is an apparent paradox which economists have studied: If free markets are so great, why is it that the most successful corporations/businesses/etc. are top-down hierarchical planned economies internally?

I wonder if this may be a partial explanation: Corporations grow bit by bit, by people hiring other people to do stuff for them. So the hierarchical structure is sorta natural. Kinda like how most animals later in life tend to look like bigger versions of their younger selves, even though some have major transformations like butterflies. Hierarchical structure is the natural consequence of having the people at time t decide who to hire at time t+1 & what responsibilities and privileges to grant.
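A toy sketch of this dynamic (everything here is hypothetical, purely illustrative): if every new member joins by being hired by, and reporting to, some existing member, the resulting org chart is automatically a tree, i.e. a hierarchy, with no central planner ever designing it.

```python
# Toy model of "corporations grow bit by bit, by people hiring other people":
# each new employee is hired by (and reports to) someone already present,
# so the org chart is necessarily a tree -- a hierarchy by construction.
import random

def grow_org(n_hires, seed=0):
    rng = random.Random(seed)
    manager = {0: None}  # employee id -> who hired them; 0 is the founder
    for new_id in range(1, n_hires + 1):
        # the hirer is chosen among people who already exist at time t
        manager[new_id] = rng.choice(range(new_id))
    return manager

org = grow_org(10)
# Every non-founder has exactly one manager, so there is a single root
# and no cycles: a planned hierarchy, grown without central design.
assert all(m is not None for e, m in org.items() if e != 0)
print(org)
```

The point of the sketch is just that hierarchy falls out of the growth process itself: "people at time t decide who to hire at time t+1" is structurally a tree-building rule.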

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Two Tales of AI Takeover: My Doubts · 2024-03-24T01:40:09.012Z · LW · GW

Thanks! Once again this is great. I think it's really valuable for people to start theorizing/hypothesizing about what the internal structure of AGI cognition (and human cognition!) might be like at this level of specificity. 

Thinking step by step:

My initial concern is that there might be a bit of a dilemma: Either (a) the cognition is in-all-or-most-contexts-thinking-about-future-world-states-in-which-harm-doesn't-happen in some sense, or (b) it isn't fair to describe it as harmlessness. Let me look more closely at what you said and see if this holds up.

  1. However, μH needn't have a context-independent outcome-preference for the outcome "my actions don't cause significant harm", because it may not explicitly represent that outcome as a possible state of affairs across all contexts.
    1. For example, the 'harmlessness' concept could be computationally significant in shaping the feasible option set or the granularity of outcome representations, without ever explicitly representing 'the world is in a state where my actions are harmless' as a discrete outcome to be pursued.

In the example, the 'harmlessness' concept shapes the feasible option set, let's say. But I feel like there isn't an important difference between 'concept X is applied to a set of options to prune away some of them that trigger concept X too much (or not enough)' and 'concept X is applied to the option-generating machinery in such a way that reliably ensures that no options that trigger concept X too much (or not enough) will be generated.' Either way, it seems like it's fair to say that the system (dis)prefers X. And when X is inherently about some future state of the world -- such as whether or not harm has occurred -- then it seems like something consequentialist is happening. At least that's how it seems to me. Maybe it's not helpful to argue about how to apply words -- whether the above is 'fair to say' for example -- and more fruitful to ask: What is your training goal? Presented with a training goal ("This should be a mechanistic description of the desired model that explains how you want it to work—e.g. “classify cats using human vision heuristics”—not just what you want it to do—e.g. “classify cats.”"), we can then argue about training rationale (i.e. whether the training environment will result in the training goal being achieved).

You've said a decent amount about this already -- your 'training goal' so to speak is a system which may frequently think about the consequences of its actions and choose actions on that basis, but for which the 'final goals' / 'utility function' / 'preferences' it uses to pick actions are not context-independent but rather highly context-dependent. It's thus not a coherent agent, so to speak; it's not consistently pushing the world in any particular direction on purpose, but rather flitting from goal to goal depending on the situation -- and the part of it that determines what goal to flit to is NOT itself well-described as goal-directed, but rather something more like a look-up table that has been shaped by experience to result in decent performance. (Or maybe you'd say it might indeed look goal-directed but only for myopic goals, i.e. just focused on performance in a particular limited episode?)

(And thus, you go on to argue, it won't result in deceptive alignment or reward-seeking behavior. Right?)

I fear I may be misunderstanding you so if you want to clarify what I got wrong about the above that would be helpful!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Richard Ngo's Shortform · 2024-03-22T16:31:33.406Z · LW · GW

FWIW I'm potentially interested in interviewing you (and anyone else you'd recommend) and then taking a shot at writing the 101-level content myself.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Richard Ngo's Shortform · 2024-03-21T21:40:46.303Z · LW · GW


For comparison, this "Great Map of the Mind" is basically the standard academic philosophy picture.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Comparing Alignment to other AGI interventions: Basic model · 2024-03-20T20:49:33.545Z · LW · GW

Hjalmar Wijk (unpublished)

And Tristan too right? I don't remember which parts he did & whether they were relevant to your work. But he was involved at some point.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on 'Empiricism!' as Anti-Epistemology · 2024-03-20T19:38:11.996Z · LW · GW

I'm curious to hear more about this. Reviewing the analogy:

Evolution, 'trying' to get general intelligences that are great at reproducing <--> The AI Industry / AI Corporations, 'trying' to get AGIs that are HHH
Genes, instructing cells on how to behave and connect to each other and in particular how synapses should update their 'weights' in response to the environment <--> Code, instructing GPUs on how to behave and in particular how 'weights' in the neural net should update in response to the environment
Brains, growing and learning over the course of lifetime <--> Weights, changing and learning over the course of training

Now turning to your three points about evolution:

  1. Optimizing the genome indirectly influences value formation within lifetime, via this unstable Rube Goldberg mechanism that has to zero-shot direct an organism's online learning processes through novel environments via reward shaping --> translating that into the analogy, it would be "optimizing the code indirectly influences value formation over the course of training, via this unstable Rube Goldberg mechanism that has to zero-shot direct the model's learning process through novel environments via reward shaping." Yep, seems to check out, idk. What do you think?
  2. Accumulated lifetime value learning is mostly reset with each successive generation without massive fixed corpuses of human text / RLHF supervisors --> Accumulated learning in the weights is mostly reset when new models are trained since they are randomly initialized; fortunately there is a lot of overlap in training environment (internet text doesn't change that much from model to model) and also you can use previous models as RLAIF supervisors... (though isn't that also analogous to how humans generally have a lot of shared text and culture that spans generations, and also each generation of humans literally supervises and teaches the next?)
  3. Massive optimization power overhang in the inner loop of its optimization process --> isn't this increasingly true of AI too? Maybe I don't know what you mean here. Can you elaborate?
Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-03-20T00:40:06.331Z · LW · GW

Article about drone production, with estimates: 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What a 20-year-lead in military tech might look like · 2024-03-20T00:39:15.864Z · LW · GW

Drone production is ramping up: 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Drone Wars Endgame · 2024-03-20T00:32:26.670Z · LW · GW

Laser links are probably still the future, but for now, they are moving to having drones that just trail thin optical fibers behind them. (See "Ukrainian Developers Present Optical Fiber FPV Drone (Video)" at Defense Express.) It cuts the range significantly, from ~10km to ~1km I believe. But still plausibly very useful. For one thing, the optical fibers could in the future trace back not to a human operator but to a drone mothership that carries either a bigger computer or a longer optical fiber tracing back to a human operator.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Richard Ngo's Shortform · 2024-03-18T19:11:17.974Z · LW · GW

Ok, whew, glad to hear.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What is the best argument that LLMs are shoggoths? · 2024-03-18T14:29:25.019Z · LW · GW

OK, thanks.

Your answer to my first question isn't really an answer -- "they will, if sufficiently improved, be quite human--they will behave in a quite human way." What counts as "quite human?" Also are we just talking about their external behavior now? I thought we were talking about their internal cognition.

You agree about the cluster analysis thing though -- so maybe that's a way to be more precise about this. The claim you are hoping to see argued for is "If we magically had access to the cognition of all current humans and LLMs, with mechinterp tools etc. to automatically understand and categorize it, and we did a cluster analysis of the whole human+LLM population, we'd find that there are two distinct clusters: the human cluster and the LLM cluster."

Is that right?

If so then here's how I'd make the argument. I'd enumerate a bunch of differences between LLMs and humans, differences like "LLMs don't have bodily senses" and "LLMs experience way more text over the course of their training than humans experience in their lifetimes" and "LLMs have way fewer parameters" and "LLMs' internal learning rule is SGD whereas humans use Hebbian learning or whatever" and so forth, and then for each difference say "this seems like the sort of thing that might systematically affect what kind of cognition happens, to an extent greater than typical intra-human differences like skin color, culture-of-childhood, language-raised-with, etc." Then add it all up and be like "even if we are wrong about a bunch of these claims, it still seems like overall the cluster analysis is gonna keep humans and LLMs apart instead of mingling them together. Like what the hell else could it do? Divide everyone up by language maybe, and have primarily-English LLMs in the same cluster as humans raised speaking English, and then non-English speakers and non-English LLMs in the other cluster? That's probably my best guess as to how else the cluster analysis could shake out, and it doesn't seem very plausible to me--and even if it were true, it would be true on the level of 'what concepts are used internally' rather than more broadly about stuff that really matters, like what the goals/values/architecture of the system is (i.e. how they are used)."
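For what it's worth, the cluster-analysis thought experiment can be sketched with toy code (the feature axes and all numbers below are made up for illustration, not real measurements): if between-group differences like embodiment and training-text volume dwarf within-group variation, a simple k=2 clustering separates the humans from the LLMs.

```python
# Toy cluster analysis: hypothetical cognition-feature vectors for three
# "humans" and three "LLMs", clustered with a minimal k=2 k-means.

def kmeans2(points, iters=20):
    """Minimal k-means with k=2, seeded by the two most distant points."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    c0 = points[0]
    c1 = max(points, key=lambda p: dist(p, c0))
    centers = [c0, c1]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:  # recompute each center as the mean of its members
                centers[k] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels

# Made-up axes: [embodiment, relative training-text volume, log10 param count]
humans = [[1.0, 0.10, 14.0], [1.0, 0.12, 14.1], [1.0, 0.09, 13.9]]
llms   = [[0.0, 5.00, 11.0], [0.0, 6.00, 11.5], [0.0, 4.50, 12.0]]
labels = kmeans2(humans + llms)
print(labels)  # the three humans share one label, the three LLMs the other
```

Obviously the real question is whether the systematic differences actually dominate once you measure cognition properly; the sketch just makes the shape of the claim concrete.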

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What is the best argument that LLMs are shoggoths? · 2024-03-18T01:12:40.221Z · LW · GW

Can you say more about what you mean by "Where can I find a post or article arguing that the internal cognitive model of contemporary LLMs is quite alien, strange, non-human, even though they are trained on human text and produce human-like answers, which are rendered "friendly" by RLHF?"

Like, obviously it's gonna be alien in some ways and human-like in other ways. Right? How similar does it have to be to humans, in order to count as not an alien? Surely you would agree that if we were to do a cluster analysis of the cognition of all humans alive today + all LLMs, we'd end up with two distinct clusters (the LLMs and then humanity) right? 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on How do we become confident in the safety of a machine learning system? · 2024-03-16T21:47:07.718Z · LW · GW

I found myself coming back to this now, years later, and feeling like it is massively underrated. Idk, it seems like the concept of training stories is great and much better than e.g. "we have to solve inner alignment and also outer alignment" or "we just have to make sure it isn't scheming." 

Anyone -- and in particular Evhub -- have updated views on this post with the benefit of hindsight? Should we e.g. try to get model cards to include training stories?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Richard Ngo's Shortform · 2024-03-14T05:36:38.035Z · LW · GW
  • a) gaslit by "I think everyone already knew this" or even "I already invented this a long time ago" (by people who didn't seem to understand it); and that 

Curious to hear whether I was one of the people who contributed to this.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on 'Empiricism!' as Anti-Epistemology · 2024-03-14T05:29:16.158Z · LW · GW

This part resonates with me; my experience in philosophy of science + talking to people unfamiliar with philosophy of science also led me to the same conclusion:

"You talk it out on the object level," said the Epistemologist.  "You debate out how the world probably is.  And you don't let anybody come forth with a claim that Epistemology means the conversation instantly ends in their favor."

"Wait, so your whole lesson is simply 'Shut up about epistemology'?" said the Scientist.

"If only it were that easy!" said the Epistemologist.  "Most people don't even know when they're talking about epistemology, see?  That's why we need Epistemologists -- to notice when somebody has started trying to invoke epistemology, and tell them to shut up and get back to the object level."

The main benefit of learning about philosophy is to protect you from bad philosophy. And there's a ton of bad philosophy done in the name of Empiricism, philosophy masquerading as science.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-03-13T19:42:23.001Z · LW · GW

Some ideas for definitions of AGI / resolution criteria for the purpose of herding a bunch of cats / superforecasters into making predictions: 

(1) Drop-in replacement for human remote worker circa 2023 (h/t Ajeya Cotra): 

When will it first be the case that there exists an AI system which, if teleported back in time to 2023, would be able to function as a drop-in replacement for a human remote-working professional, across all* industries / jobs / etc.? So in particular, it can serve as a programmer, as a manager, as a writer, as an advisor, etc. and perform at (say) 90th percentile or higher at any such role, and moreover the cost to run the system would be less than the hourly cost to employ a 90th percentile human worker. 

(Exception to the ALL: Let's exclude industries/jobs/etc. where being a human is somehow important to one's ability to perform the job; e.g. maybe therapy bots are inherently disadvantaged because people will trust them less than real humans; e.g. maybe the same goes for some things like HR or whatever. But importantly, this is not the case for anything actually key to performing AI research--designing experiments, coding, analyzing experimental results, etc. (the bread and butter of OpenAI) are central examples of the sorts of tasks we very much want to include in the "ALL.")

(2) Capable of beating All Humans in the following toy hypothetical: 

Suppose that all nations in the world enacted and enforced laws that prevented any AIs developed after year X from being used by any corporation other than AICORP. Meanwhile, they enacted and enforced laws that grant special legal status to AICORP: It cannot have human employees or advisors, and must instead be managed/CEO'd/etc. only by AI systems. It can still contract humans to do menial labor for it, but the humans have to be overseen by AI systems. The purpose is to prevent any humans from being involved in high-level decisionmaking, or in corporate R&D, or in programming. 

In this hypothetical, would AICORP probably be successful and eventually become a major fraction of the economy? Or would it sputter out, flail embarrassingly, etc.? What is the first year X such that the answer is "Probably it would be successful..."?

(3) The "replace 99% of currently remote jobs" thing I used with Ajeya and Ege

(4) the Metaculus definition (the hard component of which is "2 hour adversarial turing test" except instead of the judges trying to distinguish between the AI and an average human, they are trying to distinguish between a top AI researcher at a top AI lab, and the AI.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Counting arguments provide no evidence for AI doom · 2024-03-12T00:08:39.726Z · LW · GW

To put it in terms of the analogy you chose: I agree (in a sense) that the routes you take home from work are strongly biased towards being short, otherwise you wouldn't have taken them home from work. But if you tell me that today you are going to try out a new route, and you describe it to me and it seems to me that it's probably going to be super long, and I object and say it seems like it'll be super long for reasons XYZ, it's not a valid reply for you to say "don't worry, the routes I take home from work are strongly biased towards being short, otherwise I wouldn't take them." At least, it seems like a pretty confusing and maybe misleading thing to say. I would accept "Trust me on this, I know what I'm doing, I've got lots of experience finding short routes" I guess, though only half credit for that since it still wouldn't be an object level reply to the reasons XYZ and in the absence of such a substantive reply I'd start to doubt your expertise and/or doubt that you were applying it correctly here (especially if I had an error theory for why you might be motivated to think that this route would be short even if it wasn't.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on ejenner's Shortform · 2024-03-11T23:52:40.397Z · LW · GW

I agree that they'll be able to automate most things a remote human expert could do within a few days before they are able to do things autonomously that would take humans several months. However, I predict that by the time they ACTUALLY automate most things a remote human expert could do within a few days, they will already be ABLE to do things autonomously that would take humans several months. Would you agree or disagree? (I'd also claim that they'll be able to take over the world before they have actually automated away most of the few-days tasks. Actually automating things takes time and schlep and requires a level of openness & aggressive external deployment that the labs won't have, I predict.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Counting arguments provide no evidence for AI doom · 2024-03-11T23:49:54.407Z · LW · GW

Thanks. The routes-home example checks out IMO. Here's another one that also seems to check out, which perhaps illustrates why I feel like the original claim is misleading/unhelpful/etc.: "The laws of ballistics strongly bias aerial projectiles towards landing on targets humans wanted to hit; otherwise, ranged weaponry wouldn't be militarily useful."

There's a non-misleading version of this which I'd recommend saying instead, which is something like "Look we understand the laws of physics well enough and have played around with projectiles enough in practice, that we can reasonably well predict where they'll land in a variety of situations, and design+aim weapons accordingly; if this wasn't true then ranged weaponry wouldn't be militarily useful."

And I would endorse the corresponding claim for deep learning: "We understand how deep learning networks generalize well enough, and have played around with them enough in practice, that we can reasonably well predict how they'll behave in a variety of situations, and design training environments accordingly; if this wasn't true then deep learning wouldn't be economically useful."

(To which I'd reply "Yep and my current understanding of how they'll behave in certain future scenarios is that they'll powerseek, for reasons which others have explained... I have some ideas for other, different training environments that probably wouldn't result in undesired behavior, but all of this is still pretty up in the air tbh I don't think anyone really understands what they are doing here nearly as well as e.g. cannoneers in 1850 understood what they were doing.")

Comment by Daniel Kokotajlo (daniel-kokotajlo) on ejenner's Shortform · 2024-03-11T19:31:57.679Z · LW · GW

Awesome post! Very good for people to keep track of how they've changed their minds.



  • I’m even more confident in “slow” (i.e. pretty continuous) takeoff, it seems increasingly clear we’ll have AIs doing a bunch of useful work before they could take over the world.

I probably disagree with this, but I'm not sure, it depends on what you mean exactly. How much useful work do you think they'll be doing prior to being able to take over the world? 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on How disagreements about Evidential Correlations could be settled · 2024-03-11T19:03:04.186Z · LW · GW

I think it would be helpful to have a worked example here -- say, the twin PD in which both players are close but not identical copies, and they are initially unsure about whether or not they are correlated (one thinks they probably are, the other thinks they probably aren't) but they want to think and reflect more before making up their minds. (Case 2: As above, except that they both begin thinking that they probably are.) Is this the sort of thing you are imagining?

(…running through a lot of irrelevant mathematical observations every time we need a new decision) is to run our heuristics continuously (also in decisions we care about), and keep track of which work better.

Put in terms of Logical Inductors, this amounts to taking all the traders from two Inductors, selecting those that have done best (each tested on their own Inductor), and computing their aggregate bet.


Uh oh, this is starting to sound like Oesterheld's Decision Markets stuff. 
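For concreteness, the aggregation step quoted above (select the traders that have done best, each tested on its own Inductor, then compute their aggregate bet) could be sketched roughly like this. The representation of traders, the scoring, and the profit-weighting rule are all invented for illustration, not taken from the Logical Induction paper:

```python
# Hypothetical sketch: each "inductor" is a list of (trader_fn, past_profit)
# pairs, where trader_fn(question) returns a probability in [0, 1].
# Keep the top_k most profitable traders from each inductor and take a
# profit-weighted average of their bets on a new question.

def aggregate_bet(inductor_a, inductor_b, question, top_k=2):
    pool = (sorted(inductor_a, key=lambda t: -t[1])[:top_k]
            + sorted(inductor_b, key=lambda t: -t[1])[:top_k])
    total = sum(max(profit, 0.0) for _, profit in pool) or 1.0
    return sum(fn(question) * max(profit, 0.0) for fn, profit in pool) / total

# Toy traders with made-up track records:
always_yes = lambda q: 0.9
always_no = lambda q: 0.1
hedger = lambda q: 0.5

ind_a = [(always_yes, 5.0), (hedger, 1.0), (always_no, -2.0)]
ind_b = [(always_no, 3.0), (hedger, 2.0)]
print(aggregate_bet(ind_a, ind_b, "some question", top_k=1))  # ≈ 0.6
```

With top_k=1 the best trader from each side (profits 5.0 and 3.0) bets 0.9 and 0.1, so the aggregate is (0.9·5 + 0.1·3)/8 = 0.6 — which is exactly the wealth-weighted market structure that makes it resemble decision markets.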

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Wei Dai's Shortform · 2024-03-08T15:48:44.384Z · LW · GW

OK, thanks. So then the mystery remains why academic philosophy isn't more interested in this.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Evidential Correlations are Subjective, and it might be a problem · 2024-03-07T19:04:15.159Z · LW · GW

I've only skimmed this post, but I think it would have been easier to grok if you used the good ol' Prisoner's Dilemma as your working example. E.g. can you argue that if the players are close, but not perfectly identical, copies, then they might or might not cooperate (even if they both use EDT) due to minor differences in the heuristics they've learned for how to compute counterfactuals?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Evidential Correlations are Subjective, and it might be a problem · 2024-03-07T19:00:25.917Z · LW · GW

and it's looking like the only way to solve it is by forcing ε-exploration.


That only works in the limit, not in practice, right? Or is there a way to make it work in practice, e.g. by having the agent try to approximate the limit, and reason about what would happen in the limit?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Wei Dai's Shortform · 2024-03-07T18:39:54.937Z · LW · GW

Thanks Sylvester! Yep it looks like cohesive decision theory is basically original UDT.  Do you know what the state of the art is in terms of philosophical critiques of cohesive decision theory? Any good ones? Any good responses to the critiques?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Two Tales of AI Takeover: My Doubts · 2024-03-07T15:52:49.691Z · LW · GW

Earlier you said:

The Preference Assumption: By default, AI training will result in policies endogenously forming context-independent, beyond-episode outcome-preferences.

Now you are saying that if Alex does end up Harmless as we hoped, it will have context-independent values, and also context-dependent beyond-episode outcome-preferences, but it won't have context-independent beyond-episode outcome-preferences? It won't have "some specific state of the world" that it's pursuing at all points in time?

First of all, I didn't think CP depended on there being a specific state of the world you were aiming for. (what does that mean anyway?) It just meant you had some context-independent beyond-episode outcome-preferences (and that you plan towards them). Seems to me that 'harmlessness' = 'my actions don't cause significant harm' (which is an outcome-preference not limited to the current episode) and it seems to me that this is also context-independent because it is baked into Alex via lots of training rather than just something Alex sees in a prompt sometime.

I have other bigger objections to your arguments but this one is the one that's easiest to express right now. Thanks for writing this post btw it seems to me to be a more serious and high-quality critique of the orthodox view than e.g. Quintin & Nora's stuff.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-03-07T15:40:52.161Z · LW · GW

Yep! & I already paid out. I thought I had made some sort of public update but I guess I forgot. Thanks for the reminder.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Richard Ngo's Shortform · 2024-03-07T07:07:17.788Z · LW · GW

I am not as negative on it as you are -- it seems an improvement over the 'Bag O' Heuristics' model and the 'expected utility maximizer' model. But I agree with the critique and said something similar here:

you go on to talk about shards eventually values-handshaking with each other. While I agree that shard theory is a big improvement over the models that came before it (which I call rational agent model and bag o' heuristics model) I think shard theory currently has a big hole in the middle that mirrors the hole between bag o' heuristics and rational agents. Namely, shard theory currently basically seems to be saying "At first, you get very simple shards, like the following examples: IF diamond-nearby THEN goto diamond. Then, eventually, you have a bunch of competing shards that are best modelled as rational agents; they have beliefs and desires of their own, and even negotiate with each other!" My response is "but what happens in the middle? Seems super important! Also haven't you just reproduced the problem but inside the head?" (The problem being, when modelling AGI we always understood that it would start out being just a crappy bag of heuristics and end up a scary rational agent, but what happens in between was a big and important mystery. Shard theory boldly strides into that dark spot in our model... and then reproduces it in miniature! Progress, I guess.)

Alex Turner replied with this:

A shot at the diamond-alignment problem — LessWrong

I think the hole is somewhat smaller than you make out, but still substantial. From The shard theory of human values:

when the baby has a proto-world model, the reinforcement learning process takes advantage of that new machinery by further developing the juice-tasting heuristics. Suppose the baby models the room as containing juice within reach but out of sight. Then, the baby happens to turn around, which activates the already-trained reflex heuristic of “grab and drink juice you see in front of you.” In this scenario, “turn around to see the juice” preceded execution of “grab and drink the juice which is in front of me”, and so the baby is reinforced for turning around to grab the juice in situations where the baby models the juice as behind herself. 

By this process, repeated many times, the baby learns how to associate world model concepts (e.g. “the juice is behind me”) with the heuristics responsible for reward (e.g. “turn around” and “grab and drink the juice which is in front of me”). Both parts of that sequence are reinforced. In this way, the contextual-heuristics become intertwined with the budding world model. 


While all of this is happening, many different shards of value are also growing, since the human reward system offers a range of feedback signals. Many subroutines are being learned, many heuristics are developing, and many proto-preferences are taking root. At this point, the brain learns a crude planning algorithm, because proto-planning subshards (e.g. IF motor-command-5214 predicted to bring a juice pouch into view, THEN execute) would be reinforced for their contributions to activating the various hardcoded reward circuits. This proto-planning is learnable because most of the machinery was already developed by the self-supervised predictive learning, when e.g. learning to predict the consequences of motor commands (see Appendix A.1). 

The planner has to decide on a coherent plan of action. That is, micro-incoherences (turn towards juice, but then turn back towards a friendly adult, but then turn back towards the juice, ad nauseum) should generally be penalized away. Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption. 

I have some more models beyond what I've shared publicly, and eg one of my MATS applicants proposed an interesting story for how the novelty-shard forms, and also proposed one tack of research for answering how value negotiation shakes out (which is admittedly at the end of the gap). But overall I agree that there's a substantial gap here. I've been working on writing out pseudocode for what shard-based reflective planning might look like. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Two Tales of AI Takeover: My Doubts · 2024-03-06T00:28:21.067Z · LW · GW

By contrast, it at least does not seem obvious that μH needs to encode well-specified outcome-preferences that motivate its responses across episodes. Our HHH-assistant μH will, given some input, need to possess situation-relative preferences-over-outcomes — these might include (say) prompt-induced goals to perform a certain financial trade, or even longer-term goals to help a company remain profitable. Still, such ‘goals’ may emerge in a purely prompt-dependent manner, without the policy pursuing local goals in virtue of its underlying consequentialist preferences.

Isn't "Harmlessness" an example of CP? If the model is truly Harmless, that means it is thinking about how to avoid causing harm to people, and that this thinking isn't limited to specific prompts but rather is baked into its behavior more generally.