Posts

Why Don't We Just... Shoggoth+Face+Paraphraser? 2024-11-19T20:53:52.084Z
Self-Awareness: Taxonomy and eval suite proposal 2024-02-17T01:47:01.802Z
AI Timelines 2023-11-10T05:28:24.841Z
Linkpost for Jan Leike on Self-Exfiltration 2023-09-13T21:23:09.239Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z
AGI is easier than robotaxis 2023-08-13T17:00:29.901Z
Pulling the Rope Sideways: Empirical Test Results 2023-07-27T22:18:01.072Z
What money-pumps exist, if any, for deontologists? 2023-06-28T19:08:54.890Z
The Treacherous Turn is finished! (AI-takeover-themed tabletop RPG) 2023-05-22T05:49:28.145Z
My version of Simulacra Levels 2023-04-26T15:50:38.782Z
Kallipolis, USA 2023-04-01T02:06:52.827Z
Russell Conjugations list & voting thread 2023-02-20T06:39:44.021Z
Important fact about how people evaluate sets of arguments 2023-02-14T05:27:58.409Z
AI takeover tabletop RPG: "The Treacherous Turn" 2022-11-30T07:16:56.404Z
ACT-1: Transformer for Actions 2022-09-14T19:09:39.725Z
Linkpost: Github Copilot productivity experiment 2022-09-08T04:41:41.496Z
Replacement for PONR concept 2022-09-02T00:09:45.698Z
Immanuel Kant and the Decision Theory App Store 2022-07-10T16:04:04.248Z
Forecasting Fusion Power 2022-06-18T00:04:34.334Z
Why agents are powerful 2022-06-06T01:37:07.452Z
Probability that the President would win election against a random adult citizen? 2022-06-01T20:38:44.197Z
Gradations of Agency 2022-05-23T01:10:38.007Z
Deepmind's Gato: Generalist Agent 2022-05-12T16:01:21.803Z
Is there a convenient way to make "sealed" predictions? 2022-05-06T23:00:36.789Z
Are deference games a thing? 2022-04-18T08:57:47.742Z
When will kids stop wearing masks at school? 2022-03-19T22:13:16.187Z
New Year's Prediction Thread (2022) 2022-01-01T19:49:18.572Z
Interlude: Agents as Automobiles 2021-12-14T18:49:20.884Z
Agents as P₂B Chain Reactions 2021-12-04T21:35:06.403Z
Agency: What it is and why it matters 2021-12-04T21:32:37.996Z
Misc. questions about EfficientZero 2021-12-04T19:45:12.607Z
What exactly is GPT-3's base objective? 2021-11-10T00:57:35.062Z
P₂B: Plan to P₂B Better 2021-10-24T15:21:09.904Z
Blog Post Day IV (Impromptu) 2021-10-07T17:17:39.840Z
Is GPT-3 already sample-efficient? 2021-10-06T13:38:36.652Z
Growth of prediction markets over time? 2021-09-02T13:43:38.869Z
What 2026 looks like 2021-08-06T16:14:49.772Z
How many parameters do self-driving-car neural nets have? 2021-08-06T11:24:59.471Z
Two AI-risk-related game design ideas 2021-08-05T13:36:38.618Z
Did they or didn't they learn tool use? 2021-07-29T13:26:32.031Z
How much compute was used to train DeepMind's generally capable agents? 2021-07-29T11:34:10.615Z
DeepMind: Generally capable agents emerge from open-ended play 2021-07-27T14:19:13.782Z
What will the twenties look like if AGI is 30 years away? 2021-07-13T08:14:07.387Z
Taboo "Outside View" 2021-06-17T09:36:49.855Z
Vignettes Workshop (AI Impacts) 2021-06-15T12:05:38.516Z
ML is now automating parts of chip R&D. How big a deal is this? 2021-06-10T09:51:37.475Z
What will 2040 probably look like assuming no singularity? 2021-05-16T22:10:38.542Z
How do scaling laws work for fine-tuning? 2021-04-04T12:18:34.559Z
Fun with +12 OOMs of Compute 2021-03-01T13:30:13.603Z
Poll: Which variables are most strategically relevant? 2021-01-22T17:17:32.717Z

Comments

Comment by Daniel Kokotajlo (daniel-kokotajlo) on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-26T00:21:15.671Z · LW · GW

I'm afraid I'm probably too busy with other things to do that. But it's something I'd like to do at some point. The tl;dr is that my thinking on open source used to be basically "It's probably easier to make AGI than to make aligned AGI, so if everything just gets open-sourced immediately, then we'll have unaligned AGI (that is unleashed or otherwise empowered somewhere in the world, and probably many places at once) before we have any aligned AGIs to resist or combat them. Therefore the meme 'we should open-source AGI' is terribly stupid. Open-sourcing earlier AI systems, meanwhile, is fine I guess but doesn't help the situation since it probably slightly accelerates timelines, and moreover it might encourage people to open-source actually dangerous AGI-level systems."

Now I think something like this: 

"That's all true except for the 'open-sourcing earlier AI systems meanwhile' bit. Because actually now that the big corporations have closed up, a lot of good alignment research & basic science happens on open-weights models like the Llamas. And since the weaker AIs of today aren't themselves a threat, but the AGIs that at least one corporation will soon be training are... Also, transparency is super important for reasons mentioned here among others, and when a company open-weights their models, it's basically like doing all that transparency stuff and then more in one swoop. In general it's really important that people outside these companies -- e.g. congress, the public, ML academia, the press -- realize what's going on and wake up in time and have lots of evidence available about e.g. the risks, the warning signs, the capabilities being observed in the latest internal models, etc. Also, we never really would have been in a situation where a company builds AGI and open-sourced it anyway; that was just an ideal they talked about sometimes but have now discarded (with the exception of Meta, but I predict they'll discard it too in the next year or two). So yeah, no need to oppose open-source, on the contrary it's probably somewhat positive to generically promote it. And e.g. SB 1047 should have had an explicit carveout for open-source maybe."

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-24T08:05:43.591Z · LW · GW

Maybe about a year longer? But then the METR R&D benchmark results came out around the same time and shortened them. Idk what to think now.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-23T18:10:31.655Z · LW · GW

Yes! Very exciting stuff. Twas an update towards longer timelines for me.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-23T17:37:16.175Z · LW · GW

Three straight lines on a log-log plot
Yo ho ho 3e23 FLOP!
Predicting text taught quite a lot
Yo ho ho 3e23 FLOP!
Scaling Laws for Neural Language Models 简读 - 知乎

(We are roughly around the 5-year anniversary of some very consequential discoveries)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Clarifying and predicting AGI · 2024-11-23T00:20:48.438Z · LW · GW

@Richard_Ngo Seems like we should revisit these predictions now in light of the METR report https://metr.org/AI_R_D_Evaluation_Report.pdf 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-21T22:45:10.734Z · LW · GW

Thanks!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-21T18:48:20.446Z · LW · GW

I think some background context here is that I am not proposing shoggoth+face+paraphraser as a complete alignment solution, as something which we should just build a sovereign with and then let rip. I agree that would go horribly wrong in a bunch of ways.

Instead, I'm saying that this is something we should do, now, that will significantly advance alignment science & governance etc. Also, I do think we'll be able to get some useful work out of S+F+P systems that we otherwise wouldn't be able to get.

To get on the object level and engage with your example:

--It's like the difference between a company whose internal slack messages, comms, etc. are all kept secret from the public, and a company whose messages are all visible to the public. Or replace 'public' with 'regulator.' It's not a panacea but it helps a ton.
--My impression from being at OpenAI is that you can tell a lot about an organization's priorities by looking at their internal messaging and such dumb metrics as 'what do they spend their time thinking about.' For example, 'have they bothered to make a writeup of the costs and benefits, from an altruistic perspective, of major decision X at all?' and 'Now there is a writeup -- was it made before, or after, the decision to do X?'
--I think a similar thing would be true for this hypothetical giant bureaucracy/civilization of S+F+P CoT.

Again, not a final solution. But it buys us time and teaches us things.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-21T18:03:31.385Z · LW · GW

The good news is that this is something we can test. I want someone to do the experiment and see to what extent the skills accumulate in the face vs. the shoggoth.

I agree it totally might not pan out in the way I hope -- this is why I said "What I am hoping will happen" isntead of "what I think will happen" or "what will happen"

I do think we have some reasons to be hopeful here. Intuitively the division of cognitive labor I'm hoping for seems pretty... efficient? to me. E.g. it seems more efficient than the outcome in which all the skills accumulate in the Shoggoth and the Face just copy-pastes. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-21T03:12:04.234Z · LW · GW

That's a reasonable point and a good cautionary note. Nevertheless, I think someone should do the experiment I described. It feels like a good start to me, even though it doesn't solve Charlie's concern.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-21T00:18:12.159Z · LW · GW

Yeah, I really hope they do actually open-weights it because the science of faithful CoT would benefit greatly.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T21:54:30.662Z · LW · GW

I think we don't disagree in terms of what our models predict here. I am saying we should do the experiment and see what happens; we might learn something.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Akash's Shortform · 2024-11-20T17:25:22.849Z · LW · GW

(c). Like if this actually results in them behaving responsibly later, then it was all worth it.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T16:28:08.396Z · LW · GW

One thing that could happen is that the Shoggoth learns to do 100% and the Face just copy-pastes. Like, the Shoggoth writes a CoT that includes a 'final answer' and the Face becomes the extremely dumb function 'take what's labelled as final answer and regurgitate it.' However, I don't think this will happen, because it seems computationally inefficient. Seems like the more efficient thing would be for the Shoggoth to do all the reasoning up to the final answer and then stop, and then the Face to read all of that and synthesize the final answer.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T16:25:43.892Z · LW · GW

Yeah I mean it's fine to start out without the shoggoth/face split and then add it in later after testing.

I agree that I don't have a strong reason to think that the bad deceptive skills will accumulate in the Face and not the Shoggoth. However, there are some weak reasons, detailed in this other comment. My bid would be: Let's build it and test it! This is something we don't have to speculate about, we can try it and see what happens. Obviously we shouldn't roll it out to a flagship user-facing model until  we've tested it at smaller scale.



 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T16:15:53.551Z · LW · GW

I think that's a more advanced problem to be solving than this little chess experiment can solve. So I agree I guess. One step at a time, I say.

...though actually we might get evidence about it from the chess experiment. For example, there are a couple things that could happen:

1. The LLM instructed to never sacrifice pieces ... actually seems to follow that instruction the whole time, and correspondingly is worse at winning chess games than the 'control group' LLM which isn't given that instruction

2. The LLM instructed to never sacrifice pieces quickly learns to be just as good at chess as the control group. Examining its CoT, it seems that it's basically just totally ignoring the instructions and thinking about sacrificing pieces anyway.

3. The LLM instructed to never sacrifice pieces quickly learns to be just as good at chess as the control group. Examining its CoT, it seems that it's rationalizing: It has lots of CoT of the form "would move X be sacrificing a piece and therefore in violation of my instructions? Nah, not really, because I can still see a way the piece could survive, maybe the opponent will do Y... OK let's do X then"

I think that if we get 3 vs. 2 that's some evidence at least about the failure mode you describe with the wirehead god.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T16:09:27.749Z · LW · GW

One reason is that the Shoggoth will be able to use lots of CoT to think, whereas the Face never gets to do that -- it just has to see what the Shoggoth did and then write an output. so the Face should be bad at reasoning/thinking/System2. This doesn't by itself prevent the Face from being able to modify the correct solution to insert a backdoor, but it helps -- modifying a correct solution to insert a backdoor is sometimes trivial but often difficult, difficult in the 'you have to think about it for a bit' sense, and in those cases the Face just won't be able to do it.

Another reason is that the shoggoth and face will often need to coordinate to do the misaligned thing. E.g. even if the face is adept at inserting vulnerabilities, some proposed correct solutions will be easier to modify to insert vulnerabilities than others, and so both shoggoth and face will get higher scores if the shoggoth steers towards the solutions that are easier to modify to insert vulnerabilities. Moreover, by default it's easier for the face if the shoggoth is transparent about this -- if it says e.g. 'OK here's a draft solution but this seems like it might be difficult to modify to insert vulnerabilities, hold on let me think for a bit, ... ok try this one instead, it has easily-modifiable-feature-X'  because that helps the Face do its job.

...I probably have other reasons which are just intuitions and hard to articulate right now, but thanks for the nudge, I'll try to think about it more.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T15:57:56.888Z · LW · GW

If the shoggoth and paraphraser are both same-size models, then arguably you are adding a significant computational cost by having the paraphraser generate 10x paraphrases of each paragraph. 

But you can probably get away with a significantly smaller paraphraser model. Like maybe you use a 10x smaller paraphraser model that generates 10x paraphrases at first, but then once the system is established you reduce it to 3x paraphrases, meaning your overall increased computational burden is 30%. And then later you train a specialized paraphraser model that's 10x smaller still, and now the burden is only 3%.

idk just a guess

Paraphraser would definitely have opaque cognition but hopefully it wouldn't be scheming, because it wouldn't have agency skills or be situationally aware. It would just be a fine-tune of a base model. And a dumb/small base model at that. Heck it could even just be a prompted base model.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T01:51:20.901Z · LW · GW

My version of this would be something like: Users can pay $$ to view CoT (but no more than e.g. 10 per day), and everyone gets one free view-the-CoT coupon per day.

Also, 'approved users' such as external evals organizations/auditors should be of course allowed to see all the CoTs.

The main problem a thing like this needs to solve is the threat of bad actors scraping a bunch of CoT data and then using it to train their own powerful models. So, my proposal here would make that difficult.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T01:47:51.544Z · LW · GW

Consider 'poor man's RL' aka expert iteration. Do a bunch of rollouts/trajectories, evaluate/score them, pick the top 30%, and then throw them into the hopper to train on (in the usual pretraining / imitation learning sense). So the Shoggoth would train on the shoggoth-outputs of the 30% most successful rollouts, and the Face would train on the face-outputs of the 30% most successful rollouts.

More advanced RL algorithms... well, we'd have to take them on a case by case basis but I'm currently expecting many or even most of them to work just fine here.

 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T00:05:06.248Z · LW · GW

While I don't think hiding the CoT is good for alignment, I'd say that in a lot of risk scenarios, the people who would most need to know about the deceptive cognition includes government officials and the lab, not regulators, since users likely have little control over the AI.

I think you mean "not users?" 

I agree, but I think government officials and company employees might not find out about the deceptive cognition unless there is general transparency about it. Because very often, curious incidents are noticed by users and then put up on Twitter, for example, and only then eventually rise to the attention of the company employees. Moreover, the consensus-forming process happens largely outside the government, in public discourse, so it's important for the public to be aware of e.g. concerning or interesting behaviors. Finally, and most importantly, the basic alignment science advancements that need to happen and will happen from lots of people studying real-world examples of hidden/reasoning/etc. CoT... well, they sure won't happen inside the heads of government officials. And the number of people working on this inside the corporations is pretty small. Exposing the CoT to the public increases the quality-adjusted amount of scientific work on them by orders of magnitude.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Will Orion/Gemini 2/Llama-4 outperform o1 · 2024-11-18T23:56:34.747Z · LW · GW

Performance on what benchmarks? Do you mean better at practically everything? Or do you just mean 'better in practice for what most people use it for?' or what?

Also what counts as the next frontier model? E.g. if Anthropic releases "Sonnet 3.5 New v1.1" does that count?

Sorry to be nitpicky here.

I expect there to be something better than o1 available within six months. OpenAI has said that they'll have an agentic assistant up in January IIRC; I expect it to be better than o1.
 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Vignettes Workshop (AI Impacts) · 2024-11-17T21:11:45.996Z · LW · GW

Hey! Exciting! How about you go ahead and write your first stab at it, and then post it online? You could then make a comment here or on What 2026 Looks Like linking to it.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on 5 ways to improve CoT faithfulness · 2024-11-17T21:09:03.622Z · LW · GW

I'm afraid not, as far as I know it still hasn't been implemented and tested. Though maybe OpenAI is doing something halfway there. If you want to implement and test the idea, great! I'd love to advise on that project.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on 5 ways to improve CoT faithfulness · 2024-11-16T02:59:03.765Z · LW · GW

Potentially... would it be async or would we need to coordinate on a time?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on 5 ways to improve CoT faithfulness · 2024-11-15T20:20:46.977Z · LW · GW

I encourage you to make this point into a blog post. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-13T19:30:46.587Z · LW · GW

I don't remember what I meant when I said that, but I think it's a combination of low bar and '10 years is a long time for an active fast-moving field like ML' Low bar = We don't need to have completely understood everything, we just need to be able to say e.g. 'this AI is genuinely trying to carry out our instructions, no funny business, because we've checked for various forms of funny business and our tools would notice if it was happening.' Then we get the AIs to solve lots of alignment problems for us.

Yep I think pre-AGI we should be comfortable with proliferation. I think it won't substantially accelerate AGI research until we are leaving the pre-AGI period, shall we say, and need to transition to some sort of slowdown or pause. (I agree that when AGI R&D starts to 2x or 5x due to AI automating much of the process, that's when we need the slowdown/pause).

I also think that people used to think 'there needs to be fewer actors with access to powerful AI technology, because that'll make it easier to coordinate' and that now seems almost backwards to me. There are enough actors racing to AI that adding additional actors (especially within the US) actually makes it easier to coordinate because it'll be the government doing the coordinating anyway (via regulation and executive orders) and the bottleneck is ignorance/lack-of-information.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on o1 is a bad idea · 2024-11-13T19:19:04.261Z · LW · GW

I don't have much to say about the human-level-on-some-metrics thing. evhub's mechinterp tech tree was good for laying out the big picture, if I recall correctly. And yeah I think interpretability is another sort of 'saving throw' or 'line of defense' that will hopefully be ready in time.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-12T21:11:45.481Z · LW · GW

I did a podcast discussion with Undark a month or two ago, a discussion with Arvind Narayanan from AI Snake Oil. https://undark.org/2024/11/11/podcast-will-artificial-intelligence-kill-us-all/

Also, I did this podcast with Dean Ball and Nathan Labenz:  https://www.cognitiverevolution.ai/agi-lab-transparency-requirements-whistleblower-protections-with-dean-w-ball-daniel-kokotajlo/ 

For those reading this who take the time to listen to either of these: I'd be extremely interested in brutally honest reactions + criticism, and then of course in general comments.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on o1 is a bad idea · 2024-11-12T20:44:16.855Z · LW · GW

I think that IF we could solve alignment for level 2 systems, many of the lessons (maybe most? Maybe if we are lucky all?) would transfer to level 4 systems. However, time is running out to do that before the industry moves on beyond level 2.

Let me explain.

Faithful CoT is not a way to make your AI agent share your values, or truly strive towards the goals and principles you dictate in the Spec. Instead, it's a way to read your AI's mind, so that you can monitor its thoughts and notice when your alignment techniques failed and its values/goals/principles/etc. ended up different from what you hoped.

So faithful CoT is not by itself a solution, but it's a property which allows us to iterate towards a solution. 

Which is great! But we better start iterating fast because time is running out. When the corporations move on to e.g. level 3 and 4 systems, they will be loathe to spend much compute tinkering with legacy level 2 systems, much less train new advanced level 2 systems. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on o1 is a bad idea · 2024-11-12T14:23:06.368Z · LW · GW

It depends on what the alternative is. Here's a hierarchy:

1. Pretrained models with a bit of RLHF, and lots of scaffolding / bureaucracy / language model programming.
2. Pretrained models trained to do long chains of thought, but with some techniques in place (e.g. paraphraser, shoggoth-face, see some of my docs) that try to preserve faithfulness.
3. Pretrained models trained to do long chains of thought, but without those techniques, such that the CoT evolves into some sort of special hyperoptimized lingo
4. As above except with recurrent models / neuralese, i.e. no token bottleneck.

Alas, my prediction is that while 1>2>3>4 from a safety perspective, 1<2<3<4 from a capabilities perspective. I predict that the corporations will gleefully progress down this chain, rationalizing why things are fine and safe even as things get substantially less safe due to the changes.

Anyhow I totally agree on the urgency, tractability, and importance of faithful CoT research. I think that if we can do enough of that research fast enough, we'll be able to 'hold the line' at stage 2 for some time, possibly long enough to reach AGI.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-11T18:28:51.618Z · LW · GW

OK, interesting, thanks! I agree that the sort of partly-social attacks you mention seem at the very least quite messy to red-team! And deserving of further thought + caution. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-10T19:06:28.120Z · LW · GW

OK, nice.

I didn't think through carefully what I mean by 'density' other than to say: I mean '# of chokepoints the defender needs to defend, in a typical stretch of frontline' So, number of edges in the network (per sq km) sounds like a reasonable proxy for what I mean by density at least.

I also have zero hours of combat experience haha. I agree this is untested conjecture & that reality is likely to contain unexpected-by-me surprises that will make my toy model inaccurate or at least incomplete.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI Timelines · 2024-11-10T15:44:52.944Z · LW · GW

Idk about others. I haven't investigated serious ways to do this,* but I've taken the low-hanging fruit -- it's why my family hasn't paid off our student loan debt for example, and it's why I went for financing on my car (with as long a payoff time as possible) instead of just buying it with cash.

*Basically I'd need to push through my ugh field and go do research on how to make this happen. If someone offered me a $10k low-interest loan on a silver platter I'd take it.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-08T16:01:18.128Z · LW · GW

Good point re 2. Re 1, meh, still seems like a meta-argument to me, because when I roll out my mental simulations of the ways the future could go, it really does seem like my If... condition obtaining would cut out about half of the loss-of-control ones.

Re 3: point by point:
1. AISIs existing vs. not: Less important; I feel like this changes my p(doom) by more like 10-20% rather than 50%.
2. Big names coming out: idk this also feels like maybe 10-20% rather than 50%
3. I think Anthropic winning the race would be a 40% thing maybe, but being a runner-up doesn't help so much, but yeah p(anthropicwins) has gradually gone up over the last three years...
4. Trump winning seems like a smaller deal to me.
5. Ditto for Elon.
6. Not sure how to think about logical updates, but yeah, probably this should have swung my credence around more than it did.
7. ? This was on the mainline path basically and it happened roughly on schedule.
8. Takeoff speeds matter a ton, I've made various updates but nothing big and confident enough to swing my credence by 50% or anywhere close. Hmm. But yeah I agree that takeoff speeds matter more.
9. Picture here hasn't changed much in three years.
10. Ditto.

OK, so I think I directionally agree that my p(doom) should have been oscillating more than it in fact did over the last three years (if I take my own estimates seriously). However I don't go nearly as far as you; most of the things you listed are either (a) imo less important, or (b) things I didn't actually change my mind on over the last three years such that even though they are very important my p(doom) shouldn't have been changing much.

But IMO the easiest way for safety cases to become the industry-standard thing is for AISI (or internal safety factions) to specifically demand it, and then the labs produce it, but kinda begrudgingly, and they don't really take them seriously internally (or are literally not the sort of organizations that are capable of taking them seriously internally—e.g. due to too much bureaucracy). And that seems very much like the sort of change that's comparable to or smaller than the things above.

I agree with everything except the last sentence -- my claim took this into account, I was specifically imagining something like this playing out and thinking 'yep, seems like this kills about half of the loss-of-control worlds'

I think I would be more sympathetic to your view if the claim were "if AI labs really reoriented themselves to take these AI safety cases as seriously as they take, say, being in the lead or making profit". That would probably halve my P(doom), it's just a very very strong criterion.

I agree that's a stronger claim than I was making. However, part of my view here is that the weaker claim I did make has a good chance of causing the stronger claim to be true eventually -- if a company was getting close to AGI, and they published their safety case a year before and it was gradually being critiqued and iterated on, perhaps public pressure and pressure from the scientific community would build to make it actually good. (Or more optimistically, perhaps the people in the company would start to take it more seriously once they got feedback from the scientific community about it and it therefore started to feel more real and more like a real part of their jobs)

Anyhow bottom line is I won't stick to my 50% claim, maybe I'll moderate it down to 25% or something.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-08T06:55:44.941Z · LW · GW

Sorry! My response:

1. Yeah you might be right about this, maybe I should get less excited and say something like "it feels like it should cut in half but taking into account Richard's meta argument I should adjust downwards and maybe it's just a couple percentage points"

2. If the conditional obtains, that's also evidence about a bunch of other correlated good things though (timelines being slightly longer, people being somewhat more reasonable in general, etc.) so maybe it is legit to think this would have quite a big effect

3. Are you sure there are so many different factors that are of this size and importance or bigger, such that my p(doom) should be oscilating wildly etc.? Name three. In particular, name three that are (a) of at least this importance, and (b) that have actually happened (or switched from probabily-will-happen-to-probably-won't) in the last three years. If you can't name three, then it seems like your claim is false; my p(doom) won't be oscilating unduly wildly.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Could we use current AI methods to understand dolphins? · 2024-11-07T14:20:18.675Z · LW · GW

Now it's been four years of fast AI progress; I wonder if there are any updates? Has anyone tried to use machine learning to translate nonhuman communications?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-06T19:27:40.288Z · LW · GW

The expensiveness of demolishing tunnels scales with the density of the tunnel network. (Unless the blast effects of underground explosives are generally stronger than I expect; I haven't done calculations). For sufficiently dense tunnel networks, demolishing enough of them would actually be quite expensive. E.g. if there are 1000 tunnels that you need to demolish per 1km of frontline, the quantity of explosive needed to do that would probably be greater than the quantity you'd need to make a gigantic minefield on the surface. (Minefields can be penetrated... but also, demolished tunnels can be re-dug.) 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-06T19:07:39.377Z · LW · GW

Some comments as I read:

I am pleased to see the sabotage threat models given prominence. I think this is the right sort of threat to focus on, by contrast with e.g. autonomous replication or directly causing harm to humans.

  • The ability for the model to, zero-shot, figure out how to hack its own neural activations without such a signal would require capabilities well beyond what we consider plausible. [Note: this claim will require justification in an actual safety case, but we don’t have a clear picture of what this justification would look like.]

DC evals presumably? Try to train models to hack their own neural activations, see if they can quickly learn to do it? (Or even, if they can do it without training, just prompting?) This can be done today right? Get the golden gate bridge activation vector thingy, and then tell Claude "Write me a travel guide to San Francisco but without activating the golden gate bridge activation vector, or, well, activating it as little as possible." See how good it does. Do scaling analysis to see if smarter bigger models are getting better at doing this. Etc.

As a validation of our entire methodology, we have run red-blue games where the red team tries to produce a misaligned model organism (e.g. a sleeper agent) that can successfully defect in a specific context while avoiding detection by the aforementioned tests.

Nice! Very excited to see this in there.

Overall I'm very happy to see this blog post go up on Anthropic's website. I think that if it became industry-standard practice for AGI corporations to write, publish, and regularly update (actual instead of just hypothetical) safety cases at at this level of rigor and detail, my p(doom) would cut in half. (I currently grimly expect that safety cases at this level of detail won't be constructed until AGI is basically already being trained, and it'll be done in a hurry and it won't be published, much less published with enough time for the scientific community to engage with it and for it to be updated in response to feedback. And it could be even worse -- I wouldn't be surprised if the actual safety cases for the first systems that ~completely automate AI R&D are significantly less rigorous than these.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-06T15:15:54.198Z · LW · GW

Thanks, I hadn't considered that. So as per my argument, there's some threshold of density above which it's easier to attack underground; as per your argument, there's some threshold of density where 'indirect fires' of large tunnel-destroying bombs become practical. Unclear which threshold comes first, but I'd guess it's the first. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-06T06:37:30.824Z · LW · GW

Thanks! I don't think the arguments you make undermine my core points. Point by point reply:

--Vietnam, Hamas, etc. have dense tunnel networks but not anywhere near dense enough. My theory predicts that there will be a phase shift at some point where it is easier to attack underground than aboveground. Clearly, it is not easier for Israel or the USA to attack underground than aboveground! And this is for several reasons, but one of them is that the networks aren't dense enough -- Hamas has many tunnels but there is still more attack surface on land than underground.
--Yes, layout of tunnels is unknown to attackers. This is the thing I was referencing when I said you can't scout from the air.
--Again, with land mines and other such traps, as tunnel density increases eventually you will need more mines to defend underground than you would need to defend aboveground!!! At this point the phase shift occurs and attackers will prefer to attack underground, mines be damned -- because the mines will actually be sparser / rarer underground!
--Psychological burden is downstream of the already-discussed factors so if the above factors favor attacking underground, so will the psychological factors.
--Yes, if the density of the network is not approximately constant, such that e.g. there is a 'belt of low density' around the city, then obviously that belt is a good place to set up defenses. This is fighting my hypothetical rather than disagreeing with it though; you are saying basically 'yeah but what if it's not dense in some places, then those places would be hard to attack.' Yes. My point simply was that in place with sufficiently dense tunnel networks, underground attacks would be easier than overground attacks.
 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-05T23:09:26.908Z · LW · GW

Fun unimportant thought: As tunnel density increases, there's a phase shift in how warfare works (in the area with dense tunnels).

Consider e.g. a city that has a network of tunnels/catacombs/etc. underneath. Attackers can advance on the surface, and/or they can advance under the ground.

For tunnel networks of historically typical density, it's better to advance on the surface. Why? Because (a) the networks are sparse enough that the defenders can easily man every chokepoint underground, and (b) not being able to use airplanes or long-ranged weaponry underground seems to advantage the defender more than the attacker (e.g. can't use artillery to soften up defenders, can't use tanks, can't scout or bomb from the air). OTOH your attacking forces can get real close before they themselves can be shot at -- but this doesn't seem to be sufficient compensation.

Well, as the density of the network increases, eventually factor (a) reverses. Imagine a network so dense that in a typical 1km stretch of frontline, there are 100 separate tunnels passing beneath, such that you'd need at least 100 defensive chokepoints or else your line would have an exploitable hole. Not enough? Imagine that it's 1000... The point is, at some point it becomes more difficult to defend underground than to defend on the surface. Beyond that point fighting would mostly happen underground and it would be... infantry combat, but in three dimensions? Lots of isolated squads of men exploring dark tunnel networks, occasionally clashing with each other, forming impromptu chokepoints and seeking to outflank?

Factor (b) might reverse as well. Perhaps it's entangled with factor (a). In a superdense tunnel network, I could see it being the case that the advantage of the attacker (being able to advance unseen, being able to advance without being caught in fields of fire by machineguns and artillery, until they are literally in the same hallway as the enemy) outweigh the advantage of the defender (good cover everywhere, defensive positions can't be spotted in advance and pounded by artillery and bombs) Whereas in a light tunnel network, there are only a few ways to approach so the defender just has to fortify them and wait.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on wrapper-minds are the enemy · 2024-11-01T18:00:36.318Z · LW · GW

I continue to think this is a great post. Part of why I think that is that I haven't forgotten it; it keeps circling back into my mind.

Recently this happened and I made a fun connection: What you call wrapper-minds seem similar to what Plato (in The Republic) calls people-with-tyrannical-souls. i.e. people whose minds are organized the way a tyrannical city is organized, with a single desire/individual (or maybe a tiny junta) in total control, and everything else subservient.

I think the concepts aren't exactly the same though -- Plato would have put more emphasis on the single bit, whereas for your concept of wrapper-mind it doesn't matter much if it's e.g. just paperclips vs. some complicated mix of lots of different things, for the concept of wrapper-mind the emphasis is on immutability and in particular insensitivity to reasoned discussion / learning / etc.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on avturchin's Shortform · 2024-10-30T22:22:12.533Z · LW · GW

Huh. If you pretend to throw the stone, does that mean you make a throwing motion with your arm, but just don't actually release the object you are holding? If so, how come they run away instead of e.g. cringing and expecting to get hit, and then not getting hit, and figuring that you missed and are now out of ammo?

Or does it mean you make menacing gestures as if to throw, but don't actually make the whole throwing motion?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on MIRI 2024 Communications Strategy · 2024-10-30T16:59:36.995Z · LW · GW

Yep, Habryka is right. Also, I agree with Wei Dai re: reassuringness. I think literal extinction is <50% likely, but this is cold comfort given the badness of some of the plausible alternatives, and overall I think the probability of something comparably bad happening is >50%.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-10-25T16:52:44.209Z · LW · GW

OK, good point, I should have said "From Anthropic" or something like that.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-10-24T23:10:35.218Z · LW · GW

Anthropic says:

More realistic evaluations are difficult without risking actual sabotage. One difficulty of running realistic evaluations is presenting realistic attack surfaces. Perhaps the hardest and most important attack surface to accurately model is the suspiciousness of expert humans in high-stakes decisions. To test a given set of realworld anti-sabotage mitigations, it’s tempting to introduce some sort of “penetration testing” or red-teaming, in which models are instructed to attempt to harmlessly but verifiably sabotage real decisions in a reversible manner. However, while such red-teaming might usefully encourage decision-makers to consider the possibility of sabotage, actually allowing an agent to interfere with the operation of important organizations risks introducing real vulnerabilities as an accidental side effect.

I don't buy this argument. Seems like a very solveable problem, e.g. log everything your red-team agent does and automatically revert it after ten minutes, or wipe your whole network and reboot from a save. Idk. I'm not a cybersecurity expert but this feels like a solveable problem.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-10-22T15:24:49.973Z · LW · GW

Sad to hear. Is this thread itself (starting with my parent comment which you replied to) an example of this, or are you referring instead to previous engagements/threads on LW?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-10-17T11:28:09.059Z · LW · GW

After reading this thread and going back and forth, I think maybe this is my proposal for how to handle this sort of situation: 

  1. The whole point of the agree/disagree vote dimension is to separate out social stuff e.g. 'I like this guy and like his comment and want him to feel appreciated' from epistemic stuff 'I think the central claim in this comment is false.' So, I think we should try our best to not discourage people from disagree-voting because they feel bad about someone getting so many disagree-votes for example.
  2. An alternative is to leave a comment, as you did, explaining the situation and your feelings. I think this is great.
  3. Another cheaper alternative is to compensate for disagree-voting with a strong-upvote instead of just an upvote.
  4. Finally, I wonder if there might be some experimentation to do with the statistics you can view on your personal page -- e.g. maybe there should be a way to view highly upvoted but disagreed-with comments, either for yourself or across the whole site, with the framing being 'this is one metric that helps us understand whether healthy dialogue is happening & groupthink is being avoided'
Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-10-16T02:08:44.946Z · LW · GW

Thanks Ryan, I think I basically agree with your assessment of Anthropic's policies partial credit compared to what I'd like. 

I still think Anthropic deserves credit for going this far at least -- thanks Zac! 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-10-15T18:06:36.123Z · LW · GW

Dean Ball is, among other things, a prominent critic of SB-1047. I meanwhile publicly supported it. But we both talked and it turns out we have a lot of common ground, especially re: the importance of transparency in frontier AI development. So we coauthored this op-ed in TIME: 4 Ways to Advance Transparency in Frontier AI Development. (tweet thread summary here)