Is this a good way to bet on short timelines? 2020-11-28T12:51:07.516Z
Persuasion Tools: AI takeover without AGI or agency? 2020-11-20T16:54:01.306Z
How Roodman's GWP model translates to TAI timelines 2020-11-16T14:05:45.654Z
How can I bet on short timelines? 2020-11-07T12:44:20.360Z
What considerations influence whether I have more influence over short or long timelines? 2020-11-05T19:56:12.147Z
AI risk hub in Singapore? 2020-10-29T11:45:16.096Z
The date of AI Takeover is not the day the AI takes over 2020-10-22T10:41:09.242Z
If GPT-6 is human-level AGI but costs $200 per page of output, what would happen? 2020-10-09T12:00:36.814Z
Where is human level on text prediction? (GPTs task) 2020-09-20T09:00:28.693Z
Forecasting Thread: AI Timelines 2020-08-22T02:33:09.431Z
What if memes are common in highly capable minds? 2020-07-30T20:45:17.500Z
What a 20-year-lead in military tech might look like 2020-07-29T20:10:09.303Z
Does the lottery ticket hypothesis suggest the scaling hypothesis? 2020-07-28T19:52:51.825Z
Probability that other architectures will scale as well as Transformers? 2020-07-28T19:36:53.590Z
Lessons on AI Takeover from the conquistadors 2020-07-17T22:35:32.265Z
What are the risks of permanent injury from COVID? 2020-07-07T16:30:49.413Z
Relevant pre-AGI possibilities 2020-06-20T10:52:00.257Z
Image GPT 2020-06-18T11:41:21.198Z
List of public predictions of what GPT-X can or can't do? 2020-06-14T14:25:17.839Z
Preparing for "The Talk" with AI projects 2020-06-13T23:01:24.332Z
Reminder: Blog Post Day III today 2020-06-13T10:28:41.605Z
Blog Post Day III 2020-06-01T13:56:10.037Z
Predictions/questions about conquistadors? 2020-05-22T11:43:40.786Z
Better name for "Heavy-tailedness of the world?" 2020-04-17T20:50:06.407Z
Is this viable physics? 2020-04-14T19:29:28.372Z
Blog Post Day II Retrospective 2020-03-31T15:03:21.305Z
Three Kinds of Competitiveness 2020-03-31T01:00:56.196Z
Reminder: Blog Post Day II today! 2020-03-28T11:35:03.774Z
What are the most plausible "AI Safety warning shot" scenarios? 2020-03-26T20:59:58.491Z
Could we use current AI methods to understand dolphins? 2020-03-22T14:45:29.795Z
Blog Post Day II 2020-03-21T16:39:04.280Z
What "Saving throws" does the world have against coronavirus? (And how plausible are they?) 2020-03-04T18:04:18.662Z
Blog Post Day Retrospective 2020-03-01T11:32:00.601Z
Cortés, Pizarro, and Afonso as Precedents for Takeover 2020-03-01T03:49:44.573Z
Reminder: Blog Post Day (Unofficial) 2020-02-29T15:10:17.264Z
Response to Oren Etzioni's "How to know if artificial intelligence is about to destroy civilization" 2020-02-27T18:10:11.129Z
What will be the big-picture implications of the coronavirus, assuming it eventually infects >10% of the world? 2020-02-26T14:19:27.197Z
Blog Post Day (Unofficial) 2020-02-18T19:05:47.140Z
Simulation of technological progress (work in progress) 2020-02-10T20:39:34.620Z
A dilemma for prosaic AI alignment 2019-12-17T22:11:02.316Z
A parable in the style of Invisible Cities 2019-12-16T15:55:06.072Z
Why aren't assurance contracts widely used? 2019-12-01T00:20:21.610Z
How common is it for one entity to have a 3+ year technological lead on its nearest competitor? 2019-11-17T15:23:36.913Z
Daniel Kokotajlo's Shortform 2019-10-08T18:53:22.087Z
Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann 2019-10-07T19:52:19.266Z
Soft takeoff can still lead to decisive strategic advantage 2019-08-23T16:39:31.317Z
The "Commitment Races" problem 2019-08-23T01:58:19.669Z
The Main Sources of AI Risk? 2019-03-21T18:28:33.068Z


Comment by daniel-kokotajlo on In Addition to Ragebait and Doomscrolling · 2020-12-03T21:29:42.377Z · LW · GW


Comment by daniel-kokotajlo on Book review: WEIRDest People · 2020-12-02T17:20:18.371Z · LW · GW

Good point about the colonization. One thing I was surprised to learn when researching the conquistadors is that Muslim merchants, fleets, armies, and rulers had penetrated into India, Indonesia, all around the Indian Ocean, and I think even into China by the time the Portuguese showed up. Malacca was ruled by a Muslim, for example. And yeah, no doubt this led to a lot of resources flowing back toward the Middle East.

How much damage did the Mongols do to Muslim science? My vague guess would be, quite a lot? Perhaps this is also relevant.

Comment by daniel-kokotajlo on AI Safety "Success Stories" · 2020-12-02T15:36:06.686Z · LW · GW

I'm surprised this post didn't get more comments and spark more further research. Rereading it, I think it's both an excellent overview/distillation, and also a piece of strategy research in its own right. I wish there were more things like this. I think this post deserves to be expanded into a book or website and continually updated and refined.

Comment by daniel-kokotajlo on The Parable of Predict-O-Matic · 2020-12-02T15:28:15.906Z · LW · GW

This piece of fiction is good sci-fi. It is fun to read and makes you think. In this case, it makes you think about some really important issues in AI safety and AI alignment.

Comment by daniel-kokotajlo on Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists · 2020-12-02T15:22:44.100Z · LW · GW

This post improved my understanding of how censorship, polarization, groupthink, etc. work. I also love the slogan "Blatant cherry-picking is the best kind."

Comment by daniel-kokotajlo on No nonsense version of the "racial algorithm bias" · 2020-12-02T15:19:49.444Z · LW · GW

This post does what it says in the title. With diagrams! It's not about AI safety or anything, but it's a high-quality explainer on a much-discussed topic that I'd be happy to link people to.

Comment by daniel-kokotajlo on Rule Thinkers In, Not Out · 2020-12-02T14:51:37.213Z · LW · GW

This post seems like good life advice for me and people like me, when taken with appropriate caution. It's well-written too, of course.

Comment by daniel-kokotajlo on Thoughts on Human Models · 2020-12-02T14:40:40.391Z · LW · GW

This post raises awareness about some important and neglected problems IMO. It's not flashy or mind-blowing, but it's solid.

Comment by daniel-kokotajlo on Humans Who Are Not Concentrating Are Not General Intelligences · 2020-12-02T14:36:04.523Z · LW · GW

I keep thinking about the title (/central claim) of this post. I'm not sure it's true, but it's given me a lot to think about. I think this post is useful for understanding GPT etc.

Comment by daniel-kokotajlo on Understanding “Deep Double Descent” · 2020-12-02T14:14:50.264Z · LW · GW

I'm no ML expert, but thanks to this post I feel like I have a basic grasp of some important ML theory. (It's clearly written and has great graphs.) This is a big deal because this understanding of deep double descent has shaped my AI timelines to a noticeable degree.

Comment by daniel-kokotajlo on Programmers Should Plan For Lower Pay · 2020-12-02T12:57:20.480Z · LW · GW

I'm fascinated to hear how this went. Well done, Randomsong, and please let us know what happened!

Comment by daniel-kokotajlo on The LessWrong 2018 Book is Available for Pre-order · 2020-12-02T12:04:12.247Z · LW · GW

This looks awesome! Can't wait to get mine.

Is there something similar for The Codex? On Amazon I see a physical collection of slatestarcodex essays, but it has poor reviews, saying it's just a scrape of the website without images. Is it even official?

Comment by daniel-kokotajlo on The LessWrong 2018 Book is Available for Pre-order · 2020-12-02T11:59:36.732Z · LW · GW

I think the pre-order link is broken? It takes me to a LW page saying "No comment found."

Comment by daniel-kokotajlo on Book review: WEIRDest People · 2020-12-01T20:34:46.742Z · LW · GW

See, if they both had engineering cultures and proto-capitalism, that seems like evidence for the "Because colonialism" hypothesis.

But I do think the "never really unified" hypothesis is intriguing. After all, the Chinese not only destroyed their own treasure fleet but basically banned maritime trade and sent the army to depopulate their own coastline for 20km or so inland, IIRC, because of the misguided policy decisions of the central government. No central government, no misguided policy decisions applied to entire civilizations.

Comment by daniel-kokotajlo on Book review: WEIRDest People · 2020-12-01T20:30:01.984Z · LW · GW

Yeah, good point, maybe it was something like "Will to explore and colonize" that was the most important variable, even more important than the ships+navigation tech. Or maybe it was a more generic tech advantage that made it cheaper and more profitable for Europeans to do it than for the Chinese or Arabs to do it.

I think the ships+navigation tech are definitely worth mentioning at least, because they were necessary, and not easy to acquire. And Europeans were certainly disproportionately good at them at the time, as far as I can tell. I know their ships were (in the relevant ways) slightly superior to the ships in the Indian Ocean in 1500, and while I haven't looked this up, I'd be willing to bet that their navigation tech (and therefore, their ability to cross the Pacific and Atlantic) was superior to that of the Chinese. The Polynesians had excellent navigation tech, but tiny ships and insufficient military or economic tech to exploit this advantage. No one else comes close to those groups as far as I know.

Comment by daniel-kokotajlo on Book review: WEIRDest People · 2020-12-01T06:44:33.017Z · LW · GW

OK, thanks -- this is the sort of argument I was looking for, and it is updating me.

Do you think there's a common cause between colonialism and the IR? Say, science + capitalism? Or maybe simply engineering culture (producing steam engines, but before that ships + navigation?)

Comment by daniel-kokotajlo on Book review: WEIRDest People · 2020-11-30T20:49:13.753Z · LW · GW

OK, thanks. I find it hard to take seriously the idea that the IR caused colonialism, since colonialism happened first (just look at the world in 1750!). Maybe the idea is that there was some underlying advantage Europe had which caused both colonialism and the IR?

I agree that maybe western culture is part of the explanation for why colonialism happened in the West more than it did elsewhere. But I think having good ships and navigation tech is a bigger part of the explanation.

I like the point that resources in general don't seem to cause technological growth. Russia, North Korea, etc. Vaniver mentions China below; maybe a better example would be Mongolia, which suddenly ruled almost the entire world after Genghis Khan but didn't spark an IR. (Though maybe it did spark a bunch of new tech developments? Idk, would be interested to hear.)

FWIW, my current view is something like "WEIRD culture helped there be science and market institutions to a mildly strong extent, though not dramatically more than other places like China; then the Europeans lucked into some really good ships & navigation tech (and kings eager to use them, unlike some emperors I could mention) and started sailing around a lot, and then this spurred more market institutions and more science, creating a feedback loop / snowball effect. In this story, WEIRD culture is important, but it's the ships+navigation+kings that's the most important thing." I'm no historian though and would love to hear criticisms of this take.

Comment by daniel-kokotajlo on Book review: WEIRDest People · 2020-11-30T20:36:09.914Z · LW · GW

The question is whether the IR would have happened in China if they, and not the Europeans, controlled the world's oceans, plantations, and mines. (And by that I mean, imagine if in the 1700s the Americas were all Chinese colonies, if the Indian and Pacific and Atlantic oceans were controlled by Chinese fleets and port-forts, if kingdoms all along the coast of India and Arabia and Africa and Europe swore fealty to China instead of Europe... In this scenario, would the IR still have happened in Britain?)

Yeah, China was rich, but plenty of rich places failed to generate IRs. The unique thing about Europe prior to the IR may have been its WEIRDness... but it also may have been the fact that it controlled so much of the world at the time. Why would this help? Well, maybe having a glut of resources for a relatively fixed labor pool raised GDP per capita a bunch and incentivised labor-saving devices like windmills and watermills and eventually steam engines. Maybe all the oceanic trade made for a robust free-ish market and spurred the development of good financial instruments and institutions (in other words, maybe what makes capitalism work so well was particularly present due to all the oceanic trade).

Comment by daniel-kokotajlo on Book review: WEIRDest People · 2020-11-30T11:56:47.389Z · LW · GW

Yes, this is further support for the "Because colonialism" theory. Maybe there's a nearby possible world where the Emperor got really excited about exploration and colonization, and sent the fleet out again and again instead of burning it, and then historians in the 2000s write big books about why the Industrial Revolution happened first in the East because of Confucian values.

Comment by daniel-kokotajlo on Book review: WEIRDest People · 2020-11-30T09:43:37.539Z · LW · GW

I'm surprised that none of the books on the list explaining why the IR happened in the West and not China said "Because colonialism." Look at the world in 1700, just prior to the IR: maybe China was economically and technologically advanced, but they didn't control the world's oceans, plantations, and mines. Surely there are books out there arguing for this theory. Have you read any of them? What do you think about this theory? Do the books you mention consider it and rebut it?

(I've heard people say that GDP per capita was higher in the West before colonialism even began, and use this as a rebuttal of the idea that colonialism was the cause of the IR. Is this it? To be really convincing, I'd like to see some sort of analysis of how the IR started in Britain but Britain hadn't begun to benefit much from colonialism by the time the IR started. Or something like that.)

(Or is the idea that colonialism did cause the IR, but colonialism was in turn caused by technological advantages that were the result of WEIRDness?)

Comment by daniel-kokotajlo on How can I bet on short timelines? · 2020-11-30T09:11:35.709Z · LW · GW

Nah. Once it's clear we are all doomed, I'll get by just fine probably--it's unlikely that I'll have spent literally all my money and social capital by then. And if I don't, it won't matter much anyway.

Comment by daniel-kokotajlo on How can I bet on short timelines? · 2020-11-29T14:52:26.917Z · LW · GW


Comment by daniel-kokotajlo on Is this a good way to bet on short timelines? · 2020-11-29T08:24:24.238Z · LW · GW

Your assumptions are mostly correct. Thanks for this feedback, it makes me more encouraged to propose options 1 and 2 to various people. I agree with your criticism of option 3.

Comment by daniel-kokotajlo on Is this a good way to bet on short timelines? · 2020-11-28T16:54:09.860Z · LW · GW

Yeah. I'm hopeful that versions of them can be found which are appealing to both me and my bet-partners, even if we have to judge using our intuitions rather than math.

I'm working on a big post (or sequence) outlining my views on timelines, but quantitatively they look like this black line here:

[Graph: Daniel Kokotajlo's November 2020 TAI Timelines]

My understanding is that most people have timelines that are more like the blue/bottom line on this graph.

Comment by daniel-kokotajlo on Is this a good way to bet on short timelines? · 2020-11-28T14:28:28.180Z · LW · GW

Thanks. I don't think that condition holds, alas. I'm trying to optimize for making the singularity go well, and don't care much (relatively speaking) about my level of influence afterwards. If you would like to give me some of your influence now, in return for me giving you some of my influence afterwards, perhaps we can strike a deal!

Comment by daniel-kokotajlo on Luna Lovegood and the Chamber of Secrets - Part 1 · 2020-11-28T12:58:20.775Z · LW · GW

I love the focus on epistemology! What a cool character.

Comment by daniel-kokotajlo on Persuasion Tools: AI takeover without AGI or agency? · 2020-11-24T15:38:24.521Z · LW · GW

Facebook is a single system and therefore not subject to Moloch. Concretely, yeah, the algorithm could start manipulating election results to get more power, but if that happened it would be an ordinary AI alignment failure rather than Moloch.

The bitcoin example seems more like Moloch to me, but no more so than (and in the same way as) the market economy already is: people who build more infrastructure get more money, etc. We already know that in the long run the market economy leads to terrible outcomes due to Moloch.

Comment by daniel-kokotajlo on Persuasion Tools: AI takeover without AGI or agency? · 2020-11-24T14:01:48.303Z · LW · GW

I think I can see how market and money-making is like this, but facebook and bitcoin? Can you elaborate?

Comment by daniel-kokotajlo on Continuing the takeoffs debate · 2020-11-24T09:20:20.595Z · LW · GW

I'm confused about why you only updated mildly away from slow takeoff. It seems that you've got a pretty good argument against slow takeoff here:

Are there simple changes to chimps (or other animals) that would make them much better at accumulating culture?
Will humans continually pursue all simple yet powerful changes to our AIs?

Seems like if the answer to the first question is No, then there really is some relatively sharp transition to much more powerful culture-accumulating capabilities, that humans crossed when they evolved from chimp-like creatures. Thus, our default assumption should be that as we train bigger and bigger neural nets on more and more data, there will also be some relatively sharp transition. In other words, Yudkowsky's argument is correct.

Seems like if the answer to the second question is No, then Paul's disanalogy between evolution and AI researchers is also wrong; both evolution and AI researchers are shoddy optimizers that sometimes miss things etc. So Yudkowsky's argument is correct.

Now, you put 50% on the first answer being No and 70% on the second answer being No. So shouldn't you have something like 85% credence that Paul is wrong and Yudkowsky's argument is correct? And isn't that a fairly big update against slow takeoff?

Maybe the idea is that you are meta-uncertain, unsure you are reasoning about this correctly, etc.? Or maybe the idea is that Yudkowsky's argument could easily be wrong for other reasons than the ones Paul gave? Fair enough.
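Spelling out the arithmetic behind the 85% figure above (a sketch, assuming the two answers are independent, which the argument seems to require):

```python
# Credences from the comment above (assuming the two questions are independent).
p_q1_no = 0.50  # P(no simple changes would make chimps much better at culture)
p_q2_no = 0.70  # P(humans won't pursue all simple yet powerful changes)

# Yudkowsky's argument goes through if EITHER answer is No:
p_either_no = 1 - (1 - p_q1_no) * (1 - p_q2_no)
print(p_either_no)  # 0.85
```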

Comment by daniel-kokotajlo on Persuasion Tools: AI takeover without AGI or agency? · 2020-11-24T08:56:35.353Z · LW · GW

Huh. No wonder the CIA thought this could be used for mind control.

Comment by daniel-kokotajlo on Daniel Kokotajlo's Shortform · 2020-11-24T07:05:19.509Z · LW · GW

Yeah, probably not. It would need to be an international agreement I guess. But this is true for lots of proposals. On the bright side, you could maybe tax the chip manufacturers instead of the AI projects? Idk.

Maybe one way it could be avoided is if it came packaged with loads of extra funding for safe AGI research, so that overall it is still cheapest to work from the US.

Comment by daniel-kokotajlo on AGI Predictions · 2020-11-22T10:42:01.516Z · LW · GW

FWIW, I made these judgments quickly and intuitively and thus could easily have just made a silly mistake. Thank you for pointing this out.

So, what do I think now, reflecting a bit more?

--The 7% judgment still seems correct to me. I feel pretty screwed in a world where our entire community stops thinking about this stuff. I think it's because of Yudkowskian pessimism combined with the heavy-tailed nature of impact and research. A world without this community would still be a world where people put some effort into solving the problem, but there would be less effort, by less capable people, and it would be more half-hearted/not directed at actually solving the problem/not actually taking the problem seriously.

--The other judgment? Maybe I'm too optimistic about the world where we continue working. But idk, I am rather impressed by our community and I think we've been making steady progress on all our goals over the last few years. Moreover, OpenAI and DeepMind seem to be taking safety concerns mildly seriously due to having people in our community working there. This makes me optimistic that if we keep at it, they'll take it very seriously, and that would be great.

Comment by daniel-kokotajlo on AGI Predictions · 2020-11-22T10:26:58.857Z · LW · GW

Thank you for asking this question and for giving that break-down. I was wondering something similar. I am not an AI scientist but DL seems like a very big deal to me, and thus I was surprised that so many people seemed to think we need more insights on that level. My charitable interpretation is that they don't think DL is a big deal.

Comment by daniel-kokotajlo on Persuasion Tools: AI takeover without AGI or agency? · 2020-11-21T14:08:18.118Z · LW · GW

I think I mostly agree with you about the long run, but I think we have more short-term hurdles that we need to overcome before we even make it to that point, probably. I will say that I'm optimistic that we haven't yet thought of all the ways advances in tech will help collective epistemology rather than hinder it. I notice you didn't mention debate; I am not confident debate will work but it seems like maybe it will.

In the short run, well, there's also debate I guess. And the internet's default of recording conversations and making them easily findable by everyone probably worked in favor of collective epistemology. Plus there is Wikipedia, etc. I think the internet in general has lots of things in it that help collective epistemology... it just also has things that hurt, and recently I think the balance has been shifting in a negative direction. But I'm optimistic that maybe the balance will shift back. Maybe.

Comment by daniel-kokotajlo on Persuasion Tools: AI takeover without AGI or agency? · 2020-11-21T07:59:02.505Z · LW · GW

Yes, this is one of the examples I had in mind of "Feeders."

Comment by daniel-kokotajlo on Daniel Kokotajlo's Shortform · 2020-11-20T11:10:38.329Z · LW · GW

Another cool thing about this tax is that it would automatically counteract decreases in the cost of compute. Say we make the tax 10% of the current cost of compute. Then when the next generation of chips comes online, and the price drops by an order of magnitude, automatically the tax will be 100% of the cost. Then when the next generation comes online, the tax will be 1000%.

This means that we could make the tax basically nothing even for major corporations today, and only start to pinch them later.
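A minimal sketch of that dynamic, assuming the tax is a fixed amount per unit of compute, pegged at 10% of the cost of compute at the time the tax is enacted (prices are illustrative):

```python
# Fixed per-unit compute tax vs. falling compute prices.
initial_price = 1.0                  # cost per unit of compute when the tax is set
tax_per_unit = 0.10 * initial_price  # tax pegged at 10% of the initial cost

for generation in range(3):
    price = initial_price / 10 ** generation  # each chip generation: 10x cheaper
    effective_rate = tax_per_unit / price
    print(f"generation {generation}: effective tax rate = {effective_rate:.0%}")
# generation 0: 10%, generation 1: 100%, generation 2: 1000%
```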

Comment by daniel-kokotajlo on Daniel Kokotajlo's Shortform · 2020-11-19T13:29:39.848Z · LW · GW

Maybe a tax on compute would be a good and feasible idea?

--Currently the AI community is mostly resource-poor academics struggling to compete with a minority of corporate researchers at places like DeepMind and OpenAI with huge compute budgets. So maybe the community would mostly support this tax, as it levels the playing field. The revenue from the tax could be earmarked to fund "AI for good" research projects. Perhaps we could package the tax with additional spending for such grants, so that overall money flows into the AI community, whilst reducing compute usage. This will hopefully make the proposal acceptable and therefore feasible.

--The tax could be set so that it is basically 0 for everything except for AI projects above a certain threshold of size, and then it's prohibitive. To some extent this happens naturally since compute is normally measured on a log scale: If we have a tax that is 1000% of the cost of compute, this won't be a big deal for academic researchers spending $100 or so per experiment (Oh no! Now I have to spend $1,000! No big deal, I'll fill out an expense form and bill it to the university) but it would be prohibitive for a corporation trying to spend a billion dollars to make GPT-5. And the tax can also have a threshold such that only big-budget training runs get taxed at all, so that academics are completely untouched by the tax, as are small businesses, and big businesses making AI without the use of massive scale.

--The AI corporations and most of all the chip manufacturers would probably be against this. But maybe this opposition can be overcome.
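A toy version of the threshold-plus-steep-rate schedule described above (the threshold and rate are made-up illustrative numbers, not a concrete proposal):

```python
# Compute tax that is zero below a spend threshold and steep above it.
THRESHOLD = 10_000_000  # USD of compute spend below which no tax applies (hypothetical)
RATE = 10.0             # 1000% tax on spend above the threshold (hypothetical)

def compute_tax(spend: float) -> float:
    """Tax owed on the compute spend of a single training run."""
    if spend <= THRESHOLD:
        return 0.0
    return RATE * (spend - THRESHOLD)

print(compute_tax(100))            # academic experiment: untaxed, 0.0
print(compute_tax(1_000_000_000))  # billion-dollar training run: heavily taxed
```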

Comment by daniel-kokotajlo on Daniel Kokotajlo's Shortform · 2020-11-19T13:20:42.609Z · LW · GW

The other day I heard this anecdote: several years ago, someone's friend was dismissive of AI risk concerns, thinking that AGI was very far in the future. When pressed about what it would take to change their mind, they said their fire alarm would be AI solving Montezuma's Revenge. Well, now it's solved; what do they say? Nothing; if they noticed, they didn't say. Probably, if pressed on it, they would say they were wrong before to call that their fire alarm.

This story fits with the worldview expressed in "There's No Fire Alarm for AGI." I expect this sort of thing to keep happening well past the point of no return.

Comment by daniel-kokotajlo on Propinquity Cities So Far · 2020-11-17T13:20:32.435Z · LW · GW

This seems like a plan which, if it works, will start to pay off at least a decade from now, probably two or three. Does this assessment seem right to you?

Comment by daniel-kokotajlo on How Roodman's GWP model translates to TAI timelines · 2020-11-17T11:33:27.806Z · LW · GW

Thanks for the feedback. I'll think about it. To be honest I feel like I have higher priorities right now, but going forward I'll update more towards doing recaps in my posts.

Comment by daniel-kokotajlo on How Roodman's GWP model translates to TAI timelines · 2020-11-16T19:07:19.831Z · LW · GW

Yes, though as a nitpick I don't think the black line is the singularity that got cancelled; that one was supposed to happen in 2020 or so, and as you can see the black line diverges from history well before 1950.

Comment by daniel-kokotajlo on What considerations influence whether I have more influence over short or long timelines? · 2020-11-16T05:45:55.395Z · LW · GW

OK, sent you a PM

Comment by daniel-kokotajlo on What considerations influence whether I have more influence over short or long timelines? · 2020-11-14T14:01:19.411Z · LW · GW

I think I mostly agree with you about innovation, but (a) I think that building AI will increasingly be more like building a bigger airport or dam than like inventing something new (resources are the main constraint, not ideas; happy to discuss this further), (b) I think that things in the USA could deteriorate, eating away at the advantage the USA has, and (c) I think algorithmic innovations created in the USA will make their way to China in less than a year on average, through various means.

Your model of influence is interesting, and different from mine. Mine is something like: "For me to positively influence the world, I need to produce ideas which then spread through a chain of people to someone important (e.g. someone building AI, or deciding whether to deploy AI). I am separated from important people in the USA by fewer degrees of separation, and moreover the links are much stronger (e.g. my former boss lives in the same house as a top researcher at OpenAI), compared to important people in China. Moreover it's just inherently more likely that my ideas will spread in the US network than in the Chinese network because my ideas are in English, etc. So I'm orders of magnitude more likely to have a positive effect in the USA than in China. (But, in the long run, there'll be fewer important people in the USA, and they'll be more degrees of separation away from me, and a greater number of poseurs will be competing for their attention, so this difference will diminish.)" Mine seems more intuitive/accurate to me so far.

Comment by daniel-kokotajlo on Daniel Kokotajlo's Shortform · 2020-11-14T12:03:09.190Z · LW · GW

4 is correct. :/

Comment by daniel-kokotajlo on A Correspondence Theorem in the Maximum Entropy Framework · 2020-11-14T07:31:18.156Z · LW · GW

OK, sounds good.

I'm not sure either, but it seems true to me. Here goes an intuition-conveying attempt... First, the question of what counts as your data seems like a parameter that must be pinned down one way or another; as you mention, there are clearly wrong ways to do it, and meanwhile it's an open philosophical controversy, so on those grounds alone it seems plausibly relevant to building an aligned AI, at least if we are doing it in a principled way rather than through prosaic (i.e. we do an automated search for it) methods. Second, one's views on what sorts of theories fit the data depend on what you think your data is. Disputes about consciousness often come down to this, I think. If you want your AI to be physicalist rather than idealist or Cartesian dualist, you need to give it the corresponding notion of data. And what kind of physicalist? Etc. Or you might want it to be uncertain and engage in philosophical reasoning about what counts as its data... which also sounds like something one has to think about; it doesn't come for free when building an AI. (It does come for free if you are searching for an AI.)

Comment by daniel-kokotajlo on What considerations influence whether I have more influence over short or long timelines? · 2020-11-13T20:38:01.590Z · LW · GW

OK, cool. Well, I'm still a bit confused about why my status matters for this--it's relative influence that matters, not absolute influence. Even though my absolute influence may be low, it seems higher in the US than in Asia, and thus higher in short-timelines scenarios than long-timelines scenarios. Or so I'm thinking. (Because, as you say, my influence flows through the community.)

You might be right about the long game thing. I agree that we'll learn more and grow more in size and wealth over time. However, I think (a) the levers of the world will shift away from the USA, (b) the levers of the world will shift away from OpenAI and DeepMind and towards more distributed giant tech companies and government projects advised by prestigious academics (in other words, the usual centers of power and status will have more control over time; the current situation is an anomaly) and (c) various other things might happen that effectively impose a discount rate.

So I don't think the two ways of looking at the rationalist community are in conflict. They are both true. It's just that I think considerations a+b+c outweigh the improvement in knowledge, wealth, size etc. consideration.

Comment by daniel-kokotajlo on A Correspondence Theorem in the Maximum Entropy Framework · 2020-11-13T08:59:18.640Z · LW · GW

On policy implications: I think that the new theory almost always generates at least some policy implications. For example, relativity vs. Newton changes how we design rockets and satellites. Closer to home, multiverse theory opens up the possibility of (some kinds of) acausal trade. I think "it all adds up to normality" is something that shouldn't be used to convince yourself that a new theory probably has the same implications; rather, it's something that should be used to convince yourself that the new theory is incorrect, if it seems to add up to something extremely far from normal, like paralysis or fanaticism. If it adds up to something non-normal but not that non-normal, then it's fine.

I brought up those people as an example of someone you probably disagree with. My purpose was to highlight that choices need to be made about what your data is, and different people make them differently. (For an example closer to home, Solomonoff induction makes them differently than you do, I predict.) This seems to me like the sort of thing one should think about when designing an AI one hopes to align. Obviously if you are just going for capabilities rather than alignment you can probably get away with not thinking hard about this question.

Comment by daniel-kokotajlo on What considerations influence whether I have more influence over short or long timelines? · 2020-11-13T06:27:41.810Z · LW · GW

Thanks. The nonexistence of warning shots is not in my control, but neither is the existence of a black hole headed for earth. I'm justified in acting as if there isn't a black hole, because if there is, we're pretty screwed anyway. I feel like maybe something similar is true (though to a lesser extent) of warning shots, but I'm not sure. If we have a 1% chance of success without warning shots and a 10% chance with warning shots, then I probably increase our overall chance of success more if I focus on warning shot scenarios.

Rudeness no problem; did I come across as arrogant or something?

I agree that that's the major variable. And that's what I had in mind when I said what I said: It seems to me that this community has more influence in short-timeline worlds than long-timeline worlds. Significantly more. Because long-timeline worlds involve AI being made by the CCP or something. But maybe I'm wrong about that! You seem to think that long-timeline worlds involve someone like you coming up with a new paradigm, and if that's true, then yeah maybe it'll still happen in the Bay after all. Seems somewhat plausible to me.

I agree that value of information is huge.

Comment by daniel-kokotajlo on What considerations influence whether I have more influence over short or long timelines? · 2020-11-12T11:43:51.083Z · LW · GW

See, this is an important consideration for me! Currently I am unsure what the balance is. Here are some reasons to think "we" have more influence over short timelines:

--I think takeoff is more likely to be fast the longer the timelines, because there's more hardware overhang and more probability that some new paradigm shift or insight precipitated the advance in AI capabilities. And in a fast takeoff we will have fewer warning shots, and warning shots are our best hope, I think.

--The longer it takes for TAI to arrive, the higher the chance that it gets built in Asia rather than the West, I think. And I for one have much more influence over the West.

If you can convince me that the balance of considerations favors me working on long-timelines plans (relative to my credences) I would be very grateful.

Comment by daniel-kokotajlo on A Correspondence Theorem in the Maximum Entropy Framework · 2020-11-12T11:19:36.852Z · LW · GW

This might be a good time to talk about different ways "it all adds up to normality" is interpreted.

I sometimes hear people use it in a stronger sense, to mean not just that the new theory must make the same successful predictions but that also the policy implications are mostly the same. E.g. "Many worlds has to add up to normality, so one way or another it still makes sense for us to worry about death, try to prevent suffering, etc." Correct me if I'm wrong, but this sort of thing isn't entailed by your proof, right?

There's also the issue of what counts as the data that the new theory needs to correctly predict. Some people think that "This is a table, damn it! Not a simulated table!" is part of their data that theories need to account for. What do you say to them?