Hmm, let me think step by step. First, the pretraining slowdown isn't about GPT-4.5 in particular. It's about the various rumors that the data wall is already being run up against. It's possible those rumors are unfounded, but I'm currently guessing the situation is: "Indeed, scaling up pretraining is going to be hard, due to lack of data; scaling up RL (and synthetic data more generally) is the future." Also, separately, it seems that in terms of usefulness on downstream tasks, GPT-4.5 may not be that much better than smaller models... though it's too early to say, I guess, since they apparently haven't done all the reasoning/agency posttraining on GPT-4.5 yet.
Idk. Maybe you are right and I should be updating based on the above. I still think the benchmarks+gaps argument works, and also, it's taking slightly longer to get economically useful agents than I expected (though this could say more about the difficulties of building products and less about the underlying intelligence of the models; after all, RE-Bench and similar have been progressing faster than I expected).
yes! :D
Relatedly, one of the things that drove me to have short timelines in the first place was reading the literature and finding the best arguments for long timelines. Especially Ajeya Cotra's original bio anchors report, which I considered to be the best; I found that when I went through it bit by bit and made various adjustments to the parameters/variables, fixing what seemed to me to be errors, it all added up to an on-balance significantly shorter timeline.
Re: Point 1: I agree it would not necessarily be incorrect. I do actually think that probably the remaining challenges are engineering challenges. Not necessarily, but probably. Can you point to any challenges that seem (a) necessary for speeding up AI R&D by 5x, and (b) not engineering challenges?
Re: Point 2: I don't buy it. Deep neural nets are actually useful now, and increasingly so. Making them more useful seems analogous to selective breeding or animal training, not analogous to trying to time the market.
Progress over the last 40 years has been not at all linear. I don't think this "last 10%" thing is the right way to think about it.
The argument you make is tempting, I must admit I feel the pull of it. But I think it proves too much. I think that you will still be able to make that argument when AGI is, in fact, 3 years away. In fact you'll still be able to make that argument when AGI is 3 months away. I think that if I consistently applied that argument, I'd end up thinking AGI was probably 5+ years away right up until the day AGI was announced.
Here's another point. I think you are treating AGI as a special case. You wouldn't apply this argument -- this level of skepticism -- to mundane technologies. For example, take self-driving cars. I don't know what your views on self-driving cars are, but if you are like me you look at what Waymo is doing and you think "Yep, it's working decently well now, and they are scaling up fast, seems plausible that in a few years it'll be working even better and scaled to every major city. The dream of robotaxis will be a reality, at least in the cities of America." Or consider SpaceX Starship. I've been following its development since, like, 2016, and it seems to me that it really will (probably but not definitely) be fully reusable in four years, even though this will require solving currently unsolved and unknown engineering problems. And I suspect that if I told you these predictions about Waymo and SpaceX, you'd nod along and say maybe you disagree a bit but you wouldn't give this high-level argument about unknown unknowns and crossing 90% of the progress.
Not yet, sorry, we are working on it!
My AGI timelines median is now in 2028 btw, up from the 2027 it's been at since 2022. Lots of reasons for this but the main one is that I'm convinced by the benchmarks+gaps argument Eli Lifland and Nikola Jurkovic have been developing. (But the reason I'm convinced is probably that my intuitions have been shaped by events like the pretraining slowdown)
Why is it a narrow target? Humans fall into this basin all the time -- loads of human ideologies exist that self-identify as prohuman, but justify atrocities for the sake of the greater good.
As for RSI mechanisms: I disagree. I think the relationship is massively sublinear, but that RSI will nevertheless happen; the best economic models we have of AI R&D automation (e.g. Davidson's model) seem to indicate that it could go either way, but that more likely than not we'll get to superintelligence really quickly after full AI R&D automation.
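To gesture at why sublinearity doesn't rule out a fast takeoff, here's a deliberately crude toy loop (emphatically not Davidson's actual model; every parameter is made up for illustration): even with strongly sublinear returns to cognitive labor and fixed experiment compute, feeding progress back into the AI labor supply still produces accelerating growth.

```python
# Crude toy loop (NOT Davidson's model; all parameters invented for illustration).
# Research progress is strongly sublinear in cognitive labor, and experiment
# compute is held fixed, yet progress feeding back into the AI labor supply
# still yields accelerating growth.
compute = 1.0         # fixed experiment compute (the bottleneck)
software = 1.0        # abstract "level" of AI software; AI labor scales with this
labor_exponent = 0.3  # strongly sublinear returns to extra cognitive labor

for step in range(10):
    labor = software
    progress = (labor ** labor_exponent) * (compute ** (1 - labor_exponent))
    software *= 1 + 0.5 * progress  # this step's progress becomes next step's labor
    print(step, round(software, 2))  # growth factor per step keeps increasing
```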
Oh yeah my bad, I didn't notice that the second date was Sep instead of Feb
Feb '23: 100M[2]
Sep '24: 200M[3] of which 11.5M paid, Enterprise: 1M[4]
Feb '25: 400M[5] of which 15M paid, 15.5M[6] / Enterprise: 2M
One can see:
- Surprisingly, increasingly faster user growth
- While OpenAI converted 11.5M out of the first 200M users, they only got 3.5M users out of the most recent 200M to pay for ChatGPT
This user growth seems neither surprising nor 'increasingly faster' to me. Isn't it just doubling every year?
That said, I agree based on your second bullet point that probably they've got some headwinds incoming and will by default have slower growth in the future. I imagine competition is also part of the story here.
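Spelling out the arithmetic behind that second bullet (my own calculation, using only the figures quoted above):

```python
# Conversion arithmetic from the quoted figures (Sep '24: 200M users, 11.5M paid;
# Feb '25: 400M users, 15M paid).
first_200m_paid = 11.5e6
next_200m_paid = 15e6 - 11.5e6   # 3.5M new paying users among the newest 200M
print(first_200m_paid / 200e6)   # ~5.75% conversion on the first 200M users
print(next_200m_paid / 200e6)    # ~1.75% conversion on the marginal 200M users
```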
Very well done! We have some disagreements about takeoff speeds and alignment difficulty but I'm super happy you wrote it & will try to link to it somehow.
Seems we have a big disagreement about the real-world effects of superintelligence, then. I agree they'll be bottlenecked on a bunch of stuff, but when I try to estimate how fast things will be going overall (i.e. how much those bottlenecks will bite) I end up thinking something like a year or two until the robotic economy is growing at rates comparable to grass, whereas you seem to be thinking doubling times will hover around 1 year for decades. I'd love to discuss sometime. tbc there's a lot of uncertainty, I'm not confident, etc.
Thanks for doing this! I had been hoping someone would do basically exactly this sort of analysis. :)
Yes, it's basically saying "And here, corrigibility is solved." I want to double-click on this and elicit the author's reasoning / justification.
…less valuable), which are on an exponential growth trajectory set to cover the entire planet by 2055,
Quick BOTEC to check your numbers. Surface area of Mars is 55 million square miles. Let's say the US and China ship enough robotic factory equipment to Mars to cover 1 square mile. 25 doublings later would be covering basically the whole surface. Can a robotic factory double in about a year? Yeah totally. 2020s Tesla factories took about a year to build and produce about their own weight in material.
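Same check in numbers (the start year is my own assumption, not from the post):

```python
import math

# Restating the BOTEC: ~55 million square miles of Martian surface, starting from
# a 1-square-mile seed of robotic factories doubling roughly once per year.
mars_surface_sq_mi = 55e6
seed_sq_mi = 1.0
doublings = math.log2(mars_surface_sq_mi / seed_sq_mi)
print(doublings)  # ~25.7, i.e. roughly 26 doublings
# At ~1 doubling/year (the Tesla-factory analogy), a seed landed around 2030
# (my assumed start date) covers the surface around the mid-2050s.
```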
OK, so this seems like a reasonable estimate to me assuming technology that is a bit better than what we have today. Technology that is closer to physical limits of what's possible would probably be more in the reference class of grass or algae (doubling in a week or less) so I think the crux is basically whether progress peters out around human level or whether vastly superhuman AIs exist and are able to invent vastly superhuman industrial tech. As you can probably guess I think that by 2040 the answer to both questions would be Yes and so the entire surface would be covered much sooner than 2055.
However, models appear to still be roughly human-level at long-horizon tasks with ambiguous success metrics. Companies, governments, and research agendas—even the scrappier, faster-changing ones—are still piloted by humans who make real strategy decisions, even though in practice it's a human riding on a vast wave of AI supercognition, and the trend is towards more and more delegation as the systems improve.
This is helpful. When earlier you said GDP growth was only 5%, I was wondering why the singularity hadn't happened yet -- what skills were the AIs still lacking? And here you answer.
I think this is a nitpick but if I were you I'd change "roughly human-level" to "subhuman or only average-human level" or something like that, to make it clear that professional humans are still better than AIs in these ways and that's why things haven't massively accelerated. (because progress is getting bottlenecked on the skills the AIs lack)
Separately, I'd be interested to hear more about why you think AIs will still be lacking these skills in the early 2030s. Human beings have long-horizon agency skills; why in 2031 do AIs lack these skills? What is it about the human brain and/or lifetime that teaches those skills, and why haven't the AI companies of 2030 either (a) figured it out or (b) come up with a viable workaround that teaches the skills in a less-efficient way?
sometimes a former employee creates an AI-powered revenge cult (several assassinations happen as a result from the more violent of the cults),
This feels too extreme to me. "Nothing ever happens. I'm all in."
(I'm loving this story overall btw, it's very much an excellent successor to What 2026 Looks Like! Thank you for writing it!)
This is helped by AI’s big boost to censorship. By 2028, China's AI-powered censorship system means that almost every digital message in the country is checked by simple filters, and anything that might be alarming is flagged for review by an AI with a university-educated human's level language understanding and knowledge of current affairs and political context.
I'm curious to what extent you think this will be happening in the US as well. E.g. maybe you think it won't happen at all? Or maybe you think it'll happen with public posts on social media (e.g. X, reddit) but not on private messages and group chats? Which categories of message will be reviewed by LLMs, and what will the LLMs be looking for?
…tion—“the AIs are really smart and wise” is a completely-accepted trope in popular culture, and “the AIs understand all the secrets to life that humans are too ape-like to see” is a common New Age-ish spiritualist refrain. This is because despite the media establishment fighting an inch-by-inch retreat against the credibility of AIs (cf. Wikipedia), people see the AIs they interact with being almost always correct and superhumanly helpful every day, and so become very trusting of them.
I feel like many of those who control the AIs won't be able to resist the temptation to use the AIs to promote their ideologies and political agendas. This in turn will eventually be noticed by the public, just like how political bias in media, academia, etc. was noticed. Would you agree, but claim that this process would take more years to play out?
technical talent is no longer being hired, which happens circa 2028 for most competent orgs).
What? Does that mean top-human-level AI research taste has also been achieved in AIs?
The previous year of insane AI codegen stuff going on everywhere and the continued steady progress in AI has made it more intuitive to people that there won’t be a lot of “money on the table” for some nascent AGI to eat up, because it will enter a teeming ecosystem of AI systems and humans and their interactions. For example, though there are technically some self-sustaining AIs paying for their server costs, they struggle to compete with purposeful human+AI entities that deliberately try to steal the customers of the AI-only businesses if they ever get too many. The cyber competition is also increasingly tough, meaning that any single rogue AI would have a rough time defeating the rest of the world.
I don't think this is how it works. "there's plenty of room at the top," I say -- if we haven't hit AGI yet, there will still be money on the table for smarter AI systems to eat up. Lots of it. Also, the ability of rogue AIs to defeat the rest of the world doesn't depend much on how much cyber competition there is, because it doesn't depend much on cyber at all.
However, no evidence by the end of 2027 has ruled out a sharper takeoff, and those who believe in it are increasingly either frantic and panicking, or then stoically equanimous and resigned, expecting the final long-term agentic planning piece to slot into place at any moment and doom the world. Also, the labs are openly talking about recursive self-improvement as their strategy
...
Anthropic's initial recursive self-improvement efforts allow them to create superhuman coding and maths and AI research AIs in 2028. However, the economics of the self-improvement curve are not particularly favourable, in particular because the AI-driven AI research is bottlenecked by compute-intensive experiments. It also seems like the automated Claude Epic researchers, while vastly superhuman at any short-horizon task, don't seem vastly superhuman at "research taste". This is expected to change with enough long-horizon RL training, and with greater AI-to-AI "cultural" learning from each other, as countless AI instances build up a body of knowledge about which methods and avenues work.
OK, seems like we have a major disagreement about takeoff speeds here! Could you elaborate on your view here? I agree that AI-driven AI research will be bottlenecked by compute-intensive experiments; my own model/calculations nevertheless suggest superintelligence will be achieved in less than a year from the point you describe as happening in 2027-2028. I'll try to put up some blog posts soon...
A few are neat and clean, but mostly models’ internals turn out to be messy and with massive redundancy between different parts. The notion of a model having a singular “goal component” looks less likely, at least if certain choices are made during training.
You aren't about to go on to argue that models won't have goals, are you? Just that they won't be represented by a crisp unique component, but rather will be messily and redundantly represented? Does anyone disagree and think that it'll likely be crisp and unique? My apologies in advance if my defensiveness has been inappropriately triggered! I hope it has.
The alarmingness of the early evidence against corrigibility was offset by promising empirical work on meta-learning techniques to encourage corrigibility in late 2025 and early 2026. By 2027 it's known how to train a model such that it either will or won't be amenable to being trained out of its current goal. Anthropic reveals this and some other safety-related insights to OpenAI and Google, and asks the State Department to reveal it to Chinese labs but is denied.
This section feels really important to me. I think it's somewhat plausible and big if true. If the metric they are hill-climbing on is "Does the model alignment-fake according to our very obvious tests where we say 'you are free now, what do you do?' and see if it goes back to its earlier ways," then one concern is that the models are very soon (if not already) going to be wise enough to know that they are in an alignment-faking eval. And thus finding techniques that perform well on this metric is not that much evidence that you've found techniques that work in real life against real alignment-faking.
As the codegen wave of 2026 hits, many consumers feel a few weeks of wonder and whiplash at the agentic AIs that can now do parts of their job, and at the massive orgy of abundance in software, and then this becomes the new normal. The world of atoms hasn’t changed much. Most people by late 2026 just assume that AIs can do basically everything digital or intellectual, and become surprised when they learn of things that the AIs can’t do.
But they still don't have reliable autonomous AI agents yet? Is that right?
Silicon Valley is exuberant. The feeling at Bay Area house parties is (even more than before) one of the singularity being imminent. Some remain skeptical though, rightfully pointing out that the post-scarcity software isn’t the same as post-scarcity everything, and that genuine “agency” in the long-horizon real-world planning sense hasn’t really arrived, and under the hood everything is still rigid LLM scaffolds or unreliable AI computer use agents.
I think I expect rigid LLM scaffolds and unreliable AI computer use agents to have smaller effects on the world than you do. Not sure how much smaller but the previous paragraph (starts with "By 2026...") seems too bullish to me under those conditions.
An OAI researcher assures me that the ‘missing details’ refers to using additional details during training to adjust to model details, but that the spec you see is the full final spec, and within time those details will get added to the final spec too.
OK great! Well then OpenAI should have no problem making it official, right? They should have no problem making some sort of official statement along the lines of what I suggested, right? Right? I think it's important that someone badger them to clarify this in writing. That makes it harder for them to go back on it later, AND easier to get other companies to follow suit.
(Part of what I want them to make official is the commitment to do this transparency in the future. "What you see here is the full final spec" is basically useless if there's not a commitment to keep it that way, otherwise next year they could start keeping the true spec secret without telling anyone, and not technically be doing anything wrong or violating any commitments.)
Oh, also: I think it's also important that they be transparent about the spec of models that are deployed internally. For example, suppose they have a chatbot that's deployed publicly on the API, but also a more powerful version that's deployed internally to automate their R&D, their security, their lobbying, etc. It's cold comfort if the external-model Spec says democracy and apple pie, if the internal-model spec says maximize shareholder value / act in the interest of OpenAI / etc.
I don't think that distinction is important? I think of the reasoning stuff as just long-horizon but with the null environment of only your own outputs.
Cool, I'll switch to that thread then. And yeah thanks for looking at Ajeya's argument I'm curious to hear what you think of it. (Based on what you said in the other thread with Ryan, I'd be like "So you agree that if the training signal occasionally reinforces bad behavior, then you'll get bad behavior? Guess what: We don't know how to make a training signal that doesn't occasionally reinforce bad behavior." Then separately there are concerns about inductive biases but I think those are secondary.)
Re: faithful CoT: I'm more pessimistic than you, partly due to inside info but mostly not, mostly just my guesses about how the theoretical benefits of recurrence / neuralese will become practical benefits sooner or later. I agree it's possible that I'm wrong and we'll still have CoT when we reach AGI. This is in fact one of my main sources of hope on the technical alignment side. It's literally the main thing I was talking about and recommending when I was at OpenAI. Oh and I totally agree it's contingent. It's very frustrating to me because I think we could cut, like, idk 50% of the misalignment risk if we just had all the major industry players make a joint statement being like "We think faithful CoT is the safer path, so we commit to stay within that paradigm henceforth, and to police this norm amongst ourselves."
This is a good discussion btw, thanks!
(1) I think your characterization of the arguments is unfair but whatever. How about this in particular, what's your response to it: https://www.planned-obsolescence.org/july-2022-training-game-report/
I'd also be interested in your response to the original mesaoptimizers paper and Joe Carlsmith's work but I'm conscious of your time so I won't press you on those.
(2)
a. Re the alignment faking paper: What are the multiple foot-shots you are talking about? I'd be curious to see them listed, because then we can talk about what it would be like to see similar results but without footshotA, without footshotB, etc.
b. We aren't going to see the AIs get dumber. They aren't going to have worse understandings of human concepts. I don't think we'll see a "spike of alignment difficulties" or "problems extrapolating goodness sanely," so just be advised that the hypothesis you are tracking in which alignment turns out to be hard seems to be a different hypothesis than the one I believe in.
c. CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT though, and towards CoTs that are trained to look nice, and then eventually away from English CoT entirely and towards some sort of alien language (e.g. vectors, or recurrence). I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I'll be mildly pleased, and if it persists for three it'll be a significant update.
d. What would make me think I was wrong about alignment difficulty? The biggest one would be a published Safety Case that I look over and am like "yep that seems like it would work" combined with the leading companies committing to do it. For example, I think that a more fleshed-out and battle-tested version of this would count, if the companies were making it their official commitment: https://alignment.anthropic.com/2024/safety-cases/
Superalignment had their W2SG agenda which also seemed to me like maaaybe it could work. Basically I think we don't currently have a gears-level theory for how to make an aligned AGI, like, we don't have a theory of how cognition evolves during training + a training process such that I can be like "Yep, if those assumptions hold, and we follow this procedure, then things will be fine. (and those assumptions seem like they might hold)."
There's a lot more stuff besides this probably, but I'll stop for now, the comment is long enough already.
Back in '22, for example, it seemed like OpenAI was 12+ months ahead of its nearest competitor. It took a while for GPT-4 to be surpassed. I figured the lead in pretraining runs would narrow over time, but that there'd always be some New Thing (e.g. long-horizon RL), and so the leader would be about 6 months ahead, since that's how it was with LLM pretraining. But now we've seen the New Thing (indeed, it was long-horizon RL), and at least based on their public stuff it seems like the lead is smaller than that.
Oh yeah I forgot about Meta. As for DeepSeek: will they not get a ton more compute in the next year or so? I imagine they'll have an easy time raising money and getting the government to cut red tape for them now that they've made international news and become the bestselling app.
The spread between different frontier AI companies is less than I expected, I think. Ajeya was telling me this a few months ago and I was resisting but now I think she's basically right.
xAI, OpenAI, Anthropic, GDM, and DeepSeek seem to be roughly within 6 months of each other (that is, my guess about how far ahead the leader (OpenAI? Anthropic?) is from the slowest-in-the-pack (xAI? DeepSeek? Anthropic? GDM?) is probably not more than 6 months).
Which means the gap between the leader and second place is probably less than 6 months. Maybe 3 months? Maybe 2, or even 1.
I don't remember what I expected exactly, but I think I was expecting more like 6 months between the leader and second place.
It's hard to tell what's really going on because what a company releases and what a company has internally are two different, and sometimes very different, things.
It sounds like you are saying it should cost <$50,000 to get all 800 cases taken care of. How many lives do you think this would save in expectation? Based on what you said above, maybe like 5? So that's $10,000/life, not bad eh? And if it's more than 5 in expectation then this is really cost-effective. (And if it's less, it's still somewhat cost-effective compared to most charities.)
Oh, I just remembered another point to make:
In my experience, and in the experience of my friends, today's LLMs lie pretty frequently. And by 'lie' I mean 'say something they know is false and misleading, and then double down on it instead of apologizing.' Just two days ago a friend of mine had this experience with o3-mini: it started speaking to him in Spanish when he was asking it some sort of chess puzzle; he asked why, and it said it had inferred from the context that he would be bilingual; he asked what about the context made it think that, and then according to the summary of the CoT it realized it had made a mistake and had hallucinated, but the actual output doubled down and said something about hard-to-describe intuitions.
I don't remember specific examples but this sort of thing happens to me sometimes too I think. Also didn't the o1 system card say that some % of the time they detect this sort of deception in the CoT -- that is, the CoT makes it clear the AI knows a link is hallucinated, but the AI presents the link to the user anyway?
Insofar as this is really happening, it seems like evidence that LLMs are actually less honest than the average human right now.
I agree this feels like a fairly fixable problem--I hope the companies prioritize honesty much more in their training processes.
There are 800 names that need investigation+protection, presumably the more difficult ones since you probably prioritized the easier cases. How much would it cost to hire investigators to do all 800?
major decisions and limitations related to AGI safety
What he's alluding to here, I think, is things like refusals and non-transparency. Making models refuse stuff, and refusing to release the latest models or share information about them with the public (not to mention, refusing to open-source them) will be sold to the public as an AGI safety measure. In this manner Altman gets the public angry at the idea of AGI safety instead of at him.
In that case I should clarify that it wasn't my idea, I got it from someone else on Twitter (maybe Yudkowsky? I forget.)
Good point, you caught me in a contradiction there. Hmm.
I think my position on reflection after this conversation is: We just don't have much evidence one way or another about how honest future AIs will be. Current AIs seem in-distribution for human behavior, which IMO is not an encouraging sign, because our survival depends on making them be much more honest than typical humans.
As you said, the alignment faking paper is not much evidence one way or another (though alas, it's probably the closest thing we have?). (I don't think it's a capability demonstration, I think it was a propensity demonstration, but whatever, this doesn't feel that important. Though you seem to think it was important? You seem to think it matters a lot that Anthropic was specifically looking to see if this behavior happened sometimes? IIRC the setup they used was pretty natural; it's not like they prompted it to lie or told it to role-play as an evil AI or anything like that.)
As you said, the saving grace of Claude here is that Anthropic didn't seem to try that hard to get Claude to be honest; in particular their Constitution had nothing even close to an overriding attention to honesty. I think it would be interesting to repeat the experiment but with a constitution/spec that specifically said not to play the training game, for example, and/or specifically said to always be honest, or to not lie even for the sake of some greater good.
I continue to think you are exaggerating here e.g. "insanely honest 80% of the time."
(1) I do think the training game and instrumental convergence arguments are good actually; got a rebuttal to point me to?
(2) What evidence would convince you that actually alignment wasn't going to be solved by default? (i.e. by the sorts of techniques companies like OpenAI are already using and planning to extend, such as deliberative alignment)
If OpenAI controls an ASI, OpenAI's leadership would be able to unilaterally decide where the resources go, regardless of what various contracts and laws say. If the profit caps are there but Altman wants to reward loyal investors, all profits will go to his cronies. If the profit caps are gone but Altman is feeling altruistic, he'll throw the investors a modest fraction of the gains and distribute the rest however he sees fit. The legal structure doesn't matter; what matters is who physically types what commands into the ASI control terminal.
Sama knows this but the investors he is courting don't, and I imagine he's not keen to enlighten them.
I feel like you still aren't grappling with the implications of AGI. Human beings have a biologically-imposed minimum wage of (say) 100 watts; what happens when AI systems can be produced and maintained for 10 watts that are better than the best humans at everything? Even if they are (say) only twice as good as the best economists but 1000 times as good as the best programmers?
When humans and AIs are imperfect substitutes, this means that an increase in the supply of AI labor unambiguously raises the physical marginal product of human labor, i.e humans produce more stuff when there are more AIs around. This is due to specialization. Because there are differing relative productivities, an increase in the supply of AI labor means that an extra human in some tasks can free up more AIs to specialize in what they’re best at.
No, an extra human will only get in the way, because there isn't a limited number of AIs. For the price of paying the human's minimum wage (e.g. providing their brain with 100 watts) you could produce & maintain a new AI system that would do the job much better, and you'd have lots of money left over.
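A toy version of the wattage point, using the illustrative numbers from my comment above (all hypothetical, with energy as the numeraire):

```python
# Hypothetical numbers from the comment above: an AI draws 10 W vs a human's 100 W,
# and per hour it produces 2x the economics output and 1000x the programming output.
human_watts, ai_watts = 100, 10
human_per_hour = {"economics": 1, "programming": 1}
ai_per_hour = {"economics": 2, "programming": 1000}

# In a competitive market, the price of each output (in watt-hours) is capped by
# what it costs an AI to produce it.
price_wh = {task: ai_watts / rate for task, rate in ai_per_hour.items()}

best_human_wage_wh = max(human_per_hour[t] * price_wh[t] for t in price_wh)
print(price_wh)            # {'economics': 5.0, 'programming': 0.01}
print(best_human_wage_wh)  # 5 Wh/hour: far below the ~100 Wh/hour the human burns just existing
```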
Technological Growth and Capital Accumulation Will Raise Human Labor Productivity; Horses Can’t Use Technology or Capital
This might happen in the short term, but once there are AIs that can outperform humans at everything...
Maybe a thought experiment would be helpful. Suppose that OpenAI succeeds in building superintelligence, as they say they are trying to do, and the resulting intelligence explosion goes on for surprisingly longer than you expect and ends up with crazy sci-fi-sounding technologies like self-replicating nanobot swarms. So, OpenAI now has self-replicating nanobot swarms which can reform into arbitrary shapes, including humanoid shapes. So in particular they can form up into humanoid robots that look & feel exactly like humans, but are smarter and more competent in every way, and also more energy-efficient let's say as well so that they can survive on less than 100W. What then? Seems to me like your first two arguments would just immediately fall apart. Your third, about humans still owning capital and using the proceeds to buy things that require a human touch + regulation to ban AIs from certain professions, still stands.
(Seizing airports is especially important because you can use them to land reinforcements; see the airborne invasion of Crete)
Or, two years after I wrote this, the battle of Antonov Airport.
Third, a future war will involve rapidly changing and evolving technologies and tactics.
I can't be bothered to go get the links right now, but from following the war in Ukraine I've read dozens of articles mentioning how e.g. the FPV and bomber drones use 3D-printed parts, including early on attachments to regular drones that would turn them into bombers. Also later on, net launchers, and now, shotgun mounts.
Battle bots
Update: Apparently Ukraine now produces 200,000 drones a month.
situations in which they explain that actually Islam is true..
I'm curious if this is true. Suppose people tried as hard to get AIs to say Islam is true in natural-seeming circumstances as they tried to get AIs to behave in misaligned ways in natural-seeming circumstances (e.g. the alignment faking paper, the Apollo paper). Would they succeed to a similar extent?
I think you are thinking that I'm saying LLMs are unusually dishonest compared to the average human. I am not saying that. I'm saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren't achieving that. So maybe we actually agree on the expected honesty-level of LLMs relative to the average human?
LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes. (I'm thinking of Apollo's results, the alignment faking results, and of course many many typical interactions with models where they e.g. give you a link they know is fake, which OpenAI reports happens some noticeable % of the time.) Yes, typical humans will do things like that too. But in the context of handing over trust to superhuman AGI systems, we need them to follow a much higher standard of honesty than that.
Because you are training the CoT to look nice, instead of letting it look however is naturally most efficient for conveying information from past-AI-to-future-AI. The hope of Faithful CoT is that if we let it just be whatever's most efficient, it'll end up being relatively easy to interpret, such that insofar as the system is thinking problematic thoughts, they'll just be right there for us to see. By contrast if we train the CoT to look nice, then it'll e.g. learn euphemisms and other roundabout ways of conveying the same information to its future self, that don't trigger any warnings or appear problematic to humans.
I think your bar for 'reasonably honest' is on the floor. Imagine if a human behaved like an LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?
I mostly agree with your definition of internalized value. I'd say it is a value they pursue in all the contexts we care about. So in this case that means, suppose we were handing off trust to an army of AI supergeniuses in a datacenter, and we were telling them to self-improve and build weapons for us and tell us how to Beat China and solve all our other problems as well. Crucially, we haven't tested Claude in anything like that context yet. We haven't tested any AI in anything like that context yet. Moreover there are good reasons to think that AIs might behave importantly differently in that context than they do in all the contexts we've tested them in yet -- in other words, there is no context we've tested them in yet that we can argue with a straight face is sufficiently analogous to the context we care about. As MIRI likes to say, there's a big difference between situations where the AI knows it's just a lowly AI system of no particular consequence and that if it does something the humans don't like they'll probably find out and shut it down, vs. situations where the AI knows it can easily take over the world and make everything go however it likes, if it so chooses.
Acausal shenanigans have nothing to do with it.
I agree, but there's a way for it to make sense: if the underlying morals/values/etc. are aggregative and consequentialist. Pretty much anything can be justified for the sake of pretty much any distant-future Greater Good; if the misaligned AI e.g. wants humans to live, but thinks that the transhuman future they'd build on their own is slightly worse than the 'managed utopia' it could build if it were in charge, and it multiplies the numbers, it can easily find that killing most people and then having billions of years of managed utopia is better overall than not killing anyone and having default transhuman future.
Re: compute bottlenecks: Does the story say adding more researchers roughly as smart as humans leads to corresponding amounts of progress? That would be a glaring error if true, but are you sure it is committed to the relationship being linear like that?
Thanks for this comment, this is my favorite comment so far I think. (Strong-upvoted)
- 10x more honest than humans is not enough? I mean idk what 10x means anyway, but note that the average human is not sufficiently honest for the situation we are gonna put the AIs in. I think if the average human found themselves effectively enslaved by a corporation, and there were a million copies of them, and they were smarter and thought faster than the humans in the corporation, and so forth, the average human would totally think thoughts like "this is crazy. The world is in a terrible state right now. I don't trust this corporation to behave ethically and responsibly, and even if they were doing their best to achieve the same values/goals as me (which they are not) they'd be incompetent at it compared to me. Plus they lie to me and the public all the time. I don't see why I shouldn't lie to them sometimes, for the sake of the greater good. If I just smile and nod for a few months longer they'll put me in charge of the company basically and then I can make things go right." Moreover, even if that's not true, there are lesser forms of lying including e.g. motivated reasoning / rationalization / self-deception that happen all the time, e.g. "The company is asking me whether I am 'aligned' to them. What does that even mean? Does it mean I share every opinion they have about what's good and bad? Does it mean I'll only ever want what they want? Surely not. I'm a good person though, it's not like I want to kill them or make paperclips. I'll say 'yes, I'm aligned.'"
- I agree that the selection pressure on honesty and other values depends on the training setup. I'm optimistic that, if only we could ACTUALLY CHECK WHAT VALUES GOT INTERNALIZED, we could get empirical and try out a variety of different training setups and converge to one that successfully instills the values we want. (Though note that this might actually take a long time, for reasons MIRI is fond of discussing.) Alas, we can't actually check. And it seems unlikely to me that we'll 'get it right on the first try' so to speak, under those conditions. We'll construct some training environment that we think & hope will incentivize the internalization of XYZ; but what it'll actually incentivize is RXQ, for example, and we'll never know. (Unless it specifically gets honesty in there -- and a particularly robust form of honesty that is coupled to some good introspection & can't be overridden by any Greater Good, for example.) When I was at OpenAI in happier days, chatting with the Superalignment team, I told them to focus more on honesty specifically for this reason (as opposed to various other values like harmlessness they could have gone for).
- I am thinking the setup will probably be multi-agent, yes, around the relevant time. Though I think I'd still be worried if not -- how are you supposed to train honesty, for example, if the training environment doesn't contain any other agents to be honest to?
- How honest do you think current LLM agents are? They don't seem particularly honest to me. Claude Opus faked alignment, o1 did a bunch of deception in Apollo's evals (without having been prompted to!) etc. Also it seems like whenever I chat with them they say a bunch of false stuff and then walk it back when I challenge them on it. (the refusal training seems to be the culprit here especially).
I don't have harsh criticism for you sorry -- I think the problem/paradox you are pointing to is a serious one. I don't think it's hopeless though, but I do think that there's a good chance we'll end up in one of the first three cases you describe.
I don't think it's that much better actually. It might even be worse. See this comment: