One underlying idea concerns how AI misalignment is expected to work. If superintelligent AI systems are misaligned, does this misalignment look like an inaccurate generalization from what their overseers wanted, or a 'randomly rolled utility function': a deceptively misaligned goal that's entirely unrelated to anything their overseers intended to train? This corresponds, more or less, to levels 1-4 vs. levels 5+ on my difficulty scale. If the misalignment is a result of economic pressures and a 'race to the bottom' dynamic, then it's more likely to result in systems that care about human welfare alongside other things.
If the misaligned AI ends up 'egregiously' misaligned and doesn't care at all about anything valuable to us, as Eliezer thinks is most likely, then it places zero terminal value on human welfare, and only trade, threats, or compromise would get it to be nice. If the AI is superintelligent and you aren't, none of those considerations apply. Hence, nothing is left for humans.
If the AI is misaligned but doesn't have an arbitrary value system, then it may value human survival at least a bit and do some equivalent of leaving a hole in the Dyson sphere.
For months, those who want no regulations of any kind placed upon themselves have hallucinated and fabricated information about the bill’s contents and intentionally created an internet echo chamber, in a deliberate campaign to create the impression that there is widespread opposition to SB 1047 and that SB 1047 would harm California’s AI industry.
There is another significant angle to add here. Namely: many of the people in this internet echo chamber, or behind this campaign, are part of the network of neoreactionaries, MAGA supporters, and unaccountability-seeking tech elites that you've positioned yourself as a substantial counterpoint to.
Obviously, invoking a culture war fight has its downsides, but it's not just rhetoric: the charge that many bill opponents are basing their decisions on an ideology Newsom opposes and sees as dangerous for the country is true.
A16z and many of the bill's other most dishonest opponents are part of a Trump-supporting network with close ties to neoreactionary thought, which opposes SB 1047 for precisely the same reason that they want Trump and Republicans to win: to remove restraints on their own power in the short-to-medium term, and more broadly because they see it as a step towards making our society into one where wealthy oligarchs are given favorable treatment and can get away with anything.
It also serves as a counterpoint against the defense and competition angle, at least when it's presented by a16z (this argument doesn't work for e.g. OpenAI, but there are many other good counterarguments). The claims they make about the bill harming competitiveness, e.g. for defense and security against China and other adversaries, ring hollow when most of them oppose Ukraine support or NATO, making it clear they don't generally care about the US maintaining its global leadership.
I think this might compel Newsom, who's positioned himself as an anti-MAGA figure.
I touched upon this idea indirectly in the original post when discussing alignment-related High Impact Tasks (HITs), but I didn't explicitly connect it to the potential for reducing implementation costs and you're right to point that out.
Let me clarify how the framework handles this aspect and elaborate on its implications.
Key points:
- Alignment-related HITs, such as automating oversight or interpretability research, introduce challenges and make the HITs more complicated: we need to ask what the difficulty is of aligning a system capable of automating the alignment of systems capable of achieving HITs!
- The HIT framing is flexible enough to accommodate the use of AI for accelerating alignment research, not just for directly reducing existential risk. If full alignment automation of systems capable of performing (non alignment related) HITs is construed as an HIT, the actual alignment difficulty corresponds to the level required to align the AI system performing the automation, not the automated task itself.
- In practice, a combination of AI systems at various alignment difficulty levels will likely be employed to reduce costs and risks for both alignment-related tasks and other applications. Partial automation and acceleration by AI systems can significantly impact the cost curve for implementing advanced alignment techniques, even if full automation is not possible.
- The cost curve presented in the original post assumes no AI assistance, but in reality, AI involvement in alignment research could substantially alter its shape. This is because the cost curve covers the cost of performing research "to achieve the given HITs", and since substantially automating alignment research is a possible HIT, by definition the cost graph is not supposed to include substantial assistance on alignment research.
However, that makes it unrealistic in practice, especially because (as indicated by the haziness on the graph) there will be many HITs, both accelerating alignment research and incrementally reducing overall risk.
To illustrate, consider a scenario where scalable oversight at level 4 of the alignment difficulty scale is used to fully automate mechanistic interpretability at level 6, and this level 6 system then goes on to, say, research a method of impregnable cyber-defense, develop rapid counter-bioweapon vaccines, give superhuman geopolitical strategic advice, and unmask any unaligned AI present on the internet.
In this case, the actual difficulty level would be 4, with the HIT being the automation of the level 6 technique that's then used to reduce risk substantially.
I do think that Apollo themselves were clear that this was showing that it had the mental wherewithal for deception, and that if you apply absolutely no mitigations then deception happens. That's what I said in my recent discussion of what this does and doesn't show.
Therefore I described the 4o case as an engineered toy model of a failure at level 4-5 on my alignment difficulty scale (e.g. the dynamics of strategically faking performance on tests to pursue a large scale goal), but it is not an example of such a failure.
In contrast, the AI scientist case was a genuine alignment failure, but that was a much simpler case of non-deceptive, non-strategic reward hacking: being given a sloppy goal by bad RLHF, just in a more sophisticated system than, say, CoinRun (level 2-3).
The hidden part that Zvi etc. skim over is that 'of course' in real life 'in the near future' we'll be in a situation where an o1-like model has instrumental incentives because it is pursuing an adversarial large-scale goal, and also the mitigations they could have applied (like prompting it better, doing better RLHF, doing process oversight on the chain of thought, etc.) won't work. But that's the entire contentious part of the argument!
One can make arguments that these oversight methods will break down e.g. when the system is generally superhuman at predicting what feedback its overseers will provide. However, those arguments were theoretical when they were made years ago and they're still theoretical now.
This does count against naive views that assume alignment failures can't possibly happen: there probably are those out there who believe that you have to give an AI system an "unreasonably malicious" rather than just "somewhat unrealistically single minded" prompt to get it to engage in deceptive behavior or just irrationally think AIs will always know what we want and therefore can't possibly be deceptive.
Good point. You're right to highlight the importance of the offense-defense balance in determining the difficulty of high-impact tasks, rather than alignment difficulty alone. This is a crucial point that I'm planning to expand on in the next post in this sequence.
Many things determine the overall difficulty of HITs:
- the "intrinsic" offense-defense balance in related fields (like biotechnology, weapons technology, and cybersecurity), and especially whether there are irresolvably offense-dominant technologies that transformative AI can develop and which can't be countered
- Overall alignment difficulty, affecting whether we should expect to see a large number of strategic, power seeking unaligned systems or just systems engaging in more mundane reward hacking and sycophancy.
- Technology diffusion rates, especially for anything offense dominant, e.g. should we expect frontier models to leak or be deliberately open sourced
- Geopolitical factors, e.g. are there adversary countries or large numbers of other well-resourced rogue actors to worry about, not just accidents, leaks, and random individuals
- The development strategy (e.g. whether the AI technologies are being proactively developed by a government or in public-private partnership or by companies who can't or won't use them protectively)
My rough suspicion is that all of these factors matter quite a bit, but since we're looking at "the alignment problem" in this post I'm pretending that everything else is held fixed.
The intrinsic offense-defense balance of whatever is next on the 'tech tree', as you noted, is maybe the most important overall, as it affects the feasibility of defensive measures and could push towards more aggressive strategies in cases of strong offense advantage. It's also extremely difficult to predict ahead of time.
Yes, I do think constitution design is neglected! I think it's possible people think constitution changes now won't stick around or that it won't make any difference in the long term, but I do think based on the arguments here that even if it's a bit diffuse you can influence AI behavior on important structural risks by changing their constitutions. It's simple, cheap and maybe quite effective especially for failure modes that we don't have any good shovel-ready technical interventions for.
What is moral realism doing in the same taxon with fully robust and good-enough alignment? (This seems like a huge, foundational worldview gap; people who think alignment is easy still buy the orthogonality thesis.)
Technically, even moral realism doesn't imply the anti-Orthogonality thesis! Moral realism is necessary but not sufficient for anti-Orthogonality: to be anti-Orthogonality, you have to be a particular kind of very hardcore Platonist moral realist who believes that 'to know the good is to do the good', and argue that not only are there moral facts but that these facts are intrinsically motivating.
Most moral realists would say that it's possible to know what's good but not act on it: even if this is an 'unreasonable' disposition in some sense, this 'unreasonableness' is compatible with being extremely intelligent and powerful in practical terms.
Even famous moral realists like Kant wouldn't deny the Orthogonality thesis: Kant would accept that it's possible to understand hypothetical but not categorical imperatives, and he'd distinguish capital-R Reason from simple means-end 'rationality'. Among moral realists, I think it's really only Platonists and divine command theorists who'd deny Orthogonality itself.
I don't believe that you can pass an Ideological Turing Test for people who've thought seriously about these issues and assign a decent probability to things going well in the long term, e.g. Paul Christiano, Carl Shulman, Holden Karnofsky and a few others.
The futures described by the likes of Carl Shulman, which I find relatively plausible, don't fit neatly into your categories but seem to be some combination of (3) (though you lump 'we can use pseudo aligned AI that does what we want it to do in the short term on well specified problems to navigate competitive pressures' in with 'AI systems will want to do what's morally best by default so we don't need alignment') and (10) (which I would phrase as 'unaligned AIs which aren't deceptively inner misaligned will place some value on human happiness and placating their overseers'), along with (4) and (6), (specifically that either an actual singleton or a coalition with strong verifiable agreements and regulation can emerge, or that coordination can be made easier by AI advice).
Both 3 and 10 I think are plausible when phrased correctly: that pseudo-aligned powerful AI can help governments and corporations navigate competitive pressures, and that AI systems (assuming they don't have some strongly deceptively misaligned goal from a sharp left turn) will still want to do things that satisfy their overseers among the other things they might want, such that there won't be strong competitive pressures to kill everyone and ignore everything their human overseers want.
In general, I'm extremely wary of arguing by 'default' in either direction. Both of these are weak arguments for similar reasons:
- The default outcome of "humans all have very capable AIs that do what they want, and the humans are otherwise free", is that humans gain cheap access to near-infinite cognitive labor for whatever they want to do, and shortly after a huge increase in resources and power over the world. This results in a super abundant utopian future.
- The default outcome of "humans all have very capable AIs that do what they want, and the humans are otherwise free", is that ‘the humans all turn everything over to their AIs and set them loose to compete’ because anyone not doing that loses out, on an individual level and on a corporate or national level, resulting in human extinction or dystopia via 'race to the bottom'
My disagreement is that I don't think there's a strong default either way. At best there is a (highly unclear) default towards futures that involve a race to the bottom, but that's it.
I'm going to be very nitpicky here but with good cause. "To compete" - for what, money, resources, social approval? "The humans" - the original developers of advanced AI, which might be corporations, corporations in close public-private partnership, or big closed government projects?
I need to know what this actually looks like in practice to assess the model, because the world is not this toy model, and the dis-analogies aren't just irrelevant details.
What happens when we get into those details?
We can try to make this specific, envision a concrete scenario involving a fairly quick full automation of everything, and think about (in a scenario where governments and corporate decision-makers have pseudo-aligned AI which will e.g. answer questions super-humanly well in the short term) a specific bad scenario, and then also imagine ways it could not happen. I've done this before, and you can do it endlessly:
Here’s one of the production web stories in brief, but you can read it in full, along with my old discussion, here:
In the future, AI-driven management assistant software revolutionizes industries by automating decision-making processes, including "soft skills" like conflict resolution. This leads to massive job automation, even at high management levels. Companies that don't adopt this technology fall behind. An interconnected "production web" of companies emerges, operating with minimal human intervention and focusing on maximizing production. They develop a self-sustaining economy, using digital currencies and operating beyond human regulatory reach. Over time, these companies, driven by their AI-optimized objectives, inadvertently prioritize their production goals over human welfare. This misalignment leads to the depletion of essential resources like arable land and drinking water, ultimately threatening human survival, as humanity becomes unable to influence or stop these autonomous corporate entities.
My object-level response is to say something mundane along the lines of, I think each of the following is more or less independent and not extremely unlikely to occur (each is above 1% likely):
- Wouldn’t governments and regulators also have access to AI systems to aid with oversight and especially with predicting the future? Remember, in this world we have pseudo-aligned AI systems that will more or less do what their overseers want in the short term.
- Couldn’t a political candidate ask their (aligned) strategist-AI ‘are we all going to be killed by this process in 20 years?’ and then, early in the process, run a persuasive campaign to change the public’s mind, using obvious evidence to their advantage?
- If the world is alarmed by the expanding production web and governments have a lot of hard power initially, why will enforcement necessarily be ineffective? If there’s a shadow economy of digital payments, just arrest anyone found dealing with a rogue AI system. This would scare a lot of people.
- We’ve already seen pessimistic views about what AI regulations can achieve be self-confessedly falsified at the 98% level - there’s sleepwalk bias to consider. As Stefan Schubert put it: "Yeah, if people think the policy response is '99th-percentile-in-2018', then that suggests their models have been seriously wrong." So maybe the regulations will be effective, foresightful, and well implemented, with AI systems foreseeing the long-run consequences of decisions and backing them up.
- What if the lead project is unitary and a singleton or the few lead projects quickly band together because they’re foresightful, so none of this race to the bottom stuff happens in the first place?
- If it gets to the point where water or the oxygen in the atmosphere is being used up (why would that happen again, why wouldn’t it just be easier for the machines to fly off into space and not have to deal with the presumed disvalue of doing something their original overseers didn’t like?) did nobody build in ‘off switches’?
- Even if they aren’t fulfilling our values perfectly, wouldn’t the production web just reach some equilibrium where it’s skimming off a small amount of resources to placate its overseers (since its various components are at least somewhat beholden to them) while expanding further and further?
And I already know the response is just going to be “Moloch wouldn’t let that happen…”, and that eventually competition will mean all of these barriers disappear. At this point, though, I think such a response is too broad and proves too much. Used this way, the Moloch idea becomes the classic mistaken "one big idea universal theory of history", which can explain nearly any outcome so long as it doesn't have to predict it.
A further point: I think that someone using this kind of reasoning in 1830 would have very confidently predicted that the world of 2023 would be a horrible dystopia where wages for workers wouldn’t have improved at all, because of Moloch.
You can call this another example of (11), i.e. assuming the default outcome will be fine and then arguing against a specific bad scenario, so that it doesn't affect the default, but that assumes what you're trying to establish.
I'm arguing that when you get into practical details of any scenario (assuming pseudo-aligned AI and no sharp left turn or sudden emergence), you can think of ways to utilize the vast new cognitive labor force available to humanity to preempt or deal with the potential competitive race to the bottom, the offense-defense imbalance, or other challenges, which messes up the neat toy model of competitive pressures wrecking everything.
When you try to translate the toy model of:
"everyone" gets an AI that's pseudo aligned --> "everyone" gives up control and lets the AI "compete" --> "all human value is sacrificed" which presumably means we run out of resources on the earth or are killed
into a real-world scenario by adding details about who develops the AI systems, what they want, and specific ways the AI systems could be used, we also get practical, real-world counterarguments as to why it might not happen. Things still get dicier the faster takeoff is and the harder alignment is, as we don't have this potential assistance to rely on and have to do everything ourselves, but we already knew that, and you correctly point out that nobody knows the real truth about alignment difficulty anyway.
To be clear, I still think that some level of competitive degradation is a default: there will be strong competitive pressure to delegate more and more decision-making to AI systems and take humans out of the loop. But this proceeding unabated towards systems that are pressured into not caring about their overseers at all, resulting in a world of 'perfect competition' and a race to the bottom that continues unimpeded until humans are killed, is a much weaker default in practice than you describe.
Treating a molochian race to the bottom as an overwhelmingly strong default ignores the complexity of real-world systems, the potential for adaptation and intervention, and the historical track record of similar predictions about complicated social systems.
Similarly, his ideas of things like ‘a truth seeking AI would keep us around’ seem to me like Elon grasping at straws and thinking poorly, but he’s trying.
The way I think about Elon is that he's very intelligent but essentially not open to any new ideas, or capable of self-reflection when his ideas are wrong, except on technical matters: if he can't clearly follow the logic himself on the first try, or if accepting it would initially be uncomfortable or difficult, then he won't believe you. But he is smart.
Essentially, he got one good idea about AI risk into his head 10+ years ago, and therefore says locally good things and isn't simply lying when he says them, but it doesn't hang together in his head in a consistent way (e.g. if he thought international stability and good AI regulation were good ideas, he wouldn't be supporting the candidate who wants to rethink all US alliances and would impair the federal government's ability to do anything new and complicated, with an e/acc as his running mate). In general, I think one of the biggest mental blind spots EA/rationalist types have is overestimating the coherence of people's plans for the future.
The Economist is opposed, in a quite bad editorial calling belief in the possibility of a catastrophic harm ‘quasi-religious’ without argument, and uses that to dismiss the bill, instead calling for regulations that address mundane harms. That’s actually it.
I find this especially strange, almost to the point that I'm willing to call it knowingly bad faith. The Economist has in the past sympathetically interviewed Helen Toner, done deep-dive investigations into mechanistic interpretability research at a higher level of analysis than I've seen from any other mainstream news publication, and run articles acknowledging the soft consensus among tech workers and AI experts on the dangers, including survey results, so it's doubly difficult to dismiss the risk as "too speculative" or "sci-fi".
To state without elaboration that the risk is "quasi-religious" or "science fictional", when their own journalists have consistently said the opposite and provided strong evidence that the AI world generally agrees, makes me feel like someone higher up changed their mind for some reason, regardless of what their own authors think.
The one more concrete reference they gave was to the very near-term (like the next year) prospect of AI systems being used to assist in terrorism, which has indeed been slightly exaggerated by some; but to claim that there's no idea whatsoever about where these capabilities could be in 3 years is absurd given what they themselves have said in previous articles.
Without some explanation as to why they think genuine catastrophic misuse concerns are not relatively near term and relatively serious (e.g. explaining why they think we won't see autonomous agents that could play a more active role in terrorism if freely available) it just becomes the classic "if 2025 is real why isn't it 2025 now" fallacy.
The short argument I've been using is:
If you want to oppose the bill, you as a matter of logical necessity have to believe some combination of:
- No significant near-term catastrophic AI risks exist that warrant this level of regulation.
- Significant near-term risks exist, but companies shouldn't be held liable for them (i.e. you're an extremist ancap)
- Better alternatives are available to address these risks.
- The bill will be ineffective or counterproductive in addressing these risks.
The best we get is vague hints that (1) is true from some tech leaders (though e.g. Google and OpenAI definitely don't believe (1)), or vague pie-in-the-sky appeals to (3), as if the federal government is working efficiently on frontier tech issues right now, or claims for (4) that either lie about the content of the bill, e.g. claiming it applies to small startups and academics, or fearmonger towards (4), like claiming every tech company in California will up sticks and leave, or that it will so impair progress that China will inevitably win, so the bill will not achieve its stated aims.
This seems like really valuable work! And while situational awareness isn't a sufficient condition for being able to fully automate many intellectual tasks, it seems like a necessary condition at least so this is already a much superior benchmark for 'intelligence' than e.g. MMLU.
I agree that this is a real possibility and in the table I did say at level 2,
Misspecified rewards / ‘outer misalignment’ / structural failures where systems don’t learn adversarial policies [2] but do learn to pursue overly crude and clearly underspecified versions of what we want, e.g. the production web or WFLL1.
From my perspective, it is entirely possible to have an alignment failure that works like this and occurs at difficulty level 2. This is still an 'easier' world than the higher levels, because in those worlds you can get killed in a much swifter and earlier way with far less warning.
The reason I wouldn't put it at level 8 is that presumably the models are following a reasonable proxy for what we want, which generalizes well beyond human level, but this proxy is inadequate in ways that become apparent later on. Level 8 says not that any misgeneralization occurs, but that rapid, unpredictable misgeneralization occurs around the human level, such that alignment techniques quickly break down.
In the scenario you describe, there'd be an opportunity to notice what's going on (after all you'd have superhuman AI that more or less does what it's told to help you predict future consequences of even more superhuman AI) and the failure occurs much later.
"OpenAI appears to subscribe to that philosophy [of 'bothsidesism']. Also there seems to be a 'popular opinion determines attention and truth' thing here?"
OpenAI's approach is well-intentioned but crude and might be counterproductive. The goal they should be aiming at is something best constructed as "have good moral and political epistemology", something people are notoriously bad at by default.
Being vaguely bothsidesist is a solution you see a lot from human institutions that don't want to look biased, so it's not an unusually bad solution by any means, but it's not good enough for high-stakes situations.
What should the goal be? Instead of just presenting "both sides", I think we should focus on making the AI acutely aware of the distinction between facts and values, and on bringing that distinction up especially in cases of values conflicts. Making sure the model explicitly identifies and separates empirical claims from value judgments means we can achieve better epistemics without resorting to false equivalences. Maybe for sufficiently unambiguous values that everyone shares we don't want to do this, but I think you should make the model biased towards saying "if X is what you value then do Y" whenever possible.
"This is weird. Why should the model need to spend tokens affirming that the user can believe what they wish? If information changes someone's mind, that is a feature."
Once again, I think what they're getting at is in principle good. I'd phrase it as: the model should be biased towards being decision-support oriented, not persuasive. The strategy of writing persuasive content and then tacking on "but believe what you want!" is indeed a cheap hack that doesn't solve the underlying issue. It would probably be better for the model to explicitly say when it's being persuasive and when it's not, to err on the side of not persuading whenever possible, but to always be "meta-honest" and upfront about what it thinks. That way we can at least be more assured it's just being used for decision assistance when that's all we want.
If you go with an assumption of good faith then the partial, gappy RSPs we've seen are still a major step towards having a functional internal policy to not develop dangerous AI systems because you'll assume the gaps will be filled in due course. However, if we don't assume a good faith commitment to implement a functional version of what's suggested in a preliminary RSP without some kind of external pressure, then they might not be worth much more than the paper they're printed on.
But, even if the RSPs aren't drafted in good faith and the companies don't have a strong safety culture (which seems to be true of OpenAI judging by what Jan Leike said), you can still have the RSP commitment rule be a foundation for actually effective policies down the line.
For comparison, if a lot of dodgy water companies sign on to a 'voluntary compact' to develop some sort of plan to assess the risk of sewage spills, then the risk is probably reduced by a bit, but it also makes it easier to develop better requirements later, for example by saying "Our new requirement is the same as last year's, but now you must publish your risk assessment results openly" and daring them to back out. You can encourage them to compete on PR by making their commitments more comprehensive than their opponents', creating a virtuous cycle, and it probably just draws more attention to the plans than there was before.
Maybe we have different definitions of DSA: I was thinking of it in terms of 'resistance is futile' and you can dictate whatever terms you want because you have overwhelming advantage, not that you could eventually after a struggle win a difficult war by forcing your opponent to surrender and accept unfavorable terms.
If, say, the US of 1965 was dumped into the post-WW2 Earth, it would have the ability to dictate whatever terms it wanted, because it would be able to launch hundreds of ICBMs at enemy cities at will. If the real US of 1949 had started a war against the Soviets, it could probably have crippled an advance into western Europe, but it likely wouldn't have been able to get its bombers through to devastate enough of the Soviet homeland with the few bombs it had.
Remember, the Soviets had just lost a huge percentage of their population and industry in WW2 and kept fighting. The fact that it's at all debatable who would have won if WW3 had started in the late 1940s (see e.g. here) makes me think nuclear weapons weren't a DSA producer at that time.
In the late 1940s and early 1950s, nuclear weapons did not provide an overwhelming advantage against conventional forces. Being able to drop dozens of ~kiloton-range fission bombs on eastern European battlefields would have been devastating, but not enough by itself to win a war. Only when you got to hundreds of silo-launched ICBMs with hydrogen bombs could you have gotten a true decisive strategic advantage.
He's the best person they could have gotten on the technical side, and Paul's strategic thinking has been consistently clear-eyed and realistic but also constructive; see for example this: www.alignmentforum.org/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model
So to the extent that he'll have influence on general policy as well this seems great!
This whole thing reminds me of Scott Alexander's pyramid essay. That seems like a really good case where there appears to be a natural statistical reference class, where you can easily get a giant Bayes factor that's "statistically well justified", and where, to all the counterarguments, you can say "well, the likelihood is 1 in 10^5 that the pyramids' latitude would match the speed of light in m/s". That's a good reductio against taking even fairly well-justified-sounding subjective Bayes factors at face value.
And I think it's built into your criticism that, because the problem is social and there's hidden evidence-filtering going on, there will also tend to be a meta-level explanation for why my coincidence-finding is different from your coincidence-finding.
Tom Davidson’s report: https://docs.google.com/document/d/1rw1pTbLi2brrEP0DcsZMAVhlKp6TKGKNUSFRkkdP_hs/edit?usp=drivesdk
My old 2020 post: https://www.lesswrong.com/posts/66FKFkWAugS8diydF/modelling-continuous-progress
In my analysis of Tom Davidson's "Takeoff Speeds" report, I found that the dynamics of AI capability improvement, as discussed in the context of a software-only singularity, align closely with the original simplified equation ( I′(t) = cI + f(I)I^2 ) from my four-year-old post on modelling continuous progress. Essentially, that post describes how we switch from exponential to hyperbolic growth as the fraction of AI research done by AIs improves along a logistic curve. These are all features of the far more complex mathematical model in Tom's report.
In this equation, ( I ) represents the intelligence or capability of the AI system. It corresponds to the cognitive output or the efficiency of the AI as described in the report, where the focus is on the software improvements contributing to the overall effectiveness of AI systems. The term ( cI ) in the equation can be likened to the constant external effort or input in improving AI systems, which is consistent with the ongoing research and development efforts mentioned in the report. This part of the equation represents the incremental improvements in AI capabilities due to human-led development efforts.
The second term in the equation, ( f(I)I^2 ), is particularly significant in understanding the relationship with the software-only singularity concept. Here, ( f(I) ) is a function that determines the extent to which the AI system can use its intelligence to improve itself, essentially a measure of recursive self-improvement (RSI). The report's discussion of a software-only singularity uses a similar concept, where AI systems reach a point at which their self-improvement significantly accelerates their capability growth. This is analogous to ( f(I) ) increasing, leading to a more substantial contribution of ( I^2 ) (the AI's self-improvement efforts) to the overall rate of intelligence growth, ( I′(t) ). As AI systems become more capable, they contribute more to their own development, a dynamic that both the equation and the report capture. The report has a 'FLOP gap' from when AIs start to contribute to research at all to when they fully take over, which essentially gives the upper and lower bounds to fit the ( f(I) ) curve to. Otherwise, the overall rate of change is sharper in Tom's report, as I ignored increasing investment and increasing compute in my model, focusing only on software self-improvement feedback loops.
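The exponential-to-hyperbolic switch can be seen directly by integrating the equation numerically. Here's a minimal sketch (all parameter values are illustrative choices of mine, not taken from either model):

```python
import math

# Euler integration of I'(t) = c*I + f(I)*I^2, where f(I) is a logistic
# curve standing in for the fraction of AI research automated.
# Parameters are illustrative, not fitted to either model.

def f(I, f_max=0.05, midpoint=10.0, steepness=1.0):
    """Logistic automation fraction: ~0 for small I, -> f_max for large I."""
    return f_max / (1.0 + math.exp(-steepness * (I - midpoint)))

def simulate(c=0.1, I0=1.0, dt=0.001, t_max=60.0):
    """Integrate I'(t); stop early once I blows up (the hyperbolic phase)."""
    I, t, trajectory = I0, 0.0, []
    while t < t_max and I < 1e6:
        trajectory.append((t, I))
        I += dt * (c * I + f(I) * I ** 2)
        t += dt
    return trajectory

traj = simulate()
# Early on f(I) ~ 0, so growth is ~exponential at rate c; once I passes
# the logistic midpoint the I^2 term dominates, growth turns hyperbolic,
# and I hits the 1e6 cutoff in finite time, well before t_max.
```

The 'FLOP gap' discussed above plays the role of pinning down the `midpoint` and `steepness` of the logistic: where automation starts mattering at all, and where it's complete.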
One other thing I liked about Tom's report is its focus on relatively outside-view-y estimates of what is needed for TAI: Bio Anchors and Epoch AI's Direct Approach.
Maybe this is an unreasonable demand, but one concern I have about all of these alleged attempts to measure the ability of an AI to automate scientific research is that this feels like an unusually slippery situation, where it's unusually easy to devise a metric that doesn't actually capture what's needed to dramatically accelerate research and development. Ideally, I'd like a metric where we know, as a matter of necessity, that a very high score means the system would be able to considerably speed up research.
For example, the Direct Approach estimate does have this property: if you can replicate, to a certain level of accuracy, what a human expert would say over a certain horizon length, you do in some sense have to be able to match or replicate the underlying thinking that produced it, which means being able to do long-horizon tasks. But of course, that's a very vague upper bound. It's not perfect: the horizon-length metric might only cover the 90th percentile of tasks at each time scale, and the remaining 10% might contain harder, more important tasks necessary for AI progress.
I think trying to anticipate and list, in a task specification, all the capabilities you think you need to automate scientific progress, when we don't really know what those are, will lead to a predictable underestimate of what's required.
I thought it was worth commenting here because, to me, the three-way debate between Eliezer Yudkowsky, Nora Belrose, and Andrew Critch managed to collectively touch on just about everything I think the common debate gets wrong about AI "doom", with the result that they're all overconfident in their respective positions.
Starting with Eliezer and Nora’s argument. Her statement:
"Alien shoggoths are about as likely to arise in neural networks as Boltzmann brains are to emerge from a thermal equilibrium.”
To which Eliezer responds,
"How blind to 'try imagining literally any internal mechanism that isn't the exact thing you hope for' do you have to be -- to think that, if you erase a brain, and then train that brain solely to predict the next word spoken by nice people, it ends up nice internally?"
I agree that it’s a mistake to identify niceness with predicting nice behaviour, and I agree that Nora is overconfident in there being no generalisation failures as a result of making a similar mistake. If your model says misalignment is literally as unlikely as a Boltzmann brain appearing from nowhere, then something has gone wrong. But I don’t think her point is as straightforward as just conflating a nice internal mechanism with nice feedback. I'm going to try to explain what I think her argument is.
I think that Eliezer has an implicit model on which there are zillions of potential generalisations for predicting niceness that a model could learn, all pretty much equally likely to get learned a priori, and actually being nice is just one of them, so it’s basically impossible for RLHF to hit on it; RLHF would require tremendous cosmic coincidences to work.
Maybe this is true in some sense for arbitrarily superintelligent AI. But, as Paul Christiano said, I think that this tells us not much about what to expect for “somewhat superhuman” AI. Which is what we care about for predicting whether we’ll see misalignment disasters in practice.
Rather, “actually learning to be nice” is how humans usually learn to predict nice behaviour. Of all the possible ways that generalisation from nice training could happen, this one is somewhat privileged as a hypothesis; it stands out from the background haze of random mechanisms that could be learned.
If the reasons this strategy worked for humans are transferable to the LLM case (and that is highly arguable and unclear), then yes, it might be true that giving agents rewards for being nice causes them to internally develop a sort of pseudo-niceness representation that controls their behaviour and planning even up to superhuman levels, even out of distribution. It’s not for ‘literally no reason’ or 'by coincidence' or 'because of a map-territory conflation', but because it's possible such a mechanism, in the form of a model inductive bias, really exists, and we have some vague evidence in favor of it.
Okay, so what’s the internal mechanism that I’m imagining which gets us there? Here’s a sketch, based on an “easy world” outlined in my alignment difficulty post.
Suppose that (up to some level of competence that’s notably superhuman for most engineering tasks), LLMs just search over potential writers of text, with RLHF selecting from the space of agents that have goals only over text completion. They can model the world, but since they start out modelling text, that’s what their goals range over, even up to considerably superhuman competence at a wide range of tasks. They don’t want things in the real world, and only model it to get more accurate text predictions. Therefore, you can just ask RLHF’d GPT-10, “what’s the permanent alignment solution?”, and it’ll tell you.
People still sometimes say, “doesn’t this require us to get unreasonably, impossibly lucky with generalisation?”. No: it requires luck, but you can’t say it’s unbelievably improbable luck based solely on not knowing how generalisation works. I also think recent evidence (LLMs getting better at modelling the world without developing goals over it) suggests this world is a bit more likely than it seemed years ago, as Paul Christiano argues here:
“I think that a system may not even be able to "want" things in the behaviorist sense, and this is correlated with being unable to solve long-horizon tasks. So if you think that systems can't want things or solve long horizon tasks at all, then maybe you shouldn't update at all when they don't appear to want things.”
But that's not really where we are at---AI systems are able to do an increasingly good job of solving increasingly long-horizon tasks. So it just seems like it should obviously be an update, and the answer to the original question
Could you give an example of a task you don't think AI systems will be able to do before they are "want"-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like "go to the moon" and that you will still be writing this kind of post even once AI systems have 10x'd the pace of R&D.)
But, again, I’m not making the claim that this favourable generalisation that gets RLHF to work is likely, just that it’s not a random complex hypothesis with no evidence for it that’s therefore near-impossible.
Since we don’t know how generalisation works, we can’t even say “we should have a uniform prior over internal mechanisms which I can describe that could get high reward”. Rather, if you don’t know, you really just don’t know, and the mechanism involving actually learning to be nice to predict niceness, or actually staying in the domain you were initially trained on when planning, might be favoured by inductive biases in training.
But even if you disagree with me on that, the supposed mistake is not (just) as simple as literally conflating the intent of the overseers with the goals the AI learns. Rather, the thought is that replicating the goals that produced the feedback, and simply adopting them as your own, is a natural, simple way to learn to predict what the overseer wants, even up to fairly superhuman capabilities, so it’s what will get learned by default even if it isn’t the globally optimal reward-maximiser. Is this true? Well, I don’t know, but it’s at least a more complicated mistake if false. This point has been made many times in different contexts; there's a summary discussion here that outlines six different presentations of this basic idea.
If I had to sum it up, I think that while Nora maybe confuses the map with the territory, Eliezer conflates ignorance with positive knowledge (from ‘we don’t know how generalisation works’ to ‘we should have a strong default uniform prior over every kind of mechanism we could name’).
Then there's Andrew Critch, who I think agrees with and understands the point I’ve just made (that Nora’s argument is not a simple mistake of the map for the territory), but who then makes a far more overreaching and unjustifiable claim than either Eliezer or Nora in response.
In the Nora/Eliezer case, they were both very confident in their respective models of AI generalisation, which is at least the kind of thing about which you could be extremely confident, should you have strong evidence (which I don’t think we do). Social science and futurism are not such things. Critch says,
" I think literally every human institution will probably fail or become fully dehumanized by sometime around (median) 2040."
The "multipolar chaos" prediction, which is that processes like a fast proliferating production web will demolish or corrupt all institutional opposition and send us to dystopia with near-certainty, I just don’t buy.
I’ve read his production web stories and heard similar arguments from many people, and it’s hard to voice my objections as specific “here’s why your story can’t happen” rebuttals (in fact, I think many of the stories are at least somewhat plausible), but I still think there’s a major error of reasoning going on. I think it’s related to the conjunction fallacy, sleepwalk bias, and possibly a desire not to come across as unreasonably optimistic about our institutions.
Here’s one of the production web stories in brief; you can read it in full, along with my old discussion, here:
In the future, AI-driven management assistant software revolutionizes industries by automating decision-making processes, including "soft skills" like conflict resolution. This leads to massive job automation, even at high management levels. Companies that don't adopt this technology fall behind. An interconnected "production web" of companies emerges, operating with minimal human intervention and focusing on maximizing production. They develop a self-sustaining economy, using digital currencies and operating beyond human regulatory reach. Over time, these companies, driven by their AI-optimized objectives, inadvertently prioritize their production goals over human welfare. This misalignment leads to the depletion of essential resources like arable land and drinking water, ultimately threatening human survival, as humanity becomes unable to influence or stop these autonomous corporate entities.
My object-level response is to say something mundane along the lines of: I think each of the following is more or less independent and not extremely unlikely to occur (each is above 1% likely):
- Wouldn’t governments and regulators also have access to AI systems to aid with oversight and especially with predicting the future? Remember, in this world we have pseudo-aligned AI systems that will more or less do what their overseers want in the short term.
- Couldn’t a political candidate ask their (aligned) strategist-AI “are we all going to be killed by this process in 20 years?” early in the process, and then make a persuasive campaign to change the public’s mind, using obvious evidence to their advantage?
- If the world is alarmed by the expanding production web and governments have a lot of hard power initially, why will enforcement necessarily be ineffective? If there’s a shadow economy of digital payments, just arrest anyone found dealing with a rogue AI system. This would scare a lot of people.
- We’ve already seen pessimistic views about what AI regulations can achieve self-confessedly falsified at the 98% level; there’s sleepwalk bias to consider. Stefan Schubert: "Yeah, if people think the policy response is '99th-percentile-in-2018', then that suggests their models have been seriously wrong." So maybe the regulations will be effective, foresightful, and well implemented, with AI systems foreseeing the long-run consequences of decisions and backing them up.
- What if the lead project is unitary and a singleton or the few lead projects quickly band together because they’re foresightful, so none of this race to the bottom stuff happens in the first place?
- If it gets to the point where water or the oxygen in the atmosphere is being used up (why would that happen, again? wouldn’t it just be easier for the machines to fly off into space than to deal with the presumed disvalue of doing something their original overseers didn’t like?), did nobody build in ‘off switches’?
- Even if they aren’t fulfilling our values perfectly, wouldn’t the production web just reach some equilibrium where it’s skimming off a small amount of resources to placate its overseers (since its various components are at least somewhat beholden to them) while expanding further and further?
And I already know the response is just going to be “Moloch wouldn’t let that happen…”, and that eventually competition will mean all of these barriers disappear. At this point, though, I think such a response is too broad and proves too much. Used this way, the Moloch idea becomes the classic mistaken "one big idea universal theory of history", which can explain nearly any outcome as long as it doesn't have to predict it.
A further point: I think that someone using this kind of reasoning in 1830 would have very confidently predicted that the world of 2023 would be a horrible dystopia where wages for workers wouldn’t have improved at all because of moloch.
I agree that it’s somewhat easier for me to write a realistic science fiction story set in 2045 that’s dystopian rather than utopian, assuming pseudo-aligned AGI and no wars or other obvious catastrophic misuse. As a broader point, I, along with the great majority of people, don’t really want this transition to happen either way, and there are many aspects of the ‘mediocre/utopian’ futures that would be suboptimal, so I get why the future forecasts never look normal or low-risk.
But I think all this speculation tells us very little with confidence what the default future looks like. I don’t think a dystopian economic race to the bottom is extremely unlikely, and with Matthew Barnett I am worried about what values and interests will influence AI development and think the case for being concerned about whether our institutions will hold is strong.
But saying that Moloch is a deterministic law of nature, such that we can be near-certain of the outcome, is not justifiable. These are not even the kinds of predictions about which you can have such certainty.
Also, in this case I think that a reference class/outside view objection that this resembles failed doomsday predictions of the past is warranted.
I don’t agree that these objections have much weight when we’re concerned about misaligned AI takeover as that has a clear, singular obvious mechanism to be worried about.
However, ‘Molochian race-to-the-bottom multipolar chaos’ does have the characteristic of ignoring or dismissing endogenous responses (society seeing what’s happening and deciding not to go down that path), as well as the unknown unknowns we saw with past failed doomsday predictions. I see this as absolutely in the same reference class as the people who in past decades were certain of overpopulation catastrophes, or the people now who are certain of, or think likely, a civilizational collapse from the effects of climate change. It’s taking current trends and drawing mental straight lines on them to extreme heights decades in the future.
I also expect that if implemented the plans in things like Project 2025 would impair the ability of the government to hire civil servants who are qualified and probably just degrade the US Government's ability to handle complicated new things of any sort across the board.
If you want a specific practical example of the difference between the two: we now have AIs capable of being deceptive when not specifically instructed to do so ('strategic deception'), but not of developing deceptive power-seeking goals completely opposite to what the overseer wants ('deceptive misalignment'). This demo from Apollo Research on strategic deception is the former, not the latter:
https://www.apolloresearch.ai/research/summit-demo
Doc Xardoc reports back on the Chinese alignment overview paper that it mostly treats alignment as an incidental engineering problem, at about a 2.5 on a 1-10 scale where Yudkowsky is a 10.
I'm pretty sure Yudkowsky is actually at around an 8.5 (I think he thinks alignment is not impossible in principle for ML-like systems, but maybe it is); a 10 would be 'impossible in principle'.
I think that, aside from the declaration and the promise of more summits, the creation of the AI Safety Institute and its remit are really good: it explicitly mentions auto-replication and deception evals, and plans to work with the likes of Apollo Research and ARC Evals to test for:
Also, NIST is proposing something similar.
I find this especially interesting because we now know that, in the absence of any empirical evidence of an instance of deceptive alignment, at least one major government is directing resources to developing deception evals anyway. If your model of politics doesn't allow for this, or considers it unlikely, then you need to reevaluate it, as Matthew Barnett said.
Additionally, the NIST consortium and AI Safety Institute both strike me as useful national-level implementations of the 'AI risk evaluation consortium' idea proposed by TFI.
King Charles notes (0:43 clip) that AI is getting very powerful and that dealing with it requires international coordination and cooperation. Good as far as it goes.
I find it amusing that for the first time in hundreds of years a king is once again concerned about superhuman non-physical threats (at least if you're a mathematical platonist about algorithms and predict instrumental convergence as a fundamental property of powerful minds) to his kingdom and the lives of his subjects. :)
I don't like the term pivotal act because it implies, without justification, that the risk elimination has to be a single action. Depending on the details of takeoff speed, that may or may not be a requirement, but if the final takeoff takes months or longer, then almost certainly there will be many actions, taken by humans plus AIs of varying capabilities, that together incrementally reduce total risk to low levels. I talk about this in terms of 'positively transformative AI', as that term doesn't bias you towards thinking risk reduction has to be a single action, even a nonviolent one.
Seeing the risk reduction as a single unitary action, like seeing it as a violent overthrow of all the world's governments, also makes the term seem more authoritarian, crazy, fantastical, and off-putting to anyone involved in real-world politics. So I'd recommend that in our thinking we make both the change you suggest and stop thinking of it as necessarily one action.
This as a general phenomenon (underrating strong responses to crises) was something I highlighted (calling it the Morituri Nolumus Mori) with a possible extension to AI all the way back in 2020. And Stefan Schubert has talked about 'sleepwalk bias' even earlier than that as a similar phenomenon.
https://twitter.com/davidmanheim/status/1719046950991938001
https://twitter.com/AaronBergman18/status/1719031282309497238
I think the short explanation as to why we're in some people's 98th-percentile world so far (and even my ~60th-percentile world) for AI governance success is this: if it was obvious to you in 2021 how transformative AI would be over the next couple of decades, and yet nothing happened, it seemed like governments were just generally incapable.
The fundamental attribution error makes you think governments are just not on the ball, don't care, or lack the capacity to deal with extinction risks, rather than that decision-makers didn't understand the obvious-to-you evidence that AI poses an extinction risk. Now that they do understand, they will react accordingly. It doesn't mean they will necessarily react well, but they will act on their belief in some manner.
Lomborg is massively overconfident in his predictions but not exactly less wrong than the implicit mainstream view that the economic impacts will definitely be ruinous enough to justify expensive policies.
It's very hard to know. The major problem is that the existing climate-econ models make so many simplifying assumptions that they're near-useless except for giving pretty handwavy lower bounds on damage, especially when the worst risks to worry about are correlated disasters and tail risks, and Lomborg makes the mistake of taking them completely literally. I discussed this at length a couple of years ago, and John Halstead later wrote a book-length report on what the climate impacts literature can and can't tell us.
Roon also lays down the beats.
For those who missed the reference
The ARC Evals result, in which a GPT-4-based agent, given help and a general directive to replicate, was able to figure out that it ought to lie to a TaskRabbit worker, is an example of an AI figuring out a self-preservation/power-seeking subgoal on the road to general self-preservation. But it doesn't demonstrate an AI spontaneously developing self-preservation or power-seeking as an instrumental subgoal of something that superficially has nothing to do with gaining power or replicating.
Of course, we have some real-world examples of specification gaming, like those you linked in your answer: those have always existed, and we see more 'intelligent' examples, like AIs convinced of false facts trying to convince people they're true.
There's supposedly some evidence here of power-seeking instrumental subgoals developing spontaneously, but how spontaneous this actually was is debatable, so I'd call that evidence ambiguous, since it wasn't in the wild.
>APS is less understood and poorly forecasted compared to AGI.
I should clarify that I was talking about the definition given in the Metaculus forecast, or in estimates like the Direct Approach. The latter is, roughly speaking, capability sufficient to pass a hard adversarial Turing test, plus human-like capabilities on enough intellectual tasks as measured by certain tests. This is something that can plausibly be upper-bounded by the Direct Approach methodology, which aims to predict when an AI could get a negligible error in predicting what a human expert would say over a specific time horizon. So this forecast is essentially a forecast of 'human-expert-writer-simulator AI', and that is the definition used in public elicitations like the Metaculus forecasts.
However, I agree with you that, while that's how the term is defined in some of the sources I cite, it's not what the word denotes (which is just generality, something GPT-4 plausibly has in some weak sense), and you also don't get from being able to simulate the writing of any human expert to takeover risk without making many additional assumptions.
I guess it does come down to Tyler's personal opinion, but would he accept asking IR and defense-policy experts about the chance of a war with China as an acceptable strategy, or would he insist on mathematical models of their behaviors and responses? To me it's clearly the wrong tool, just as in the climate impacts literature we can't get economic models of, e.g., how governments might respond to waves of climate refugees, but we can consult experts on it.
I recently held a workshop with PIBBSS fellows on the MTAIR model and thought some points from the overall discussion were valuable:
The discussants went over various scenarios related to AI takeover, including a superficially aligned system being delegated lots of power and gaining resources by entirely legitimate means, a WFLL2-like automation failure, and swift foom takeover. Some possibilities involved a more covert, silent coup where most of the work was done through manipulation and economic pressure. The concept of "$1T damage" as an intermediate stage to takeover appeared to be an unnatural fit with some of these diverse scenarios. There was some mention of whether mitigation or defensive spending should be considered as part of that $1T figure.
Alignment Difficulty and later steps merge many scenarios
The discussants interpreted "alignment is hard" (step 3) as implying that alignment is sufficiently hard that (given that APS is built), at least one APS is misaligned somewhere, and also that there's some reasonable probability that any given deployed APS is unaligned. This is the best way of making the whole statement deductively valid.
However, proposition 3 being true doesn't preclude the existence of other aligned APS AI (hard alignment and at least one unaligned APS might mean that there are leading conscientious aligned APS projects but unaligned reckless competitors). This makes discussion of the subsequent questions harder, as we have to condition on there possibly being aligned APS present as well which might reduce the risk of takeover.
This means that when assessing proposition 4, we have to condition on some worlds where aligned APS has already been deployed and used for defense, some where there have been warning shots and strong responses without APS, some where misaligned APS emerges out of nowhere and FOOMs too quickly for any response, and a slow takeoff where nonetheless every system is misaligned and there is a WFLL2 like takeover attempt, and add up the chance of large scale damage in all of these scenarios, conditioning on their probability, which makes coming to an overall answer to 4 and 5 challenging.
Definitions are value-laden and don't overlap: TAI, AGI, APS
We differentiated between Transformative AI (TAI), defined by Karnofsky, Barnett and Cotra entirely by its impact on the world, which can either be highly destructive or drive very rapid economic growth; General AI (AGI), defined through a variety of benchmarks including passing hard adversarial Turing tests and human-like capabilities on enough intellectual tasks; and APS, which focuses on long-term planning and human-like abilities only on takeover-relevant tasks. We also mentioned Paul Christiano's notion of the relevant metric being AI 'as economically impactful as a simulation of any human expert', which technically blends the definitions of AGI and TAI (since it doesn't necessarily require very fast growth but implies it strongly). Researchers disagree quite a lot on even which of these are harder: Daniel Kokotajlo has argued that APS likely comes before TAI and maybe even before (the Matthew Barnett definition of) AGI, while e.g. Barnett thinks that TAI comes after AGI, with APS AI somewhere in the middle (and possibly coincident with TAI).
In particular, some definitions of ‘AGI’, i.e. human-level performance on a wide range of tasks, could be much less than what is required for APS depending on what the specified task range is. If the human-level performance is only on selections of tasks that aren’t useful for outcompeting humans strategically (which could still be very many tasks, for example, human-level performance on everything that requires under a minute of thinking), the 'AGI system' could almost entirely lack the capabilities associated with APS. However, most of the estimates that could be used in a timelines estimate will revolve around AGI predictions (since they will be estimates of performance or accuracy benchmarks), which we risk anchoring on if we try to adjust them to predict the different milestones of APS.
In general it is challenging to use the probabilities from one metric like TAI to inform other predictions like APS, because each definition includes many assumptions about things that don't have much to do with AI progress (like how qualitatively powerful intelligence is in the real world, what capabilities are needed for takeover, what bottlenecks are there to economic automation or research automation etc.) In other words, APS and TAI are value-laden terms that include many assumptions about the strategic situation with respect to AI takeover, world economy and likely responses.
APS is less well understood and more poorly forecasted than AGI. Discussants felt that the current models for AGI can't easily be adapted for APS timelines or probabilities. APS carries much of the weight in the assessment due to its specific properties: many skeptics might argue that even if AGI is built, systems which do meet the definition of APS might never be built.
Alignment and Deployment Decisions
Several discussants suggested splitting the model’s third proposition into two separate components: one focusing on the likelihood of building misaligned APS systems (3a) and the other on the difficulty of creating aligned ones (3b). This would allow a more nuanced understanding of how alignment difficulties influence deployment decisions. They also emphasized that detection of misalignment would impact deployment, which wasn't sufficiently clarified in the original model.
Advanced Capabilities
There was a consensus that 'advanced capabilities' as a term is too vague. The discussants appreciated the attempt to narrow it down to strategic awareness and advanced planning but suggested breaking it down even further into more measurable skills, like hacking ability, economic manipulation, or propaganda dissemination. There are, however, disagreements regarding which capabilities are most critical (which can be seen as further disagreements about the difficulty of APS relative to AGI).
If strategic awareness comes before advanced planning, we might see AI systems capable of manipulating people, but not in ways that greatly exceed human manipulative abilities. As a result, these manipulations could potentially be detected and mitigated and even serve as warning signs that lower total risk. On the other hand, if advanced capabilities develop before strategic awareness or advanced planning, we could encounter AI systems that may not fully understand the world or their position in it, nor possess the ability to plan effectively. Nevertheless, these systems might still be capable of taking single, highly dangerous actions, such as designing and releasing a bioweapon.
Outside View & Futurism Reliability
We didn't cover the outside view considerations extensively, but various issues under the "accuracy of futurism" umbrella arose which weren't specifically mentioned.
The fact that markets don't seem to have reacted as if Transformative AI is a near-term prospect, and the lack of wide-scale scrutiny and robust engagement with risk arguments (especially those around alignment difficulty), were highlighted as reasons to doubt this kind of projection further.
The Fermi Paradox implies a form of X-risk that is self-destructive and not that compatible with AI takeover worries, while market interest rates also push the probability of such risks downward. The discussants recommended placing more weight on outside priors than we did in the default setting for the model, suggesting a 1:1 ratio compared to the model's internal estimations.
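As a hedged illustration of what a 1:1 weighting between an outside prior and an inside-view estimate could look like: the discussants' exact aggregation method isn't specified, but the geometric mean of odds is one common choice, and all numbers below are invented placeholders.

```python
import math

def blend_1_to_1(p_inside: float, p_outside: float) -> float:
    """Weight an inside-view estimate and an outside-view prior 1:1
    by taking the geometric mean of their odds."""
    odds = math.sqrt(
        (p_inside / (1 - p_inside)) * (p_outside / (1 - p_outside))
    )
    return odds / (1 + odds)

# Invented numbers: a 20% inside-view risk estimate pulled toward a 1% outside prior.
blended = blend_1_to_1(0.20, 0.01)
assert 0.01 < blended < 0.20  # the blend lands strictly between the two views
```

The point of the sketch is only that giving the outside prior equal weight pulls the final number substantially below the model's internal estimate.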
Discussants also agreed with the need to balance pessimistic survival-is-conjunctive views and optimistic survival-is-disjunctive views, arguing that the Carlsmith model is biased towards optimism and survival being disjunctive, but that the correct solution is not to simply switch to a pessimism-biased survival-is-conjunctive model in response.
Difficult to separate takeover from structural risk
There's a tendency to focus exclusively on the risks associated with misaligned APS systems seeking power, which can introduce a bias towards survival being predicated solely on avoiding APS takeover. However, this overlooks other existential risk scenarios that are more structural. There are potential scenarios without agentic power-seeking behavior but characterized by rapid change, driven by less causally clear factors such as technological advancement or societal shifts, which may not have a 'bad actor' but could still spiral into existential catastrophe. This post describes some of these scenarios in more detail.
This is a serious problem, but it is under active investigation at the moment, and the binary of regulation or pivotal act is a false dichotomy. Most approaches that I've heard of rely on some combination of positively transformative AI tech (basically lots of TAI technologies that reduce risks bit by bit, overall adding up to an equivalent of a pivotal act) and regulation to give time for the technologies to be used to strengthen the regulatory regime in various ways or improve the balance of defense over offense, until eventually we transition to a totally secure future: though of course this assumes at least (somewhat) slow takeoff.
You can see these interventions as acting on the conditional probabilities 4) and 5) in our model by driving down the chance that assuming misaligned APS is deployed, it can cause large-scale disasters.
4) Misaligned APS systems will be capable of causing a large global catastrophe upon deployment,
5) The human response to misaligned APS systems causing such a catastrophe will not be sufficient to prevent it from taking over completely,
6) Having taken over, the misaligned APS system will destroy or severely curtail the potential of humanity.
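As a rough illustration of how interventions on the conditional probabilities propagate (all numbers below are invented placeholders, not estimates from the model), the chain structure means that halving any single step halves the final risk:

```python
# Toy sketch of the conditional chain above. All probabilities are
# invented placeholders, not outputs of the MTAIR model.

def takeover_risk(p_aps, p_misaligned, p_catastrophe, p_no_response, p_curtailed):
    """Multiply the conditional probability of each step in the causal chain:
    APS built -> misaligned -> causes catastrophe -> response fails -> takeover."""
    return p_aps * p_misaligned * p_catastrophe * p_no_response * p_curtailed

baseline = takeover_risk(0.8, 0.5, 0.5, 0.5, 0.9)
# An intervention on step 4 (catastrophe given deployment) multiplies
# straight through to the final risk:
defended = takeover_risk(0.8, 0.5, 0.25, 0.5, 0.9)
assert abs(defended - baseline / 2) < 1e-12
```

This is why driving down either of the conditional probabilities 4) or 5) is a substantive risk-reduction strategy even if earlier steps are unchanged.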
This hasn't been laid out in lots of realistic detail yet, not least because most AI governance people are currently focused on near-term actions, like making sure the regulations are actually effective, since that's the most urgent task. But this doesn't reflect a belief that regulations alone are enough to keep us safe indefinitely.
Holden Karnofsky has written on this problem extensively:
- https://www.alignmentforum.org/posts/vZzg8NS7wBtqcwhoJ/nearcast-based-deployment-problem-analysis
- https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/
- https://www.lesswrong.com/posts/Fbk9H6ipfybHyqjrp/a-playbook-for-ai-risk-reduction-focused-on-misaligned-ai
- https://www.lesswrong.com/posts/jwhcXmigv2LTrbBiB/success-without-dignity-a-nearcasting-story-of-avoiding
'Oh, we've been writing up these concerns for 20 years and no one listens to us.' My view is quite different. I put out a call and asked a lot of people I know, well-informed people, 'Is there any actual mathematical model of this process of how the world is supposed to end?'...So, when it comes to AGI and existential risk, it turns out as best I can ascertain, in the 20 years or so we've been talking about this seriously, there isn't a single model done.
I think that MTAIR plausibly is a model of the 'process of how the world is supposed to end', in the sense that it runs through causal steps where each individual thing is conditioned on the previous thing (APS is developed, APS is misaligned, given misalignment it causes damage on deployment, given that the damage is unrecoverable), and for some of those inputs your probabilities and uncertainty distributions could themselves come from a detailed causal model (e.g. you can look at the Direct Approach for the first two questions).
For the later questions, e.g. what's the probability that an unaligned APS can inflict large disasters given that it is deployed, we can enumerate ways it could happen in detail, but to assess their probability you'd need to do a risk assessment with experts, not produce a mathematical model.
E.g. you wouldn't have a "mathematical model" of how likely a US-China war over Taiwan is; you'd do wargaming and ask experts or maybe superforecasters. Similarly, for the example he gave, COVID, there was a part of this that was a straightforward SEIR model and a part that was more sociological, about how the public response works (though of course a lot of the "behavioral science" then turned out to be wrong!).
So a correct 'mathematical model of the process' if we're being fair, would use explicit technical models for technical questions and for sociological/political/wargaming questions you'd use other methods. I don't think he'd say that there's no 'mathematical model' of nuclear war because while we have mathematical models of how fission and fusion works, we don't have any for how likely it is that e.g. Iran's leadership decides to start building nuclear weapons.
I think Tyler Cowen would accept that as sufficiently rigorous in that domain, and I believe that the earlier purely technical questions can be obtained from explicit models. One addition that could strengthen the model is to explicitly spell out different scenarios for each step (e.g. APS causes damage via autonomous weapons, economic disruption, etc). But the core framework seems sufficient as is, and also those concerns have been explained in other places.
What do you think?
The alignment difficulty scale is based on this post.
I really like this post and think it's a useful addendum to my own alignment difficulty scale (making it 2D, essentially). I was conceptualizing my scale as running along the diagonal line you provide from GPT-4 to sovereign AI, but on reflection I think your way of doing it is better.
In my original post I suggested that the 'target' level of capability we care about is the capability level needed to build positively transformative AI (pTAI), which is essentially the 'minimal aligned AGI that can do a pivotal act' notion but is more agnostic about whether it will be a unitary agentic system or many systems deployed over a period.
I think that what most people talk about when they talk about alignment difficulty isn't how hard the problem 'ultimately' is but rather how hard the problem is that we need to solve, with disagreements also being about e.g. how capable an AI you need for various pivotal/positively transformative acts.
I didn't split these up because I think that in a lot of people's minds the two run together in a fairly unprincipled way, but if we want a scale that corresponds to real things in the world having a 2D chart like this is better.
Update
This helpful article by Holden Karnofsky also describes an increasing scale of alignment difficulty, although it's focused on a narrower range of the scale than mine (his scale covers 4-7) and is a bit more detailed about the underlying causes of the misalignment. Here's how my scale relates to his:
The "playing the training game" threat model, where systems behave deceptively only to optimize in-episode reward, corresponds to an alignment difficulty level of 4 or higher. This is because scalable oversight without interpretability tools (level 4) should be sufficient to detect and address this failure mode. The AI may pretend to be helpful during training episodes, but oversight exposing it to new situations will reveal its tendency toward deception.
(Situationally aware) Deception by default corresponds to a difficulty level of 6. If misaligned AIs form complex inner goals and engage in long-term deception, then scalable oversight alone will not catch intentionally deceptive systems that can maintain consistent deceitful behavior. Only interpretability tools used as part of the oversight process (level 6) give us the ability to look inside the system and identify deceptive thought patterns and tendencies.
Finally, the gradient hacking threat model, where AIs actively manipulate their training to prevent alignment, represents an alignment difficulty of 7 or higher. Even interpretability-based oversight can be defeated by sufficiently sophisticated systems that alter their internals to dodge detection.
I think that, on the categorization I provided,
'Playing the training game' at all corresponds to an alignment difficulty level of 4, because better-than-human behavioral feedback and oversight can reveal it and you don't need interpretability.
(Situationally aware) Deception by default corresponds to a difficulty level of 6, because if the system is sufficiently capable no behavioral feedback will work and you need interpretability-based oversight.
Gradient hacking by default corresponds to a difficulty level of 7, because the system will also fight interpretability-based oversight and you need to think of something clever, probably through new research at 'crunch time'.
You're right, I've reread the section and that was a slight misunderstanding on my part.
Even so, I still think it falls at a 7 on my scale, as it's a way of experimentally validating oversight processes that gives you some evidence about how they'll work in unseen situations.
In the sense that there has to be an analogy between low and high capabilities somewhere, even if at the meta level.
This method lets you catch dangerous models that can break oversight processes for the same fundamental reasons as less dangerous models, not just for the same inputs.
Excellent! In particular, it seems like oversight techniques which can pass tests like these could work in worlds where alignment is very difficult, so long as AI progress doesn't involve a discontinuity so huge that local validity tells you nothing useful (such that there are no analogies between low and high capability regimes).
I'd say this corresponds to 7 on my alignment difficulty table.
There's a trollish answer to this point (that I somewhat agree with) which is to just say: okay, let's adopt moral uncertainty over all of the philosophically difficult premises too, so let's say there's only a 1% chance that raw intensity of pain matters and 99% that you need to be self-reflective in certain ways to have qualia and suffer in a way that matters morally, or that you should treat it as scaling with cortical neurons, or that only humans matter.
...and probably the math still works out very unfavorably.
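To see why, here is a hedged toy calculation: every figure is invented purely for illustration, but it shows how even a heavily discounted credence can dominate once multiplied by very large numbers of animals.

```python
import math

# Toy expected-value calculation under moral uncertainty.
# Every figure below is invented purely for illustration.

credence_raw_pain_matters = 0.01   # 1% credence in the 'raw intensity' view
weight_if_true = 0.001             # per-animal moral weight vs. a human, if true
n_animals = 1e12                   # rough order of magnitude of affected animals

expected_weight = credence_raw_pain_matters * weight_if_true * n_animals
# Even after the 1% discount, the expected moral weight is on the order of
# ten million human-equivalents, so the math stays very unfavorable.
assert math.isclose(expected_weight, 1e7)
```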
I say trollish because a decision procedure like this strikes me as likely to swamp and overwhelm you with way too many considerations pointing in all sorts of crazy directions, and to be just generally unworkable, so I feel like something has to be going wrong here.
Still, I do feel like the fact that the answer is non-obvious in this way, and does rely on philosophical reflection, means you can't draw many deep abiding conclusions about human empathy or the "worthiness" of human civilization (whatever that really means) from how we treat fish.
Today, Anthropic, Google, Microsoft and OpenAI are announcing the formation of the Frontier Model Forum, a new industry body focused on ensuring safe and responsible development of frontier AI models. The Frontier Model Forum will draw on the technical and operational expertise of its member companies to benefit the entire AI ecosystem, such as through advancing technical evaluations and benchmarks, and developing a public library of solutions to support industry best practices and standards.
The core objectives for the Forum are:
- Advancing AI safety research to promote responsible development of frontier models, minimize risks, and enable independent, standardized evaluations of capabilities and safety.
- Identifying best practices for the responsible development and deployment of frontier models, helping the public understand the nature, capabilities, limitations, and impact of the technology.
- Collaborating with policymakers, academics, civil society and companies to share knowledge about trust and safety risks.
- Supporting efforts to develop applications that can help meet society’s greatest challenges, such as climate change mitigation and adaptation, early cancer detection and prevention, and combating cyber threats.
This seems overall very good at first glance, and then seems much better once I realized that Meta is not on the list. There's nothing here that I'd call substantial capabilities acceleration (i.e. attempts to collaborate on building larger and larger foundation models, though some of this could be construed as making foundation models more useful for specific tasks). Sharing safety-capabilities research like better oversight or CAI techniques is plausibly strongly net positive even if the techniques don't scale indefinitely. And while this by itself is nowhere near sufficient to get us AI existential safety if alignment is very hard (and could increase complacency), it's still a big step in the right direction.
adversarial robustness, mechanistic interpretability, scalable oversight, independent research access, emergent behaviors and anomaly detection. There will be a strong focus initially on developing and sharing a public library of technical evaluations and benchmarks for frontier AI models.
The mention of combating cyber threats is also a step towards explicit pTAI.
BUT, crucially, because Meta is frozen out, we can know that this partnership isn't toothless and represents a commitment not to do the most risky and antisocial things that Meta presumably doesn't want to give up. The fact that they're the only major US AI company not to join will also be horrible PR for them.
Once I am caught up I intend to get my full Barbieheimer on some time next week, whether or not I do one right after the other. I’ll respond after. Both halves matter – remember that you need something to protect.
That's why it has to be Oppenheimer first, then Barbie. :)
When I look at the report, I do not see any questions about 2100 that are more ‘normal’, such as the size of the economy or population growth, other than the global temperature, which is expected to be essentially unchanged by AGI that is 75% likely to arrive by then. So AGI not only isn’t going to vent the atmosphere and boil the oceans or create a Dyson sphere, it also isn’t going to design us superior power plants or forms of carbon capture or safe geoengineering. This is a sterile AGI.
This doesn't feel like much of a slam dunk to me. If you think very transformative AI will be highly distributed, safe by default (i.e. 1-3 on the table) and arise on the slowest end of what seems possible, then maybe we don't coordinate to radically fix the climate and we just use TAI to adapt well individually, decarbonize and get fusion power and spaceships but don't fix the environment or melt the earth and just kind of leave it be because we can't coordinate well enough to agree on a solution. Honestly that seems not all that unlikely, assuming alignment, slow takeoff and a mediocre outcome.
If they'd asked about GDP and they'd just regurgitated the numbers given by the business as usual UN forecast after just being queried about AGI, then it would be a slam dunk that they're not thinking it through (unless they said something very compelling!). But to me while parts of their reasoning feel hard to follow there's nothing clearly crazy.
The view that the Superforecasters take seems to be something like: "I know all these benchmarks seem to imply we can't be more than a low number of decades off powerful AI, and these arguments and experiments imply super-intelligence should be soon after and could be unaligned, but I don't care; it all leads to an insane conclusion, so that just means the benchmarks are bullshit, or that one of the 'less likely' ways the arguments could be wrong is correct." (Note that they didn't disagree on the actual forecasts of what the benchmark scores would be, only their meaning!)
One thing I can say is that it very much reminds me of Da Shi in the novel Three Body Problem (who - and I know this is fictional evidence - ended up being entirely right in this interaction that the supposed 'miracle' of the CMB flickering was a piece of trickery)
You think that's not enough for me to worry about? You think I've got the energy to gaze at stars and philosophize?"
"You're right. All right, drink up!"
"But, I did indeed invent an ultimate rule."
"Tell me."
"Anything sufficiently weird must be fishy."
"What... what kind of crappy rule is that?"
"I'm saying that there's always someone behind things that don't seem to have an explanation."
"If you had even basic knowledge of science, you'd know it's impossible for any force to accomplish the things I experienced. Especially that last one. To manipulate things at the scale of the universe—not only can you not explain it with our current science, I couldn't even imagine how to explain it outside of science. It's more than supernatural. It's super-I-don't-know-what...."
"I'm telling you, that's bullshit. I've seen plenty of weird things."
"Then tell me what I should do next."
"Keep on drinking. And then sleep."
I think that before this announcement I'd have said that OpenAI was at around a 2.5 and Anthropic around a 3 in terms of what they've actually applied to existing models (which imo is fine for now; I think that doing more to things at GPT-4 capability levels is mostly wasted effort in terms of current safety impacts). Prior to the superalignment announcement I'd have said OpenAI and Anthropic were both aiming at a 5, i.e. oversight with research assistance, and DeepMind's stated plan was the best at a 6.5 (involving lots of interpretability and some experiments). Now OpenAI is also aiming at a 6.5, and Anthropic seems to be the laggard, still at a 5 unless I've missed something.
However the best currently feasible plan is still slightly better than either. I think e.g. very close integration of the deception and capability evals from the ARC and Apollo research teams into an experiment workflow isn't in either plan and should be, and would bump either up to a 7.
I don't see how anyone can have high justifiable confidence that it's zero instead of epsilon, given our general state of knowledge/confusion about philosophy of mind/consciousness, axiology, metaethics, metaphilosophy.
I tend to agree with Zvi's conclusion although I also agree with you that I don't know that it's definitely zero. I think it's unlikely (subjectively like under a percent) that the real truth about axiology says that insects in bliss are an absolute good, but I can't rule it out like I can rule out winning the lottery because no-one can trust reasoning in this domain that much.
What I'd say is just in general in 'weird' domains (AI strategy, thinking about longtermist prioritization, metaethics), because the stakes are large and the questions so uncertain, you run into a really large number of "unlikely but not really unlikely enough to honestly call it a Pascal's mugging" considerations: things you'd subjectively say are under 1% likely but over one in a million or so. I think the correct response in these uncertain domains is to mostly just ignore them, like you'd ignore things that are under one in a billion in a more certain and clear domain like construction engineering.
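That heuristic can be sketched as a simple filter; the cutoff and example numbers below are my own invented placeholders, not a worked-out decision theory.

```python
# Sketch of the heuristic above: in 'weird' domains, drop considerations
# whose subjective probability falls below a cutoff before taking expectations.
# The cutoff and example numbers are invented placeholders.

def filtered_ev(considerations, cutoff=1e-6):
    """Sum value * probability, ignoring anything below the cutoff."""
    return sum(value * prob for value, prob in considerations if prob >= cutoff)

considerations = [
    (1e12, 1e-9),   # mugging-sized: huge value, probability below cutoff, ignored
    (100.0, 0.25),  # ordinary consideration: kept
]
assert filtered_ev(considerations) == 25.0
```

The design choice being illustrated is just that the cutoff is domain-relative: far looser in uncertain domains than the one-in-a-billion bar you might use in construction engineering.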
The taskforce represents a startup government mindset that makes me optimistic
I would say it's not just a startup-government mindset in the abstract, but rather an attempt to repeat a specific, preexisting, highly successful example of startup government: the UK's COVID Vaccine Taskforce, which was name-checked in the original Foundation Model Taskforce announcement.
That was also a fast-moving attempt to solve a novel problem that regular scientific institutions were doing badly at; it substantially beat expectations and was run under an administration that has a lot of overlap with the current one, the major exceptions being a more stable and reasonable PM at the top (Sunak, not Boris) and no Dominic Cummings involved.
This is plausibly true for some solutions this research could produce like e.g. some new method of soft optimization, but might not be in all cases.
For levels 4-6 especially, the pTAI that's capable of e.g. automating alignment research or substantially reducing the risks of unaligned TAI might lack some of the expected 'general intelligence' of post-SLT AIs and be too unintelligent for techniques that rely on it having complete strategic awareness, self-reflection, a consistent decision theory, the ability to self-improve, or other post-SLT characteristics.
One (unrealistic) example, if we have a technique for fully loading the human CEV into a superintelligence ready to go that works for levels 8 or 9, that may well not help at all with improving scalable oversight of non-superintelligent pTAI which is incapable of representing the full human value function.
Thanks for this! Will add to the post, was looking for sources on this scenario
Will this be an alignment effort that can work, an alignment effort that cannot possibly work, an initial doomed effort capable of pivoting, or a capabilities push in alignment clothing?
As written, and assuming that the description is accurate and means what I think it means this seems like a very substantial advance in marginal safety. I'd say it seems like it's aimed at a difficulty level of 5 to 7 on my table,
https://www.lesswrong.com/posts/EjgfreeibTXRx9Ham/ten-levels-of-ai-alignment-difficulty#Table
I.e. experimentation on dangerous systems and interpretability play some role but the main thrust is automating alignment research and oversight, so maybe I'd unscientifically call it a 6.5, which is a tremendous step up from the current state of things (2.5) and would solve alignment in many possible worlds.
As ever though, if it doesn't work then it doesn't just not work, but rather will just make it look like the problem's been solved and advance capabilities.
deception induced by human feedback does not require strategic awareness - e.g. that thing with the hand which looks like it's grabbing a ball but isn't is a good example. So human-feedback-induced deception is more likely to occur, and to occur earlier in development, than deception from strategic awareness
The phenomenon that a 'better' technique is actually worse than a 'worse' technique if both are insufficient is something I talk about in a later section of the post and I specifically mention RLHF. I think this holds true in general throughout the scale, e.g. Eliezer and Nate have said that even complex interpretability-based oversight with robustness testing and AI research assistance is also just incentivizing more and better deception, so this isn't unique to RLHF.
But I tend to agree with Richard's view in his discussion with you under that post that while if you condition on deception occurring by default RLHF is worse than just prompting (i.e. prompting is better in harder worlds), RLHF is better than just prompting in easy worlds. I also wouldn't call non-strategically aware pursuit of inaccurate proxies for what we want 'deception', because in this scenario the system isn't being intentionally deceptive.
In easy worlds, the proxies RLHF learns are good enough in practice and cases like the famous thing with the hand which looks like it's grabbing a ball but isn't just disappear if you're diligent enough with how you provide feedback. In that world, not using RLHF would get systems pursuing cruder and worse proxies for what we want that fail often (e.g. systems just overtly lie to you all the time, say and do random things etc.). I think that's more or less the situation we're in right now with current AIs!
If the proxies that RLHF ends up pursuing are in fact close enough, then RLHF works and will make systems behave more reliably and be harder to e.g. jailbreak or provoke into random antisocial behavior than with just prompting. I did flag in a footnote that the 'you get what you measure' problem that RLHF produces could also be very difficult to deal with for structural or institutional reasons.
Next up, I'd put "Experiments with Potentially Catastrophic Systems to Understand Misalignment" as 4th-hardest world. If we can safely experiment with potentially-dangerous systems in e.g. a sandbox, and that actually works (i.e. the system doesn't notice when it's in testing and deceptively behave itself, or otherwise generalize in ways the testing doesn't reveal), then we don't really need oversight tools in the first place.
I'm assuming you meant fourth-easiest here, not fourth-hardest. It's important to note that I'm not here talking about testing systems to see if they misbehave in a sandbox and then, if they don't, assuming you've solved the problem and deploying. Rather, I'm talking about doing science with models that exhibit misaligned power-seeking, with the idea being that we learn general rules about e.g. how specific architectures generalize and why certain phenomena arise, rules that are theoretically sound and that we expect to hold true even post-deployment with much more powerful systems. Incidentally, this seems quite similar to what the OpenAI superalignment team is apparently planning.
So it's basically, "can we build a science of alignment through a mix of experimentation and theory". So if e.g. we study in a lab setting a model that's been fooled into thinking it's been deployed, then commits a treacherous turn, enough times we can figure out the underlying cause of the behavior and maybe get new foundational insights? Maybe we can try to deliberately get AIs to exhibit misalignment and learn from that. It's hard to anticipate in advance what scientific discoveries will and won't tell you about systems, and I think we've already seen cases of experiment-driven theoretical insights, like simulator theory, that seem to offer new handles for solving alignment. How much quicker and how much more useful will these be if we get the chance to experiment on very powerful systems?
You're right that I think this is more useful as an unscientific way for (probably less technical governance and strategy people) to orientate towards AI alignment than for actually carving up reality. I wrote the post with that audience and that framing in mind. By the same logic, your chart of how difficult various injuries and diseases are to fix would be very useful e.g. as a poster in a military triage tent even if it isn't useful for biologists or trained doctors.
However, while I didn't explore the idea much I do think that it is possible to cash this scale out as an actual variable related to system behavior, something along the lines of 'how adversarial are systems/how many extra bits of optimization over and above behavioral feedback are needed'. See here for further discussion on that. Evan Hubinger also talked in a bit more detail about what might be computationally different about ML models in low vs high adversarialness worlds here.