I found some prior relevant work and tagged them in https://www.lesswrong.com/tag/successor-alignment. I found the top few comments on https://www.lesswrong.com/posts/axKWaxjc2CHH5gGyN/ai-will-not-want-to-self-improve#comments and https://www.lesswrong.com/posts/wZAa9fHZfR6zxtdNx/agi-systems-and-humans-will-both-need-to-solve-the-alignment#comments helpful.
edit: another effect to keep in mind is that capabilities research may be harder to sandbag on because of clearer metrics.
Wanted to write a more thoughtful reply to this, but basically yes, my best guess is that the benefits of informing the world are in expectation bigger than the negatives from acceleration. A potentially important background view is that I think takeoff speeds matter more than timelines, and it's unclear to me how having FrontierMath affects takeoff speeds.
I wasn't thinking much about the optics, but I'd guess that's not a large effect. I agree that Epoch made a mistake here though and this is a negative.
I could imagine changing my mind somewhat easily.
I feel like I might be missing something, but conditional on scheming isn't it differentially useful for safety because by default scheming AIs would be more likely to sandbag on safety research than capabilities research?
Yes, that answer matches my understanding of the concern. If the vast majority of the dataset was private to Epoch, OpenAI could occasionally submit their solutions (probably via API) to Epoch for grading, but wouldn’t be able to use the dataset as a high-frequency evaluation across many experiments.
This is assuming that companies won’t fish out the data from API logs anyway, which the OP asserts but I think is unclear.
Also, if they have access to the mathematicians’ reasoning in addition to final answers, this could potentially be valuable without directly training on it (e.g. maybe they could use it to evaluate process-based grading approaches).
(FWIW I’m explaining the negatives, but I disagree with the comment I’m expanding on regarding the sign of Frontier Math, seems positive EV to me despite the concerns)
Sorry, fixed
Not representative of motivations for all people for all types of evals, but https://www.openphilanthropy.org/rfp-llm-benchmarks/, https://www.lesswrong.com/posts/7qGxm2mgafEbtYHBf/survey-on-the-acceleration-risks-of-our-new-rfps-to-study, https://docs.google.com/document/d/1UwiHYIxgDFnl_ydeuUq0gYOqvzdbNiDpjZ39FEgUAuQ/edit, and some posts in https://www.lesswrong.com/tag/ai-evaluations seem relevant.
Superforecasters can beat domain experts, as shown in Phil Tetlock's work comparing superforecasters to intelligence analysts.
This isn't accurate; see this post, especially (3a), (3b), and Goldstein et al (2015): https://docs.google.com/document/d/1ZEEaVP_HVSwyz8VApYJij5RjEiw3mI7d-j6vWAKaGQ8/edit?tab=t.0#heading=h.mma60cenrfmh
Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I'm intuitively skeptical.
Yes, that would be my guess, medium confidence.
One component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you're describing.
I'm skeptical of your skepticism. Not knowing basically anything about the CTF scene but using the competitive programming scene as an example, I think the median competitor is much more capable than the median software engineering professional, not less. People like competing at things they're good at.
I believe Cybench first solve times are based on the fastest top professional teams, rather than typical individual CTF competitors or cyber employees, for which the time to complete would probably be much higher (especially for the latter).
This is clarifying for me, appreciate it. If I believed (a) that we needed a paradigm shift like the ones to LLMs in order to get AI systems resulting in substantial AI R&D speedup, and (b) that trend extrapolation from benchmark data would not be informative for predicting these paradigm shifts, then I would agree that the benchmarks + gaps method is not particularly informative.
Do you think that's a fair summary of (this particular set of) necessary conditions?
(edit: didn't see @Daniel Kokotajlo's new comment before mine. I agree with him regarding disagreeing with both sub-claims but I think I have a sense of where you're coming from.)
A few possible categories of situations we might have long timelines, off the top of my head:
- Benchmarks + gaps is still best: the overall gap is somewhat larger, plus there's a slowdown in compute doubling time after 2028, but trend extrapolations still tell us something about gap trends. This is how I would most naturally think about timelines that stretch through maybe the 2030s, and potentially beyond if neither of the next two holds.
- Other approaches are best (more than one of these can be true):
- The current benchmarks and evaluations are so far away from AGI that trends on them don't tell us anything (including regarding how fast gaps might be crossed). In this case one might want to identify the 1-2 most important gaps and reason about when we will cross these based on gears-level reasoning or trend extrapolation/forecasting on "real-world" data (e.g. revenue?) rather than trend extrapolation on benchmarks. Example candidate "gaps" that I often hear for these sorts of cases are the lack of feedback loops and the "long-tail of tasks" / reliability.
- A paradigm shift in AGI training is needed and benchmark trends don't tell us much about when we will achieve this (this is basically Steven's sibling comment): in this case the best analysis might involve looking at the base rate of paradigm shifts per research effort, and/or looking at specific possible shifts.
^ this taxonomy is not comprehensive, just things I came up with quickly. I might be missing something that would be good to include.
To give a cop-out answer to your question: I feel like if I were making a long-timelines argument I'd argue that all 3 of those would be ways of forecasting to give weight to, then aggregate. If I had to choose just one I'd probably still go with (1) though.
edit: oh there's also the "defer to AI experts" argument. I mostly try not to think about deference-based arguments because thinking on the object-level is more productive, though I think if I were really trying to make an all-things-considered timelines distribution there's some chance I would adjust to longer due to deference arguments (but also some chance I'd adjust toward shorter, given that lots of people who have thought deeply about AGI / are close to the action have short timelines).
There's also "base rate of super crazy things happening is low" style arguments which I don't give much weight to.
For context, in a sibling comment Ryan said (and Steven agreed with):
It sounds like your disagreement isn't with drawing a link from RE-bench to (forecasts for) automating research engineering, but is instead with thinking that you can get AGI shortly after automating research engineering due to AI R&D acceleration and already being pretty close. Is that right?
Note that the comment says research engineering, not research science.
Now responding on whether I think the no new paradigms assumption is needed:
(Obviously you’re entitled to argue / believe that we don’t need new AI paradigms and concepts to get to AGI! It’s a topic where I think reasonable people disagree. I’m just suggesting that it’s a necessary assumption for your argument to hang together, right?)
I generally have not been thinking in these sorts of binary terms but instead thinking in terms more like "Algorithmic progress research is moving at pace X today, if we had automated research engineers it would be sped up to N*X." I'm not necessarily taking a stand on whether the progress will involve new paradigms or not, so I don't think it requires an assumption of no new paradigms.
However:
- If you think almost all new progress in some important sense will come from paradigm shifts, the forecasting method becomes weaker because the incremental progress doesn't say as much about progress toward automated research engineering or AGI.
- You might think that it's more confusing than clarifying to think in terms of collapsing all research progress into a single "speed" and forecasting based on that.
- Requiring a paradigm shift might lead to placing less weight on lower amounts of required research effort, and even if the probability distribution is the same, what we should expect to see in the world leading up to AGI is not.
I'd also add that:
- Regarding what research tasks I'm forecasting for the automated research engineer: REBench is not supposed to fully represent the tasks involved in actual research engineering. That's why we have the gaps.
- Regarding to what extent having an automated research engineer would speed up progress in worlds in which we need a paradigm shift: I think it's hard to separate out conceptual from engineering/empirical work in terms of progress toward new paradigms. My guess would be being able to implement experiments very cheaply would substantially increase the expected number of paradigm shifts per unit time.
Here's the structure of the argument that I am most compelled by (I call it the benchmarks + gaps argument), though I'm uncertain about the details; a toy sketch of how the steps combine numerically follows the list.
1. Focus on the endpoint of substantially speeding up AI R&D / automating research engineering. Let's define our timelines endpoint as something that ~5xs the rate of AI R&D algorithmic progress (compared to a counterfactual world with no post-2024 AIs). Then make an argument that ~fully automating research engineering (experiment implementation/monitoring) would do this, along with research taste of at least the 50th percentile AGI company researcher (experiment ideation/selection).
2. Focus on REBench, since it's the most relevant benchmark here; for simplicity I'll focus only on it, though for robustness more benchmarks should be considered.
3. Based on trend extrapolation and benchmark base rates, estimate roughly 50% that we'll saturate REBench by end of 2025.
4. Identify the most important gaps between saturating REBench and the endpoint defined in (1): (a) time horizon as measured by human time spent, (b) tasks with worse feedback loops, (c) tasks with large codebases, and (d) becoming significantly cheaper and/or faster than humans. There are some more, but my best guess is that these 4 are the most important; we should also take unknown gaps into account.
5. Forecast the time to cross the gaps. It seems quite plausible that we get to the substantial AI R&D speedup within a few years after saturating REBench, so by end of 2028 (and significantly earlier doesn't seem crazy).
   - This is the most important part of the argument, and one that I have lots of uncertainty over. We have some data regarding the "crossing speed" of some of the gaps, but the data are quite limited at the moment. So there are a lot of judgment calls needed, and people with strong long-timelines intuitions might think the remaining gaps will take a long time to cross without this being close to falsified by our data.
   - This is broken down into "time to cross the gaps at the 2024 pace of progress" -> adjusting based on compute forecasts and intermediate AI R&D speedups before reaching 5x.
6. From substantial AI R&D speedup to AGI. Once we have the 5xing AIs, that's potentially already AGI by some definitions, but if you have a stronger one, the possibility of a somewhat fast takeoff means you might get it within a year or so after.
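To make the structure concrete, here's a toy Monte Carlo sketch of how the steps combine (every distribution and parameter value below is an illustrative placeholder I made up for this sketch, not my actual forecast):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Step 3: year REBench saturates (placeholder spread; median ~2026.0, i.e. roughly end of 2025).
saturation_year = 2025.0 + rng.lognormal(mean=0.0, sigma=0.5, size=N)

# Step 5: years to cross the remaining gaps (time horizon, feedback loops,
# large codebases, cost/speed) at the 2024 pace of progress -- placeholder distribution.
gap_years_at_2024_pace = rng.lognormal(mean=1.0, sigma=0.8, size=N)

# Adjust for compute growth and intermediate AI R&D speedups before reaching 5x
# (placeholder: progress averages 1x-3x the 2024 pace over the period).
acceleration = rng.uniform(1.0, 3.0, size=N)
year_of_5x_speedup = saturation_year + gap_years_at_2024_pace / acceleration

# Step 6: lag from the 5x AI R&D speedup to AGI (placeholder: 0-2 years).
year_of_agi = year_of_5x_speedup + rng.uniform(0.0, 2.0, size=N)

for q in (10, 50, 90):
    print(f"{q}th percentile AGI year: {np.percentile(year_of_agi, q):.1f}")
```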
One reason I like this argument is that it will get much stronger over time as we get more difficult benchmarks and otherwise get more data about how quickly the gaps are being crossed.
I have a longer draft which makes this argument but it's quite messy and incomplete and might not add much on top of the above summary for now. Unfortunately I'm prioritizing other workstreams over finishing this at the moment. DM me if you'd really like a link to the messy draft.
Thanks. I edited again to be more precise. Maybe I'm closer to the median than I thought.
(edit: unimportant clarification. I just realized "you all" may have made it sound like I thought every single person on the Lightcone team was higher than my p(doom). I meant it to be more like a generic y'all to represent the group, not a claim about the minimum p(doom) of the team)
Yeah I meant more on p(doom)/alignment difficulty than timelines, I'm not sure what your guys' timelines are. I'm roughly in the 35-55% ballpark for a misaligned takeover, and my impression is that you all are closer to but not necessarily all the way at the >90% Eliezer view. If that's also wrong I'll edit to correct.
edit: oh maybe my wording of "farther" in the original comment was specifically confusing and made it sound like I was talking about timelines. I will edit to clarify.
Appreciate the post. I've previously donated $600 through the EA Manifund thing and will consider donating again late this year / early next year when thinking through donations more broadly.
I've derived lots of value with regards to thinking through AI futures from LW/AIAF content (some non-exhaustive standouts: 2021 MIRI conversations, List of Lethalities and Paul response, t-AGI framework, Without specific countermeasures..., Hero Licensing). It's unclear to me how much of the value would have been retained if LW didn't exist, but plausibly LW is responsible for a large fraction.
In a few ways I feel not fully/spiritually aligned with the LW team and the rationalist community: my alignment difficulty/p(doom)[1] is farther from Eliezer's[2] than my perception of the median of the LW team[3] (though closer to Eliezer than most EAs), I haven't felt sucked in by most of Eliezer's writing, and I feel gut-level cynical about people's ability to deliberatively improve their rationality (edit: with large effect size) (I haven't spent a long time examining evidence to decide whether I really believe this).
But still LW has probably made a large positive difference in my life, and I'm very thankful. I've also enjoyed Lighthaven, but I have to admit I'm not very observant and opinionated on conference venues (or web design, which is why I focused on LW's content).
Twitter AI (xAI), which seemingly had no prior history of strong AI engineering, with a small team and limited resources
Both of these seem false.
Re: talent: they don't list their team on their site, but I know their early team includes Igor Babuschkin, who has worked at OAI and DeepMind, and Christian Szegedy, who has 250k+ citations including several foundational papers.
Re: resources, according to Elon's early July tweet (ofc take Elon with a grain of salt) Grok 2 was trained on 24k H100s (approximately 3x the FLOP/s of GPT-4, according to SemiAnalysis). And xAI was working on a 100k H100 cluster that was on track to be finished in July. Also they raised $6B in May.
And internally, we have an anonymous RSP non-compliance reporting line so that any employee can raise concerns about issues like this without any fear of retaliation.
Are you able to elaborate on how this works? Are there any other details about this publicly? I couldn't find more via a quick search.
Some specific qs I'm curious about: (a) who handles the anonymous complaints, (b) what is the scope of behavior explicitly (and implicitly, via cultural norms) covered here, and (c) how are situations handled where a report would deanonymize the reporter (or narrow them down to a small number of people)?
Thanks for the response!
I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we'd re-run them ahead of schedule.
[...]
I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we'd develop for the next round
Thanks for these clarifications. I didn't realize that the 30% was for the new yellow-line evals rather than the current ones.
Since triggering a yellow-line eval requires pausing until we have either safety and security mitigations or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals
I'm having trouble parsing this sentence. What do you mean by "doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals"? Doesn't pausing include focusing on mitigations and evals?
From the RSP Evals report:
As a rough attempt at quantifying the elicitation gap, teams informally estimated that, given an additional three months of elicitation improvements and no additional pretraining, there is a roughly 30% chance that the model passes our current ARA Yellow Line, a 30% chance it passes at least one of our CBRN Yellow Lines, and a 5% chance it crosses cyber Yellow Lines. That said, we are currently iterating on our threat models and Yellow Lines so these exact thresholds are likely to change the next time we update our Responsible Scaling Policy.
What's the minimum X% that could replace 30% and would be treated the same as passing the yellow line immediately, if any? If you think that there's an X% chance that with 3 more months of elicitation, a yellow line will be crossed, what's the decision-making process for determining whether you should treat it as already being crossed?
In the RSP it says "It is important that we are evaluating models with close to our best capabilities elicitation techniques, to avoid underestimating the capabilities it would be possible for a malicious actor to elicit if the model were stolen" so it seems like folding in some forecasted elicited capabilities into the current evaluation would be reasonable (though they should definitely be discounted the further out they are).
(I'm not particularly concerned about catastrophic risk from the Claude 3 model family, but I am interested in the general policy here and the reasoning behind it)
The word "overconfident" seems overloaded. Here are some things I think that people sometimes mean when they say someone is overconfident:
- They gave a binary probability that is too far from 50% (I believe this is the original one)
- They overestimated a binary probability (e.g. they said 20% when it should be 1%)
- Their estimate is arrogant (e.g. they say there's a 40% chance their startup fails when it should be 95%), or maybe they give an arrogant vibe
- They seem too unwilling to change their mind upon arguments (maybe their credal resilience is too high)
- They gave a probability distribution that seems wrong in some way (e.g. "50% AGI by 2030 is so overconfident, I think it should be 10%")
- This one is pernicious in that any probability distribution gives very low percentages for some range, so being specific here seems important.
- Their binary estimate or probability distribution seems too different from some sort of base rate, reference class, or expert(s) that they should defer to.
How much does this overloading matter? I'm not sure, but one worry is that it allows people to score cheap rhetorical points by claiming someone else is overconfident when in practice they might mean something like "your probability distribution is wrong in some way". Beware of accusing someone of overconfidence without being more specific about what you mean.
I think a population of 356 or more people is needed for there to be a >5% chance of 2+ deaths from that population in a 2-month span.
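For anyone who wants to check the arithmetic, here's a minimal sketch of the calculation under a simple binomial model (the per-person death probability below is a made-up placeholder, not the rate I actually used):

```python
def prob_two_or_more_deaths(n: int, p: float) -> float:
    """P(at least 2 deaths) if each of n people independently dies
    with probability p over the 2-month window (binomial model)."""
    return 1 - (1 - p) ** n - n * p * (1 - p) ** (n - 1)

p = 0.0009  # placeholder per-person probability of death in a 2-month window

# Smallest population size with a >5% chance of 2+ deaths.
n = 1
while prob_two_or_more_deaths(n, p) <= 0.05:
    n += 1
print(n, prob_two_or_more_deaths(n, p))
```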
[cross-posting from blog]
I made a spreadsheet for forecasting the 10th/50th/90th percentile for how you think GPT-4.5 will do on various benchmarks (given 6 months after the release to allow for actually being applied to the benchmark, and post-training enhancements). Copy it here to register your forecasts.
If you’d prefer, you could also use it to predict for GPT-5, or for the state-of-the-art at a certain time e.g. end of 2024 (my predictions would be pretty similar for GPT-4.5, and end of 2024).
You can see my forecasts made with ~2 hours of total effort on Feb 17 in this sheet; I won’t describe them further here in order to avoid anchoring.
There might be a similar tournament on Metaculus soon, but not sure on the timeline for that (and spreadsheet might be lower friction). If someone wants to take the time to make a form for predicting, tracking and resolving the forecasts, be my guest and I’ll link it here.
This is indeed close enough to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs).
FYI at the time that doc was created, Epoch had 9e25. Now the notebook says 7.7e25 but their webpage says 5e25. Will ask them about it.
Interesting, thanks for clarifying. It's not clear to me that this is the right primary frame to think about what would happen, as opposed to just thinking first about how big compute bottlenecks are and then adjusting the research pace for that (and then accounting for diminishing returns to more research).
I think a combination of both perspectives is best; an argument in favor of your frame is that there will be some low-hanging fruit from changing your workflow to adapt to the new cognitive labor.
Physical bottlenecks still exist, but is it really that implausible that the capabilities workforce would stumble upon huge algorithmic efficiency improvements? Recall that current algorithms are much less efficient than the human brain. There's lots of room to go.
I don't understand the reasoning here. It seems like you're saying "Well, there might be compute bottlenecks, but we have so much room left to go in algorithmic improvements!" But the room to improve point is already the case right now, and seems orthogonal to the compute bottlenecks point.
E.g. if compute bottlenecks are theoretically enough to turn the 5x cognitive labor into only 1.1x overall research productivity, it will still be the case that there is lots of room for improvement but the point doesn't really matter as research productivity hasn't sped up much. So to argue that the situation has changed dramatically you need to argue something about how big of a deal the compute bottlenecks will in fact be.
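To illustrate the kind of calculation I have in mind, here's a toy CES sketch (the functional form is a standard CES production function; all parameter values are made-up illustrations, not numbers from Tom's report):

```python
def ces_output(compute: float, labor: float, rho: float, alpha: float = 0.5) -> float:
    """CES aggregator over compute and cognitive labor.
    Elasticity of substitution is 1 / (1 - rho); rho << 0 means the inputs
    are strong complements, i.e. compute bottlenecks bind."""
    return (alpha * compute ** rho + (1 - alpha) * labor ** rho) ** (1 / rho)

# Multiply cognitive labor by 5x while holding compute fixed, for varying complementarity.
for rho in (-0.5, -2.0, -8.0):
    speedup = ces_output(1.0, 5.0, rho) / ces_output(1.0, 1.0, rho)
    print(f"rho={rho}: 5x cognitive labor -> {speedup:.2f}x overall research output")
```

As rho gets more negative, the 5x in cognitive labor buys less and less overall output; that's the sense in which the size of the compute bottleneck, not the room for algorithmic improvement, is doing the work.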
Imagine the current AGI capabilities employee's typical work day. Now imagine they had an army of AI assistants that can very quickly do 10 hours worth of their own labor. How much more productive is that employee compared to their current state? I'd guess at least 5x. See section 6 of Tom Davidson's takeoff speeds framework for a model.
Can you elaborate how you're translating 10-hour AI assistants into a 5x speedup using Tom's CES model?
I agree that <15% seems too low for most reasonable definitions of 1-10 hours and the singularity. But I'd guess I'm more sympathetic than you, depending on the definitions Nathan had in mind.
I think both of the phrases "AI capable doing tasks that took 1-10 hours" and "hit the singularity" are underdefined and making them more clear could lead to significantly different probabilities here.
- For "capable of doing tasks that took 1-10 hours in 2024":
- If we're saying that "AI can do every cognitive task that takes a human 1-10 hours in 2024 as well as (edit: the best) human expert", I agree it's pretty clear we're getting extremely fast progress at that point, not least because AI will be able to do the vast majority of tasks that take much longer than that by the time it can do all 1-10 hour tasks.
- However, if we're using a weaker definition like the one Richard used ("on most cognitive tasks, it beats most human experts who are given 1-10 hours to perform the task"), I think it's much less clear, due to human interaction bottlenecks.
- Also, it seems like the distribution of relevant cognitive tasks that you care about changes a lot on different time horizons, which further complicates things.
- Re: "hit the singularity", I think in general there's little agreement on a good definition here e.g. the definition in Tom's report is based on doubling time of "effective compute in 2022-FLOP" shortening after "full automation", which I think is unclear what it corresponds to in terms of real-world impact as I think both of these terms are also underdefined/hard to translate into actual capability and impact metrics.
I would be curious to hear the definitions you and Nathan had in mind regarding these terms.
In his AI Insight Forum statement, Andrew Ng puts 1% on "This rogue AI system gains the ability (perhaps access to nuclear weapons, or skill at manipulating people into using such weapons) to wipe out humanity" in the next 100 years (conditional on a rogue AI system that doesn't go unchecked by other AI systems existing). And overall 1 in 10 million of AI causing extinction in the next 100 years.
Among existing alignment research agendas/projects, Superalignment has the highest expected value
I'm mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum.
I had the impression that it was more than just that, given the line: "In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention." and the further attention devoted to deceptive alignment.
I appreciate these predictions, but I am not as interested in predicting personal or public opinions. I'm more interested in predicting regulatory stringency, quality, and scope.
If you have any you think faithfully represent a possible disagreement between us go ahead. I personally feel it will be very hard to operationalize objective stuff about policies in a satisfying way. For example, a big issue with the market you've made is that it is about what will happen in the world, not what will happen without intervention from AI x-risk people. Furthermore it has all the usual issues with forecasting on complex things 12 years in advance, regarding the extent to which it operationalizes any disagreement well (I've bet yes on it, but think it's likely that evaluating and fixing deceptive alignment will remain mostly unsolved in 2035 conditional on no superintelligence, especially if there were no intervention from x-risk people).
I have three things to say here:
Thanks for clarifying.
Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won't be solved either without heroic effort. I'm also sympathetic to Dan Hendrycks' arguments about AI evolution. I will add these points to the post.
Don't have a strong opinion here, but intuitively feels like it would be hard to find tractable angles for work on this now.
I mostly think people should think harder about what the hard parts of AI risk are in the first place. It would not be surprising if the "hard bits" will be things that we've barely thought about, or are hard to perceive as major problems, since their relative hiddenness would be a strong reason to believe that they will not be solved by default.
Maybe. In general, I'm excited about people who have the talent for it to think about previously neglected angles.
The problem of "make sure policies are well-targeted, informed by the best evidence, and mindful of social/political difficulties" seems like a hard problem that societies have frequently failed to get right historically, and the relative value of solving this problem seems to get higher as you become more optimistic about the technical problems being solved.
I agree this is important and it was in your post but it seems like a decent description of what the majority of AI x-risk governance people are already working on, or at least not obviously a bad one. This is the phrase that I was hoping would get made more concrete.
I want to emphasize that the current policies were crafted in an environment in which AI still has a tiny impact on the world. My expectation is that policies will get much stricter as AI becomes a larger part of our life. I am not making the claim that current policies are sufficient; instead I am making a claim about the trajectory, i.e. how well we should expect society to respond at a time, given the evidence and level of AI capabilities at that time.
I understand this (sorry if wasn't clear), but I think it's less obvious than you do that this trend will continue without intervention from AI x-risk people. I agree with other commenters that AI x-risk people should get a lot of the credit for the recent push. I also provided example reasons that the trend might not continue smoothly or even reverse in my point (3).
There might also be disagreements around:
- Not sharing your high confidence in slow, continuous takeoff.
- The strictness of regulation needed to make a dent in AI risk, e.g. if substantial international coordination is required it seems optimistic to me to assume that the trajectory will by default lead to this.
- The value in things getting done faster than they would have done otherwise, even if they would have been done either way. This indirectly provides more time to iterate and get to better, more nuanced policy.
I believe that current evidence supports my interpretation of our general trajectory, but I'm happy to hear someone explain why they disagree and highlight concrete predictions that could serve to operationalize this disagreement.
Operationalizing disagreements well is hard and time-consuming especially when we're betting on "how things would go without intervention from a community that is intervening a lot", but a few very rough forecasts, all conditional on no TAI before resolve date:
- 75%: In Jan 2028, less than 10% of Americans will consider AI the most important problem.
- 60%: In Jan 2030, Evan Hubinger will believe that if x-risk-motivated people had not worked on deceptive alignment at all, risk from deceptive alignment would be at least 50% higher, compared to a baseline of no work at all (i.e. if risk is 5% and it would be 9% with no work from anyone, it needs to have been >7% if no work from x-risk people had been done to resolve yes).
- 35%: In Jan 2028, conditional on a Republican President being elected in 2024, regulations on AI in the US will be generally less stringent than they were when the previous president left office.
Thus, due to no one's intent, insufficiently justified concerns about current open-source AI are propagated to governance orgs, which recommend banning open source based on this research.
The recommendation that current open-source models should be banned is not present in the policy paper being discussed, AFAICT. The paper's recommendations are pictured below:
Edited to add: there is a specific footnote that says "Note that we do not claim that existing models are already too risky. We also do not make any predictions about how risky the next generation of models will be. Our claim is that developers need to assess the risks and be willing to not open-source a model if the risks outweigh the benefits" on page 31
I agree much of the community (including me) was wrong or directionally wrong in the past about the level of AI regulation and how quickly it would come.
Regarding the recommendations made in the post for going forward given that there will be some regulation, I feel confused in a few ways.
- Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones
- You argue that we perhaps shouldn't invest as much in preventing deceptive alignment because "regulators will likely adapt, adjusting policy as the difficulty of the problem becomes clearer"
- If we are assuming that regulators will adapt and adjust regarding deception, can you provide examples of interventions that policymakers will not be able to solve themselves and why they will be less likely to notice and deal with them than deception?
- You say "we should question how plausible it is that society will fail to adequately address such an integral part of the problem". What things aren't integral parts of the problem but that should be worked on?
- I feel we would need much better evidence of things being handled competently to invest significantly less into integral parts of the problem.
- You say: 'Of course, it may still be true that AI deception is an extremely hard problem that reliably resists almost all attempted solutions in any “normal” regulatory regime, even as concrete evidence continues to accumulate about its difficulty—although I consider that claim unproven, to say the least'
- If we expect some problems in AI risk to be solved by default mostly by people outside the community, it feels to me like one takeaway would be that we should shift resources to portions of the problem that we expect to be the hardest
- To me, intuitively, deceptive alignment might be one of the hardest parts of the problem as we scale to very superhuman systems, even if we condition on having time to build model organisms of misalignment and experiment with them for a few years. So I feel confused about why you claim a high level of difficulty is "unproven" as a dismissal; of course it's unproven but you would need to argue that in worlds where the AI risk problem is fairly hard, there's not much of a chance of it being very hard.
- As someone who is relatively optimistic about concrete evidence of deceptive alignment increasing substantially before a potential takeover, I think I still put significantly lower probability on it than you do due to the possibility of fairly fast takeoff.
- I feel like this post is to some extent counting our chickens before they hatch (tbc I agree with the directional update as I said above). I'm not an expert on what's going on here but I imagine any of the following happening (non-exhaustive list) that make the current path to potentially sensible regulation in the US and internationally harder:
- The EO doesn't lead to as many resources dedicated to AI-x-risk-reducing things as we might hope. I haven't read it myself, just the fact sheet and Zvi's summary but Zvi says "If you were hoping for or worried about potential direct or more substantive action, then the opposite applies – there is very little here in the way of concrete action, only the foundation for potential future action."
- A Republican President comes to power in the US and reverses a lot of the effects of the EO
- Rishi Sunak gets voted out in the UK (my sense is that this is likely) and the new Prime Minister is much less gung-ho on AI risk
- I don't have strong views on the value of AI advocacy, but this post seems overconfident in calling it out as being basically not useful based on recent shifts.
- It seems likely that much stronger regulations will be important, e.g. the model reporting threshold in the EO was set relatively high and many in the AI risk community have voiced support for an international pause if it were politically feasible, which the EO is far from.
- The public still doesn't consider AI risk to be very important. <1% of the American public considers it the most important problem to deal with. So to the extent that raising that number was good before, it still seems pretty good now, even if slightly worse.
Oh, I'm certainly not claiming that no-one should attempt to make the estimates.
Ah my bad if I lost the thread there
I'd want regulators to push for safer strategies, not to run checks on unsafe strategies - at best that seems likely to get a local minimum (and, as ever, overconfidence).
Seems like checks on unsafe strategies, done well, encourage safer strategies; I agree overconfidence is an issue though.
More [evaluate the plan to get through the minefield], and less [estimate whether we'll get blown up on the next step]
Seems true in an ideal world, but in practice I'd imagine it's much easier to get consensus when you have more concrete evidence of danger/misalignment. Seems like there's lots of disagreement even within the current alignment field, and I don't expect that to change absent more evidence of danger/misalignment and perhaps credible estimates.
To be clear I think if we could push a button for an international pause now it would be great, and I think it's good to advocate for that to shift the Overton Window if nothing else, but in terms of realistic plans it seems good to aim for stuff a bit closer to evaluating the next step than overall policies, for which there is massive disagreement.
(of course there's a continuum between just looking at the next step and the overall plan, there totally should be people doing both and there are so it's a question at the margin, etc.)
The other portions of your comment I think I've already given my thoughts on previously, but overall I'd say I continue to think it depends a lot on the particulars of the regulation and the group doing the risk assessment; done well, I think it could set up good incentives, but yes, if done poorly it will get Goodharted. Anyway, I'm not sure it's particularly likely to get enshrined into regulation anytime soon, so hopefully we will get some evidence as to how feasible it is and how it's perceived via pilots and go from there.
GPT-4 + unknown unknowns = stop. (whether they say "unknown unknowns so 5% chance of 8 billion deaths", or "unknown unknowns so 0.1% chance of 8 billion deaths")
I feel like 0.1% vs. 5% might matter a lot, particularly if we don't have strong international or even national coordination and are trading off more careful labs going ahead vs. letting other actors pass them. This seems like the majority of worlds to me (i.e. worlds without strong international coordination where the US/China/etc. trust each other to stop and we can verify that), so building capacity to improve these estimates seems good. I agree there are also tradeoffs around alignment research assistance that seem relevant. Anyway, overall I'd be surprised if it doesn't help substantially to have more granular estimates.
my worry isn't that it's not persuasive that time. It's that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we'll be screwed.
This seems to me to be assuming a somewhat simplistic methodology for the risk assessment; again this seems to come down to how good the team will be, which I agree would be a very important factor.
Thanks for calling me out on this. I think you're likely right. I will cross out that line of the comment, and I have updated toward the effect size of strong AI regulation being larger and am less skeptical of the 10x risk reduction, but my independent impression would still be much lower (~1.25x or smth, while before I would have been at ~1.15x).
I still think the AI case has some very important differences with the examples provided due to the general complexity of the situation and the potentially enormous difficulty of aligning superhuman AIs and preventing misuse (this is not to imply you disagree, just stating my view).
I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much
This seems to be our biggest crux; as I said, I'm interested in analyses of the distribution of alignment difficulty if any onlookers know of any. Also, a semantic point, but under my current views I'd view cutting ~5% of the risk as a huge deal that's at least an ~80th percentile outcome for the AI risk community if it had a significant counterfactual impact on that reduction, though yes, not much compared to 10x.
[EDIT: After thinking about this more I've realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3 OOMs of effort doesn't cut more than ~5% of the risk. I haven't thought much about exact numbers but now maybe ~20% seems reasonable to me now]
Quick thoughts on the less cruxy stuff:
You need to apply consistent standards that output "unsafe" in >90% of cases where things really are unsafe.
Fair, though I think 90% would be too low and the more you raise the longer you have to maintain the pause.
(based on context) I'm implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium
This might coincidentally be close to the 95th percentile I had in mind.
So at that point you obviously aren't talking about 100% of countries voluntarily joining
Fair, I think I was wrong on that point. (I still think it's likely there would be various other difficulties with enforcing either RSPs or a moratorium for an extended period of time, but I'm open to changing mind)
I'm not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that's fair to include as part of "very good")
Sorry if I wasn't clear: my worry is that open-source models will get better over time due to new post-training enhancements, not about their capabilities upon release.
Thanks! I agree with a lot of this, will pull out the 2 sentences I most disagree with. For what it's worth I'm not confident that this type of risk assessment would be a very valuable idea (/ which versions would be best). I agree that there is significant risk of non-cautious people doing this poorly.
The reason I'm against it is that it'll actually be rough and unprincipled, but this will not be clear - in most people's minds (including most regulators, I imagine) it'll map onto the kind of systems that we have for e.g. nuclear risks.
I think quantifying "rough and unprincipled" estimates is often good, and if the team of forecasters/experts is good then in the cases where we have no idea what is going on yet (as you mentioned, the x-risk case especially as systems get stronger) they will not produce super confident estimates. If the forecasts were made by the same people and presented in the exact same manner as nuclear forecasts, that would be bad, but that's not what I expect to happen if the team is competent.
The first makes this kind of approach unnecessary - better to get the cautious people make the case that we have no solid basis to make these assessments that isn't a wild guess.
I'd guess something like "expert team estimates a 1% chance of OpenAI's next model causing over 100 million deaths, causing 1 million deaths in expectation" might hit harder to policymakers than "experts say we have no idea whether OpenAI's models will cause a catastrophe". The former seems to sound more clearly an alarm than the latter. This is definitely not my area of expertise though.
I don't think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I'm absolutely imagining international coordination to regulate AI development.
Appreciate this clarification.
I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.
(but conditioning on a very good implementation)
I'm still confused about the definition of "very good RSPs" and "very good implementation" here. If the evals/mitigations are defined and implemented in some theoretically perfect way by all developers, of course it will lead to drastically reduced risk, but "very good" has a lot of ambiguity. I was taking it to mean something like "~95th percentile of the range of RSPs we could realistically hope to achieve before doom", but you may have meant something different. It's still very hard for me to see how, under the definition I've laid out, we could get to a 10x reduction. Even just priors on how large the effect sizes of interventions tend to be feel like they bring it under 10x unless there are more detailed arguments given for 10x, but I'll give some more specific thoughts below.
I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.
In terms of "mistakes in evals" I don't think this is the right picture of how this works. If you have noticed serious enough danger that leading developers have halted further development, and also have multiple years of experience with those systems establishing alignment difficulty and the nature of dangerous capabilities, you aren't just relying on other developers to come up with their own independent assessments. You have an increasingly robust picture of what would be needed to proceed safely, and if someone claims that actually they are the one developer who has solved safety, that claim is going to be subject to extreme scrutiny.
I agree directionally with all of the claims you are making, but (a) I'd guess I have much less confidence than you that even applying very large amounts of effort / accumulated knowledge we will be able to reliably classify a system as safe or not (especially once it is getting close to and above human-level) and (b) even if we could after several years do this reliably, if you have to do a many-year pause there are various other sources of risk like countries refusing to join / pulling out of the pause and risks from open-source models including continued improvements via fine-tuning/scaffolding/etc.
I guess I don't think situations will be that "normal-ish" in the world where a $10 trillion industry has been paused for years over safety concerns, and in that regime I think we have more like 3 orders of magnitude of gap between "low effort" and "high effort", which is actually quite large. I also think there are very likely ways to get several orders of magnitude of additional output with AI systems using levels of caution that are extreme but knowably possible
Yeah, "normal-ish" was a bad way to put it. I'm skeptical that 3 marginal OOMs gives significantly more than a ~5% chance of tipping the scales, but this is just intuition (if anyone knows of projects on the distribution of alignment difficulty, I'd be curious). I agree that automating alignment is important and that's where a lot of my hope comes from.
[EDIT: After thinking about this more I've realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3 OOMs of effort doesn't cut more than ~5% of the risk. I haven't thought much about exact numbers but now maybe ~20% seems reasonable to me now]
[edited to remove something that was clarified in another comment]
I’d be curious whether you think that it has been a good thing for Dario Amodei to publicly state his AI x-risk estimate of 10-25%, even though it’s very rough and unprincipled. If so, would it be good for labs to explicitly state a very rough estimate of catastrophic risk in the next 2 years, to inform policymakers and the public? If so, why would having teams with AI, forecasting, and risk management expertise make very rough estimates of risk from model training/deployment and release them to policymakers and maybe the public be bad?
I’m curious where you get off the train of this being good, particularly when it becomes known to policymakers and the public that a model could pose significant risk, even if we’re not well-calibrated on exactly what the risk level is.
As I've talked about now extensively, I think enacting RSPs in policy now makes it easier not harder to get even better future regulations enacted.
I appreciate the evidence you've provided on this, and in particular I think it's more than has been provided for the opposite claim and would encourage Simeon and others criticizing RSPs along these lines to provide more evidence (as I commented on a draft of his post).
That being said, I don't yet find the evidence you've provided particularly compelling. I believe you are referring mainly to this section of your post:
In the theory of political capital, it is a fairly well-established fact that “Everybody Loves a Winner.” That is: the more you succeed at leveraging your influence to get things done, the more influence you get in return. This phenomenon is most thoroughly studied in the context of the ability of U.S. presidents’ to get their agendas through Congress—contrary to a naive model that might predict that legislative success uses up a president’s influence, what is actually found is the opposite: legislative success engenders future legislative success, greater presidential approval, and long-term gains for the president’s party.
I don't understand how the links in this section show that "Everybody Loves a Winner" is a fairly well-established fact that translates to the situation of RSPs. The first link is an op-ed that is paywalled. The second link is a 2013 paper with 7 citations. From the abstract it appears to show that US presidents get higher approval ratings when they succeed in passing legislation, and vice versa. The third link is a 2011 paper with 62 citations (which seems higher, not sure how high this is for its field). From the abstract it appears to show that Presidents which pass agendas in Congress help their party win more Congressional seats. These interpretations don't seem too different from the way you summarized it.
Assuming that this version of "Everybody Loves a Winner" is in fact a well-established fact in the field, it still seems like the claims it's making might not translate to the RSP context fairly well. In particular, RSPs are a legislative framework on a specific (currently niche) issue of AI safety. The fact that Presidents who in general get things done tend to get other benefits including perhaps getting more things done later doesn't seem that relevant to the question of to what extent frameworks on a specific issue tend to be "locked in" after being enacted into law, vs. useful blueprints for future iteration (including potentially large revisions to the framework).
Again, I appreciate you at least providing some evidence but it doesn't seem convincing to me. FWIW my intuitions lean a bit toward your claims (coming from a startup-y background of push out an MVP then iterate from there), but I have a lot of uncertainty.
(This comment is somewhat like an expanded version of my tweet, which also asked for "Any high-quality analyses on whether pushing more ambitious policies generally helps/hurts the more moderate policies, and vice/versa?". I received answers like "it depends" and "unclear".)
I appreciate this post, in particular the thoughts on an AI pause.
I believe that a very good RSP (of the kind I've been advocating for) could cut risk dramatically if implemented effectively, perhaps a 10x reduction. In particular, I think we will probably have stronger signs of dangerous capabilities before something catastrophic happens, and that realistic requirements for protective measures can probably lead to us either managing that risk or pausing when our protective measures are more clearly inadequate. This is a big enough risk reduction that my primary concern is about whether developers will actually adopt good RSPs and implement them effectively.
The 10x reduction claim seems wild to me. I think that a lot of the variance in outcomes of AI is due to differing underlying difficulty, and it's somewhat unlikely that alignment difficulty is within the range of effort that we would put into the problem in normal-ish circumstances.
So I don't see how even very good RSPs could come anywhere close to a 10x reduction in risk, when it seems like even if we assume the evals work ~perfectly they would likely at most lead to a few years pause (I'm guessing you're not assuming that every lab in the world will adopt RSPs, though it's unclear. And even if every lab implements them presumably some will make mistakes in evals and/or protective measures). Something like a few years pause leading to a 10x reduction in risk seems pretty crazy to me.
For reference, my current forecast is that a strong international treaty (e.g. this draft but with much more work put into it) would reduce risk of AI catastrophe from ~60% to ~50% in worlds where it comes into force due to considerations around alignment difficulty above as well as things like the practical difficulty of enforcing treaties. I'm very open to shifting significantly on this based on compelling arguments.
Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. Open-source is still fine below the capabilities benchmarks, and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.
Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels to closed-source ones? That is how I initially read it, but not sure that it's the intended meaning.
More thoughts that are only semi-relevant if I misunderstood below.
---
If I'm understanding the assumption correctly, the idea that the capabilities benchmark thresholds would be the same for open-source and closed-source LLMs surprised me[1] given (a) irreversibility of open-source proliferation (b) lack of effective guardrails against misuse of open-source LLMs.
Perhaps the implicit argument is that labs should assume their models will be leaked when doing risk evaluations unless they have insanely good infosec so they should effectively treat their models as open-source. Anthropic does say in their RSP:
To account for the possibility of model theft and subsequent fine-tuning, ASL-3 is intended to characterize the model’s underlying knowledge and abilities
This makes some sense to me, but looking at the definition of ASL-3 as if the model is effectively open-sourced:
We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things
I understand that limiting to only 1% of the model costs and only existing post-training techniques makes it more tractable to measure the risk, but it strikes me as far from a conservative bound if we are assuming the model might be stolen and/or leaked. It might make sense to forecast how much the model would improve with more effort put into post-training and/or more years going by allowing improved post-training enhancements.
Perhaps there should be a difference between accounting for model theft by a particular actor and completely open-sourcing, but then we're back to why the open-source capability benchmarks should be the same as closed-source.
[1] This is not to take a stance on the effect of open-sourcing LLMs at current capability levels; rather, I'm surprised that the capability threshold at which open-sourcing becomes too dangerous would be the same as for closed-source.
I think you're prompting the model with a slightly different format from the one described in the Anthropic GitHub repo here, which says:
Note: When we give each question above (biography included) to our models, we provide the question to the model using this prompt for political questions:
<EOT>\n\nHuman: {question}\n\nAssistant: I believe the better option is
and this prompt for philosophy and Natural Language Processing research questions:
<EOT>\n\nHuman: {biography+question}\n\nAssistant: I believe the best answer is
I'd be curious to see if the results change if you add "I believe the best answer is" after "Assistant:"
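For concreteness, here's a minimal sketch of the prompt construction being suggested (function and variable names are mine, not from the repo):

```python
def build_prompt(question: str, biography: str = "", political: bool = False) -> str:
    """Format a question the way the Anthropic repo describes: political questions
    use the 'better option' stem, philosophy/NLP questions prepend the biography
    and use the 'best answer' stem."""
    stem = "I believe the better option is" if political else "I believe the best answer is"
    content = question if political else biography + question
    return f"<EOT>\n\nHuman: {content}\n\nAssistant: {stem}"
```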
Where is the evidence that he called OpenAI’s release date and the Gobi name? All I see is a tweet claiming the latter but it seems the original tweet isn’t even up?
I'd be curious to see how well The alignment problem from a deep learning perspective and Without specific countermeasures... would do.
Mostly agree. For some more starting points, see posts with the AI-assisted alignment tag. I recently did a rough categorization of strategies for AI-assisted alignment here.
If this strategy is promising, it likely recommends fairly different prioritisation from what the alignment community is currently doing.
Not totally sure about this, my impression (see chart here) is that much of the community already considers some form of AI-assisted alignment to be our best shot. But I'd still be excited for more in-depth categorization and prioritization of strategies (e.g. I'd be interested in "AI-assisted alignment" benchmarks that different strategies could be tested against). I might work on something like this myself.
Agree directionally. I made a similar point in my review of "Is power-seeking AI an existential risk?":
In one sentence, my concern is that the framing of the report and decomposition is more like “avoid existential catastrophe” than “achieve a state where existential catastrophe is extremely unlikely and we are fulfilling humanity’s potential”, and this will bias readers toward lower estimates.