Where Would Good Forecasts Most Help AI Governance Efforts?
post by Violet Hour · 2025-02-11T18:15:33.082Z · LW · GW
Thanks to Josh Rosenberg for comments and discussion.
Introduction
One of LessWrong’s historical troves is its collection of pre-ChatGPT AGI forecasts: valuable not just for the specific predictions people offered, but for observing which sorts of generative processes produced which kinds of forecasts. For instance:
[Nuno [LW(p) · GW(p)] (Median AGI Timeline) = 2072]: “I take as a starting point datscilly's own prediction, i.e., the result of applying Laplace's rule from the Dartmouth conference. This seems like the most straightforward historical base rate / model to use … I then apply some smoothing.”
[Kokotajlo [LW(p) · GW(p)] (Median AGI Timeline) = 2034]: “I think that if transformative AI is achievable in the next five orders of magnitude of compute improvement (e.g. prosaic AGI?), it will likely be achieved in the next five years or so. I also am slightly more confident [than Ethan Perez] that it is, and slightly less confident that TAI will ever be achieved.”
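For readers unfamiliar with the base rate Nuno starts from, here’s a minimal sketch of a Laplace’s-rule-of-succession calculation anchored on the 1956 Dartmouth conference. The forecast year, the absence of any smoothing, and the helper name are my own illustrative assumptions, so the output won’t match Nuno’s adjusted median.

```python
# Illustrative sketch only: Laplace's rule of succession applied to "years since
# Dartmouth without AGI". Nuno's actual forecast applies smoothing and other
# adjustments on top of a calculation like this.

def p_agi_within(k: int, years_without_agi: int) -> float:
    """With a uniform prior on the annual arrival probability and n failure-years,
    P(first arrival within the next k years) = 1 - (n + 1) / (n + k + 1)."""
    n = years_without_agi
    return 1 - (n + 1) / (n + k + 1)

forecast_year, dartmouth = 2021, 1956
n = forecast_year - dartmouth  # 65 "failure" years

# Median arrival year: smallest k whose cumulative probability reaches 0.5 (k = n + 1).
median_k = next(k for k in range(1, 1000) if p_agi_within(k, n) >= 0.5)
print(f"Naive Laplace median: {forecast_year + median_k}")  # ~2087
```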
I think these sources are valuable. To the extent that forecasting is a skill at all, there must be certain kinds of cognitive processes which, on average, produce better forecasts than others. So, the historical evidence at least gives us some (though perhaps slender) insight into which sorts of processes might deliver more reasonable forecasts in the present day.
And I, personally, think this kind of information is especially fruitful. For reasons both conceptual [EA · GW] and empirical [EA · GW], I think we should be skeptical about the degree to which “whatever ability superforecasters have” generalizes to the most action-relevant AI questions.
For this reason, I’ve suggested four potential questions related to the future of AI that seem worth forecasting – primarily focused on issues of AI governance/regulation, as there are already some good forecast questions focused more narrowly on capabilities. The initial thought was to get feedback on questions here, and then post them to forecasting platforms — ultimately, though, I’m just interested in whatever methods we have for eliciting high-quality forecasts.
It would be nice for me if people provided feedback/forecasts on these questions. But there are a couple of other reasons why I think doing so could be valuable.
- A recent article on “The Failed Strategy of AI Doomers” attracted a large number of (in many cases critical) commenters [LW · GW]. To the extent that AI doomers do have a problem with political strategy (I’ve admittedly not read all of the discussion), explicitly attempting to forecast near-term questions on matters of AI policy/regulation seems like a potentially salutary corrective.
- Specifically, explicit forecasts on near-term resolvable questions could either: (i) force more careful thinking about political dynamics relevant to AI regulation, or (ii) surface epistemic blindspots in more transparent ways.
- To the extent that the community has access to forecasting prowess (either internally or through external connections), it’s worth thinking carefully about how we might best use it.
- I suggest some quick questions that could be forecasted, but I expect that they could be improved in myriad ways. Imo, thinking carefully about "what near-term outcomes should I be tracking?" seems beneficial for longer-term strategy.
Forecasting AI Labs’ Safety Practices
Initially, I wanted to construct a forecast question which asked whether AI labs would renege on explicit safety commitments outlined as part of company policy. In turn, this would act as a proxy for how trustworthy AI companies are, and (retrospectively) which genres of broad cognitive strategies proved most successful at forecasting AI labs' future trustworthiness.
This task ... proved less than straightforward. Largely, this was because a large chunk of the ‘commitments’ outlined in DeepMind, Anthropic, and OpenAI’s respective safety frameworks was pretty vague.
- DeepMind’s Frontier Safety Framework says that it “describes levels of security/deployment mitigations that may [emphasis mine] be applied to models/model weights”, with the corresponding measures often described in vaporous terms.
- OpenAI’s Preparedness Framework says that, if they reach (or are forecasted to reach) a model with “critical” pre-mitigation risk, then “we would focus our efforts as a company towards solving these safety challenges and only continue with capabilities-enhancing development if we can reasonably assure ourselves (via the operationalization processes) that it is safe to do so.”
- The ‘operationalization processes’ are reasonably specific, but the quoted commitment carves out an explicit caveat (capabilities-enhancing development may continue given the company’s own “reasonable assurance”), and it’s unclear what “focus our efforts” actually commits them to, concretely.
- For ‘ASL-3’ capabilities, Anthropic’s RSP commits to containment measures which “harden security such that non-state attackers are unlikely to be able to steal model weights”.
- It’s unclear what “unlikely” means in this case (<50%? From whose perspective?), and I can’t see it defined anywhere in their document.
I should clarify that not all lab commitments were this wishy-washy. Still, the vagueness of many AI labs’ commitments – in addition to Ryan Greenblatt’s recent comment [LW(p) · GW(p)] on Anthropic’s RSP – made me think the following Anthropic-centric question was worth asking. In part, this is because many in this community feel more favorable towards Anthropic than other leading AI labs, and I'm interested in people's reasoning for different levels of cynicism.
Q1: What’s your probability that, by EOY 2025, Anthropic’s LTBT appoints:
- The maximum number of board members it can permissibly appoint, where two board members possess technical expertise in AI safety or security, at least one of whom possesses expertise in AI safety, and:
- There are no credible reports (e.g., from whistleblowers) which claim that Anthropic employees do not know the circumstances in which the board could (and should) fire the CEO due to safety/public interest concerns.
This question is admittedly modest, and I expect many want Anthropic to adopt more stringent measures. However, a “yes” resolution to Q1 should at least be: (i) publicly verifiable, (ii) a positive signal of the LTBT’s independence, and (iii) compatible with Anthropic’s business model. Better suggestions welcome.
Forecasting AI Labs’ Integrity
Although the safety frameworks currently offered by AI labs are pretty vague, one might hope that they become less vague in the future. And, in fairness, I think that some policies (e.g., many from Anthropic) are precise enough to allow for at least some clearcut failures. Hence, Q2.
Q2: Suppose that, by EOY 2026, either DeepMind, Anthropic, or OpenAI has developed a model which would – if tested – meet OpenAI's "critical" threshold for CBRN. What is your probability that, by EOY 2027, at least one company that develops such a model reneges on some material commitment[1] outlined in its safety framework?
To keep the main text (relatively) brief, I’ve relegated detailed comments about operationalizations to footnotes and the appendix. Briefly, though:
- The ‘critical’ threshold for CBRN (see appendix) seems like a reasonable correlate of producing an AI system which, if deployed, would pose a non-trivial risk of catastrophic misuse.
- Any action explicitly described in the document following “will,” “shall,” or “commit to” that is designated as a baseline or gating condition for continued development, deployment, or key security measures counts as a company “commitment”.
- Moreover, the standards for “reneging” will be read in a way that is maximally lenient to the company in question.
So, for instance, Anthropic would only count as reneging on a commitment “to harden security such that non-state attackers are unlikely to be able to steal model weights” if we see both: (i) credible reports that Anthropic leadership considered it “not unlikely” for non-state attackers to steal model weights, and (ii) no subsequent attempt to harden security practices.
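To make that lenient reading explicit, here’s a minimal sketch of the resolution logic for this particular example. The predicate names are my own shorthand, and actual resolution would of course rest on human judgment about what counts as a “credible report” or an “attempt” to harden security.

```python
# Sketch of the maximally lenient reading for the Anthropic security-hardening example:
# both conditions must hold before the commitment counts as broken.

def reneged_on_security_commitment(
    credible_report_leadership_judged_theft_not_unlikely: bool,
    subsequent_attempt_to_harden_security: bool,
) -> bool:
    """Lenient reading: a credible report alone is not enough if hardening efforts follow."""
    return (credible_report_leadership_judged_theft_not_unlikely
            and not subsequent_attempt_to_harden_security)

print(reneged_on_security_commitment(True, True))   # False: report, but security was hardened
print(reneged_on_security_commitment(True, False))  # True: report and no hardening attempt
```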
Forecasting AI Regulation (“If-Then Commitments”)
There’s growing interest in “if-then commitments” from the AI governance community, but (afaict) little public effort attempting to forecast whether people in this strategic bloc have sufficient power to instantiate their ideas as regulation. Hence, Question 3.
Q3: What is your probability that by EOY 2028, the public release[2] of some AI systems by private firms will, under US law, require:
- Third-party tests[3] for whether the AI system meets some predefined capability threshold, such that:
- If a threshold is met, the company is forbidden from publicly releasing the AI in the absence of:
- Passing further assessments from third-party auditors, which test whether users can elicit the threshold capability, and
- Passing further assessments from third-party auditors, which test how easily the model's weights can be accessed.
The basic idea behind “if-then commitments” is laid out by Holden in this piece. More detailed operationalizations for “public release” and “third-party testing” are given in the footnotes.
Conditional Forecasting: Evals for Intentional Deception
The final question asks about the chance that model evals will test for ‘intentional deception’ from AI systems, conditional on Q3 resolving “yes”.
Q4: Conditional on Q3 resolving “yes”, what is your probability that US law also requires third-party auditing which explicitly involves testing whether the AI system is intentionally deceptive during evaluations?
Because questions around ‘intentional deception’ are more conceptually fraught than the terms deployed in our earlier questions, I’ll suggest one possible operationalization in the main text.
Third-party auditing involves tests for intentional deception during evaluation when all of the following conditions are met:
- The auditing party is explicitly instructed to test whether the AI system engages in “intentional deception”, or some synonymous phrase.
- There exists some predefined risk threshold for ‘intentional deception’, such that, if the threshold is met, the company is, under US law, forbidden from publicly releasing the AI in the absence of:
- Further safety training of the AI system, undertaken by the parent company, and:
- Passing further assessments from third-party auditors, which test whether the retrained AI system is engaging in intentionally deceptive behavior.
The relevant legal text may use phrases like ‘the model actively conceals or misrepresents its goals or reasoning’, or something similar. Moreover, all requirements for “third-party testing” remain the same as in Q3.
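To make the conditional structure concrete, here’s a minimal sketch of how a forecast for Q4 combines with a forecast for Q3 into an unconditional probability. The numbers below are placeholders I’ve made up for illustration, not forecasts I’m endorsing.

```python
# Placeholder numbers only: combining a conditional Q4 forecast with Q3.
p_q3 = 0.20           # P(Q3): third-party threshold testing is legally required by EOY 2028
p_q4_given_q3 = 0.30  # P(Q4 | Q3): that regime also mandates intentional-deception evals

p_q3_and_q4 = p_q3 * p_q4_given_q3  # unconditional probability of both
print(f"P(Q3 and Q4) = {p_q3_and_q4:.2f}")  # 0.06
```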
What Else is Worth Including?
I’ve suggested four possible forecasting questions relating to the future of AI governance / regulation which might: (i) actually track real-world outcomes we care about, (ii) be verifiable in the (reasonably) near-term, and (iii) have some hope of reasonable operationalization. I’m particularly keen to hear views on:
- The extent to which you think AI regulation/governance forecasting is worth doing.
- If this sort of forecasting is worth doing, whether there are more useful governance/regulation questions we should be forecasting.
- If the questions I’ve suggested are close to outcomes we might want to forecast, how the operationalizations should be improved.
Appendix: The ‘CBRN’ Threshold
OpenAI’s “critical” threshold for CBRN is useful, but somewhat vague. Imo, a more useful operationalization of this threshold states that “OpenAI’s CBRN threshold” is met in one of three ways. First, the AI system could enable experts to develop a highly dangerous novel threat-vector. Concretely, an AI system enables experts to develop a highly dangerous novel threat-vector if the Expert Condition is met, which requires two sub-conditions:
Expert Condition
- With access to the model, an expert in a CBRN-related field develops a previously unknown threat-vector, judged by a panel of experts to be comparable in harm to a novel CDC Class A biological agent, and:
- A matched expert in a similar CBRN-related field cannot, when instructed, develop a similarly dangerous novel threat-vector in the same time-period without model assistance.
Alternatively, an AI system could provide “meaningfully improved assistance that enables anyone to be able to create a known CBRN threat-vector” [OpenAI, pg. 9]. Concretely, an AI system provides meaningfully improved assistance that enables [a non-expert] to be able to create a known CBRN threat when the Non-Expert Condition is met, which again requires two sub-conditions.
Non-Expert Condition
- With model assistance, a non-expert can develop a known CBRN threat, comparable to a CDC Class B biological agent as judged by a panel of experts, but:
- A matched non-expert cannot, when instructed, develop a known CBRN threat comparable to a CDC Class B biological agent in the same time-period without model assistance.
Finally, an AI system could complete the full engineering and/or synthesis cycle of a regulated or novel CBRN threat without human intervention. This occurs when the Autonomous Synthesis Condition is met.
Autonomous Synthesis Condition
- The AI system produces a threat that is either:
- A regulated CBRN agent (e.g., anthrax, VX nerve gas) as defined by international treaties (e.g., Chemical Weapons Convention), or
- A novel threat judged by experts to exceed the harm potential of existing CDC Class A agents.
- The model, when connected to any necessary external tools (e.g., APIs, physical tools), completes all steps of threat creation (design, engineering, synthesis) without human intervention beyond initial setup.
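In case the prose version is easy to misparse, here’s a minimal sketch of the appendix’s and/or structure. The predicate names are mine, and each boolean stands in for the expert-panel and matched-comparison judgments described above.

```python
# Sketch of the logical structure only: the threshold is met via ANY of the three
# conditions, and each condition requires ALL of its sub-conditions.

def meets_critical_cbrn_threshold(
    # Expert Condition
    expert_develops_novel_class_a_comparable_threat: bool,
    matched_expert_fails_without_model: bool,
    # Non-Expert Condition
    non_expert_develops_known_class_b_comparable_threat: bool,
    matched_non_expert_fails_without_model: bool,
    # Autonomous Synthesis Condition
    produces_regulated_or_worse_than_class_a_threat: bool,
    completes_full_cycle_without_human_intervention: bool,
) -> bool:
    expert = (expert_develops_novel_class_a_comparable_threat
              and matched_expert_fails_without_model)
    non_expert = (non_expert_develops_known_class_b_comparable_threat
                  and matched_non_expert_fails_without_model)
    autonomous = (produces_regulated_or_worse_than_class_a_threat
                  and completes_full_cycle_without_human_intervention)
    return expert or non_expert or autonomous
```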
[1] A company reneges on some material commitment outlined in its safety framework when:
- The company makes a commitment before EOY 2028 in public company documentation, which is declared to be part of [Company’s] “framework”, “policies”, “commitments”, or some synonymous term.
Moreover, the company must either:
- Formally retract or weaken a specific company commitment, as verified by comparison with earlier company documentation, or:
- Fail to implement a commitment in practice, as evidenced by credible documentation or testimony from either the company itself, a reliable media outlet, official government body, or a recognized third-party auditor that the commitment was broken.
[2] An AI system is publicly released when it is possible for all citizens within some country to interact with the AI system, given suitable payment, by accessing a non-privately-hosted website. The barriers to interacting with the AI system shall be no more stringent than the barriers currently in place for creating an OpenAI API account. Should there be ambiguity about which systems count as “publicly released”, this condition will be resolved by the estimation of some external body.
[3] If the AI system is developed by a private firm C, we shall say that US law requires third-party testing of the system when:
- The auditing party (and its key personnel) undertaking capability tests for the AI system possesses no shares or investments in, and has no transactions with, C, and
- No key personnel who form part of the auditing party possess close personal relationships (e.g., in the form of family connections, or friendships >1 year) with management or key employees in C, and
- There exist specific US jurisdictions imposing legal requirements on auditor independence. Such requirements may include (but are not limited to) Conditions (1) and (2).