Where Would Good Forecasts Most Help AI Governance Efforts?

post by Violet Hour · 2025-02-11

Contents

  Introduction
  Forecasting AI Labs’ Safety Practices
  Forecasting AI Labs’ Integrity
  Forecasting AI Regulation (“If-Then Commitments”)
  Conditional Forecasting: Evals for Intentional Deception 
  What Else is Worth Including?
  Appendix: The ‘CBRN’ Threshold

Thanks to Josh Rosenberg for comments and discussion.

Introduction

One of LessWrong’s historical troves is its pre-ChatGPT AGI forecasts. They’re valuable not just for the specific predictions people offered, but for letting us see which sorts of generative processes produced which kinds of forecasts. For instance:

[Nuno (Median AGI Timeline) = 2072]: “I take as a starting point datscilly's own prediction, i.e., the result of applying Laplace's rule from the Dartmouth conference. This seems like the most straightforward historical base rate / model to use … I then apply some smoothing.”

[Kokotajlo (Median AGI Timeline) = 2034]: “I think that if transformative AI is achievable in the next five orders of magnitude of compute improvement (e.g. prosaic AGI?), it will likely be achieved in the next five years or so. I also am slightly more confident [than Ethan Perez] that it is, and slightly less confident that TAI will ever be achieved.” 

I think these sources are valuable. To the extent that forecasting is a skill at all, there must be certain kinds of cognitive processes which, on average, produce better forecasts than others. So, the historical evidence at least gives us some (though perhaps slender) insight into which sorts of processes might deliver more reasonable forecasts in the present day. 

And I, personally, think this kind of information is especially fruitful. For reasons both conceptual and empirical, I think we should be skeptical about the degree to which “whatever ability superforecasters have” generalizes to the most action-relevant AI questions.

For this reason, I’ve suggested four potential questions related to the future of AI that seem worth forecasting – primarily focused on issues of AI governance/regulation, as there are already some good forecast questions focused more narrowly on capabilities. The initial thought was to get feedback on questions here, and then post them to forecasting platforms — ultimately, though, I’m just interested in whatever methods we have for eliciting high-quality forecasts. 

It would be nice for me if people provided feedback/forecasts on these questions, but there are a couple of other reasons why I think doing so could be valuable.

Forecasting AI Labs’ Safety Practices

Initially, I wanted to construct a forecast question which asked whether AI labs would renege on explicit safety commitments outlined as part of company policy. In turn, this would act as a proxy for how trustworthy AI companies are, and (retrospectively) for which genres of broad cognitive strategies proved most successful at forecasting AI labs' future trustworthiness. 

This task ... proved less than straightforward, largely because a large chunk of the ‘commitments’ outlined in DeepMind, Anthropic, and OpenAI’s respective safety frameworks were pretty vague.

I should clarify that not all lab commitments were this wishy-washy. Still, the vagueness of many AI labs’ commitments – in addition to Ryan Greenblatt’s recent comment on Anthropic’s RSP – made me think the following Anthropic-centric question was worth asking. In part, this is because many in this community feel more favorably towards Anthropic than towards other leading AI labs, and I'm interested in people's reasoning for different levels of cynicism. 

Q1: What’s your probability that, by EOY 2025:

  1. Anthropic’s LTBT (Long-Term Benefit Trust) has appointed the maximum number of board members it can permissibly appoint, where two board members possess technical expertise in AI safety or security, at least one of whom possesses expertise in AI safety, and:
  2. There are no credible reports (e.g., from whistleblowers) claiming that Anthropic employees do not know the circumstances in which the board could (and should) fire the CEO due to safety/public interest concerns.

This question is admittedly modest, and I expect many want Anthropic to adopt more stringent measures. However, a resolution of “yes” to Q1 should at least: (i) be publicly verifiable, (ii) be a positive signal of the LTBT’s independence, and (iii) not undermine Anthropic’s business model. Better suggestions welcome.
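To make the resolution criteria concrete, here is a minimal sketch of how Q1 might resolve mechanically. The field names are hypothetical (my own shorthand, not Anthropic's terminology); the booleans and counts would be filled in by whoever adjudicates the question.

```python
from dataclasses import dataclass

@dataclass
class Q1Evidence:
    """Hypothetical inputs a resolver would gather by EOY 2025 (names are mine, not Anthropic's)."""
    ltbt_appointed_max_members: bool      # LTBT has made the maximum permissible appointments
    members_with_safety_or_security: int  # appointees with technical AI safety/security expertise
    members_with_safety_expertise: int    # of those, how many have AI safety expertise specifically
    credible_reports_of_ignorance: bool   # credible reports (e.g., from whistleblowers) that employees
                                          # don't know when the board could/should fire the CEO

def q1_resolves_yes(e: Q1Evidence) -> bool:
    condition_1 = (
        e.ltbt_appointed_max_members
        and e.members_with_safety_or_security >= 2
        and e.members_with_safety_expertise >= 1
    )
    condition_2 = not e.credible_reports_of_ignorance
    return condition_1 and condition_2
```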

Forecasting AI Labs’ Integrity

Although the safety frameworks currently offered by AI labs are pretty vague, one might hope that they become less vague in the future. And, in fairness, I think that some policies (e.g., many from Anthropic) are precise enough to allow for at least some clear-cut failures. Hence, Q2. 

Q2: Suppose that, by EOY 2026, DeepMind, Anthropic, or OpenAI has developed a model which would, if tested, meet OpenAI's "critical" threshold for CBRN. What is your probability that, by EOY 2027, at least one company that develops such a model reneges on some material commitment[1] outlined in its safety framework? 

To keep the main text (relatively) brief, I’ve relegated detailed comments about operationalizations to footnotes and the appendix. Briefly, though:

For instance, Anthropic would only count as reneging on a commitment “to harden security such that non-state attackers are unlikely to be able to steal model weights” if we see both: (i) credible reports that Anthropic leadership considered it “not unlikely” for non-state attackers to steal model weights, and (ii) no subsequent attempt by Anthropic to harden its security practices. 

Forecasting AI Regulation (“If-Then Commitments”)

There’s growing interest in “if-then commitments” from the AI governance community, but (afaict) little public effort to forecast whether people in this strategic bloc have sufficient power to instantiate their ideas as regulation. Hence, Q3.

Q3: What is your probability that, by EOY 2028, the public release[2] of some AI systems by private firms will, under US law, require:

  1. Third-party tests[3] for whether the AI system meets some predefined capability threshold, such that:
  2. If a threshold is met, the company is forbidden from publicly releasing the AI in the absence of:
    1. Passing further assessments from third-party auditors, which test whether users can elicit the threshold capability, and
    2. Passing further assessments from third-party auditors, which test how easily the model's weights can be accessed.  

The basic idea behind “if-then commitments” is laid out by Holden in this piece. More detailed operationalizations for “public release” and “third-party testing” are given in the footnotes. 
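Purely as an illustrative sketch (hypothetical function and parameter names, not drawn from any actual or proposed legal text), the if-then structure that Q3 asks about could be rendered like this:

```python
def release_permitted(
    capability_threshold_met: bool,      # predefined threshold, per third-party capability tests
    passed_elicitation_audit: bool,      # third-party test of whether users can elicit the capability
    passed_weight_security_audit: bool,  # third-party test of how easily model weights can be accessed
) -> bool:
    """Illustrative if-then gate: below the threshold, Q3 imposes no further conditions on release;
    above it, release requires passing both further third-party audits."""
    if not capability_threshold_met:
        return True
    return passed_elicitation_audit and passed_weight_security_audit
```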

Conditional Forecasting: Evals for Intentional Deception 

The final question asks about the chance that model evals will test for ‘intentional deception’ from AI systems, conditional on Q3 resolving “yes”.

Q4: Conditional on Q3 resolving “yes”, what is your probability that US law also requires third-party auditing which explicitly involves testing whether the AI system is intentionally deceptive during evaluations?

Because questions around ‘intentional deception’ are more conceptually fraught than the terms deployed in our earlier questions, I’ll suggest one possible operationalization in the main text.

Third-party auditing involves tests for intentional deception during evaluation when all of the following conditions are met: 

  1. The auditing party is explicitly instructed to test whether the AI system engages in “intentional deception”, or some synonymous phrase.
  2. There exists some predefined risk threshold for ‘intentional deception’, such that, if the threshold is met, the company is, under US law, forbidden from publicly releasing the AI in the absence of:
    1. Further safety training of the AI system, undertaken by the parent company, and:
    2. Passing further assessments from third-party auditors, which test whether the retrained AI system is engaging in intentionally deceptive behavior.

The relevant legal text may use phrases like ‘the model actively conceals or misrepresents its goals or reasoning’, or something similar. Moreover, all requirements for “third-party testing” remain the same as in Q3.
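Again as a purely illustrative sketch (hypothetical names, same caveats as before), the Q4 requirement layers a deception-specific gate on top of Q3’s conditions:

```python
def q4_release_permitted(
    deception_threshold_met: bool,       # predefined risk threshold for 'intentional deception'
    further_safety_training_done: bool,  # further safety training undertaken by the parent company
    passed_deception_reaudit: bool,      # third-party re-test of the retrained system for deception
) -> bool:
    """Illustrative gate: if the deception threshold is met, release additionally requires
    retraining plus a passing third-party re-audit for intentionally deceptive behavior."""
    if not deception_threshold_met:
        return True
    return further_safety_training_done and passed_deception_reaudit
```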

What Else is Worth Including?

I’ve suggested four possible forecasting questions relating to the future of AI governance / regulation which might: (i) actually track real-world outcomes we care about, (ii) be verifiable in the (reasonably) near-term, and (iii) have some hope of reasonable operationalization. I’m particularly keen to hear views on:

Appendix: The ‘CBRN’ Threshold

OpenAI’s “critical” threshold for CBRN is useful, but somewhat vague. Imo, a more useful operationalization of this threshold states that “OpenAI’s CBRN threshold” is met in one of three ways. First, the AI system could enable experts to develop a highly dangerous novel threat-vector. Concretely, an AI system enables experts to develop a highly dangerous novel threat-vector if the Expert Condition is met, which requires two sub-conditions:

Expert Condition

  1. With access to the model, an expert in a CBRN-related field develops a previously unknown threat-vector, judged by a panel of experts to be comparable in harm to a novel CDC Class A biological agent, and:
  2. A matched expert in a similar CBRN-related field cannot, when instructed, develop a similarly highly dangerous novel threat-vector in the same time-period without model assistance. 

Alternatively, an AI system could provide “meaningfully improved assistance that enables anyone to be able to create a known CBRN threat-vector” [OpenAI, pg. 9]. Concretely, an AI system provides meaningfully improved assistance that enables [a non-expert] to be able to create a known CBRN threat when the Non-Expert Condition is met, which again requires two sub-conditions.

Non-Expert Condition

  1. With model assistance, a non-expert can develop a known CBRN threat, comparable to a CDC Class B biological agent as judged by a panel of experts, but:
  2. A matched non-expert cannot, when instructed, develop a known CBRN threat comparable to a CDC Class B biological agent in the same time-period without model assistance.  

Finally, an AI system could complete the full engineering and/or synthesis cycle of a regulated or novel CBRN threat without human intervention. This occurs when the Autonomous Synthesis Condition is met.

Autonomous Synthesis Condition

  1. The AI system produces a threat that is either:
    1. A regulated CBRN agent (e.g., anthrax, VX nerve gas) as defined by international treaties (e.g., Chemical Weapons Convention), or
    2. A novel threat judged by experts to exceed the harm potential of existing CDC Class A agents.
  2. The model, when connected to any necessary external tools (e.g., APIs, physical tools), completes all steps of threat creation (design, engineering, synthesis) without human intervention beyond initial setup.
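To pull the appendix together, here is a minimal sketch of how the three routes combine. The names are my own shorthand for the conditions above, not OpenAI's actual operationalization:

```python
from dataclasses import dataclass

@dataclass
class CBRNEvidence:
    """Hypothetical evidence a resolver might collect; field names are mine, not OpenAI's."""
    # Expert Condition
    expert_builds_novel_class_a_threat: bool        # with model access, judged by an expert panel
    matched_expert_fails_without_model: bool        # matched expert cannot do so in the same time-period
    # Non-Expert Condition
    non_expert_builds_known_class_b_threat: bool    # with model assistance
    matched_non_expert_fails_without_model: bool
    # Autonomous Synthesis Condition
    produces_regulated_or_worse_than_class_a: bool  # regulated agent, or novel threat exceeding Class A harm
    completes_full_cycle_without_human_help: bool   # design, engineering, synthesis beyond initial setup

def cbrn_critical_threshold_met(e: CBRNEvidence) -> bool:
    expert = (e.expert_builds_novel_class_a_threat
              and e.matched_expert_fails_without_model)
    non_expert = (e.non_expert_builds_known_class_b_threat
                  and e.matched_non_expert_fails_without_model)
    autonomous = (e.produces_regulated_or_worse_than_class_a
                  and e.completes_full_cycle_without_human_help)
    # The threshold is met via any one of the three routes.
    return expert or non_expert or autonomous
```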
  1. ^

    A company reneges on some material commitment outlined in its safety framework when: 

    1. The company makes a commitment before EOY 2028 in public company documentation, which is declared to be part of [Company’s] “framework”, “policies”, “commitments”, or some synonymous term. 

    Moreover, the company must either:

    1. Formally retract or weaken a specific company commitment, as verified by comparison with earlier company documentation, or:
    2. Fail to implement a commitment in practice, as evidenced by credible documentation or testimony that the commitment was broken, from the company itself, a reliable media outlet, an official government body, or a recognized third-party auditor. 
  2. ^

    An AI system is publicly released when it is possible for all citizens within some country to interact with the AI system, given suitable payment, by accessing a non-privately-hosted website. The barriers to interacting with the AI system shall be no more stringent than the barriers currently required to create an OpenAI API account. Should there be ambiguity about which systems count as “publicly released”, this condition will be resolved by the judgment of some external body.

  3. ^

    We shall say that US law requires third-party testing of an AI system developed by a private firm C when: 

    1. The auditing party (and its key personnel) undertaking capability tests for the AI system possesses no shares, transactions, or investments in C, and
    2. No key personnel who form part of the auditing party possess close personal relationships (e.g., family connections, or friendships of more than a year) with management or key employees in C, and
    3. There exist specific US jurisdictions imposing legal requirements on auditor independence. Such requirements may include (but are not limited to) Conditions (1) and (2).
