AI companies aren't really using external evaluators

zach-stein-perlman

AI companies aren't really using external evaluators

post by Zach Stein-Perlman · 2024-05-24T16:01:21.184Z · LW · GW · 15 comments

17 comments

From my new blog: AI Lab Watch. All posts will be crossposted to LessWrong. Subscribe on Substack.

Many AI safety folks think that METR is close to the labs, with ongoing relationships that grant it access to models before they are deployed. This is incorrect. METR (then called ARC Evals) did pre-deployment evaluation for GPT-4 and Claude 2 in the first half of 2023, but it seems to have had no special access since then.^[1] Other model evaluators also seem to have little access before deployment.

Clarification: there are many kinds of audits. This post is about model evals for dangerous capabilities. But I'm not aware of the labs using other kinds of audits to prevent extreme risks, excluding normal security/compliance audits.

Frontier AI labs' pre-deployment risk assessment should involve external model evals for dangerous capabilities.^[2] External evals can improve a lab's risk assessment and—if the evaluator can publish its results—provide public accountability.

The evaluator should get deeper access than users will get.

To evaluate threats from a particular deployment protocol, the evaluator should get somewhat deeper access than users will — then the evaluator's failure to elicit dangerous capabilities is stronger evidence that users won't be able to either.^[3] For example, the lab could share a version of the model without safety filters or harmlessness training, and ideally allow evaluators to fine-tune the model.
To evaluate threats from model weights being stolen or released, the evaluator needs deep access, since someone with the weights has full access.

The costs of using external evaluators are unclear.

Anthropic said that collaborating with METR "requir[ed] significant science and engineering support on our end"; it has not clarified why. And even if providing deep model access or high-touch support is a hard engineering problem, I don't understand how sharing API access—including what users will receive and a no-harmlessness no-filters version—could be.
Sharing model access pre-deployment increases the risk of leaks, including of information about products (modalities, release dates), information about capabilities, and demonstrations of models misbehaving.

Independent organizations that do model evals for dangerous capabilities include METR, the UK AI Safety Institute (UK AISI), and Apollo. Based on public information, there's only one recent instance of a lab giving access to an evaluator pre-deployment—Google DeepMind sharing with UK AISI—and that sharing was minimal (see below).

What the labs say they're doing on external evals before deployment:

DeepMind^[4]
- It shared Gemini 1.0 Ultra and Gemini 1.5 Pro with unspecified external groups apparently including UK AISI to test for dangerous capabilities before deployment. But it didn't share deep access: it only shared a system with safety fine-tuning (and for 1.0 Ultra, safety filters) and it didn't allow evaluators to fine-tune the model. It shared high-level results from 1.5 Pro testing.
- Its Frontier Safety Framework says "We will . . . explore how to appropriately involve independent third parties in our risk assessment and mitigation processes."
Anthropic
- Currently nothing
- Its Responsible Scaling Policy mentions "external audits" as part of "Early Thoughts on ASL-4"
- It shared Claude 2 with METR in the first half of 2023
OpenAI
- Currently nothing
- Its Preparedness Framework does not mention external evals before deployment. The closest thing it says is "Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties."
- It shared GPT-4 with METR in the first half of 2023
- It said "We think it's important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year." That was in February 2023; I do not believe it elaborated (except to mention that it shared GPT-4 with METR).
All notable American labs joined the White House voluntary commitments, which include "external red-teaming . . . in areas including misuse, societal risks, and national security concerns, such as bio, cyber, [autonomous replication,] and other safety areas." External red-teaming does not substitute for external model evals; see below.
- DeepMind said it did lots of external red-teaming for Gemini.
- Anthropic said it did external red-teaming for CBRN capabilities. It has also written about using external experts to assess bio capabilities.
- OpenAI said it did lots of external red-teaming for GPT-4. It has also written about using external experts to assess bio capabilities.
- Meta said it did external red-teaming for CBRNE capabilities.
- Microsoft said it's "building out external red-teaming capacity . . . . The topics covered by such red team testing will include testing of dangerous capabilities, including related to biosecurity and cybersecurity."

Related miscellanea:

External red-teaming is not external model evaluation. External red-teaming generally involves sharing the model with several people with expertise relevant to a dangerous capability (e.g. bioengineering) who open-endedly try to elicit dangerous model behavior for ~10 hours each. External model evals involves sharing with a team of experts at eliciting capabilities, to perform somewhat automated and standardized evals suites that they've spent ~10,000 hours developing.

Labs' commitments to share pre-deployment access with UK AISI are unclear.^[5]

This post is about sharing model access before deployment for risk assessment. Labs should also share deeper access with safety researchers (during deployment). For example, some safety researchers would really benefit from being able to fine-tune GPT-4, Claude 3 Opus, or Gemini, and my impression is that the labs could easily give safety researchers fine-tuning access. More speculatively, interpretability researchers could send a lab code and the lab could run it on private models and send the results to the researchers, achieving some benefits of releasing weights with much less downside.^[6]

Everything in this post applies to external deployment. It will also be important to do some evals during training and before internal deployment, since lots of risk might come from weights being stolen or the lab using AIs internally to do AI development.

Labs could be bound by external evals, such that they won't deploy a model until a particular eval says it's safe. This seems unlikely to happen (for actually meaningful evals) except by regulation. (I don't believe any existing evals would be great to force onto the labs, but if governments were interested, evals organizations could focus on creating such evals.)

Thanks to Buck Shlegeris, Eli Lifland, Gabriel Mukobi, and an anonymous human for suggestions. They don't necessarily endorse this post.

Subscribe on Substack.

^{^}
METR's homepage says:
We have previously worked with Anthropic, OpenAI, and other companies to pilot some informal pre-deployment evaluation procedures. These companies have also given us some kinds of non-public access and provided compute credits to support evaluation research.
We think it’s important for there to be third-party evaluators with formal arrangements and access commitments - both for evaluating new frontier models before they are scaled up or deployed, and for conducting research to improve evaluations.
We do not yet have such arrangements, but we are excited about taking more steps in this direction.
^{^}
GovAI: Schuett et al. 2023. See also DSIT 2023, Brundage et al. 2020, AI Safety Summit 2023, and Anthropic 2024.
^{^}
Idea: when sharing a model for external evals or red-teaming, for each mitigation (e.g. harmlessness fine-tuning or filters), either disable it or make it an explicit part of the safety case for the model. Either claim "users can't effectively jailbreak the model given the deployment protocol" or disable. Otherwise the lab is just stopping the bioengineering red-teamers from eliciting capabilities with mitigations that won't work against sophisticated malicious users.
^{^}
A previous version of this post omitted discussion of external testing of Gemini 1.5 Pro. Thanks to Mary Phuong for pointing out [LW(p) · GW(p)] this error.
^{^}
Politico and UK government press releases report that AI labs committed to share pre-deployment access with UK AISI. I suspect they are mistaken and these claims trace back to the UK AI safety summit "safety testing" session, which is devoid of specific commitments. I am confused about why the labs have not clarified their commitments and practices.
^{^}
See Shevlane 2022. See also Bucknall and Trager 2023 and Casper et al. 2024.

15 comments

Comments sorted by top scores.

comment by lemonhope (lcmgcd) · 2024-05-26T18:46:52.350Z · LW(p) · GW(p)

Anthropic said that collaborating with METR "requir[ed] significant science and engineering support on our end"; it has not clarified why.

I can comment on this (I think without breaking NDA). I will oversimplify. They were changing around their deployment system, infra, etc. We wanted uptime and throughput. Big pain in the ass to keep the model up (with proper access control) while they were overhauling stuff. Furthermore, anthropic and METR kept changing points of contact (rapidly growing teams).

This was and is my proposal for evaluator model access: If at least 10 people at a lab can access a model then at least 1 person at METR must have access.

This is for the labs self-enforcing via public agreements.

This seems like something they would actually agree to.

If it were a law then you would replace METR with "a govt approved auditor".

I think conformance could be greatly improved by getting labs to use a little login widget (could be CLI) which allows eg METR to see access permission changes (possibly with codenames for models andor people). Ideally this would be very little effort for labs and sidestepping it would be more effort once it was set up.

Feedback welcome.

External red-teaming is not external model evaluation. External red-teaming ... several people .... ~10 hours each. External model evals ... experts ... evals suites ... ~10,000 hours developing.

Yes there is some awkwardness here... Red teaming could be extremely effective if structured as an open competition. Possibly more effective than orgs like METR. The problem is that this trains up tons of devs on Doing Evil With AI and probably also produces lots of really useful github repos. So I agree with you.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-05-26T19:20:55.982Z · LW(p) · GW(p)

Thanks.

They were changing around their deployment system, infra, etc. We wanted uptime and throughput. Big pain in the ass to keep the model up (with proper access control) while they were overhauling stuff. Furthermore, anthropic and METR kept changing points of contact (rapidly growing teams).

This sounds like an unusual source of difficulty. Some Anthropic statements have suggested that sharing is hard in general. I hope it is just stuff like this. [Edit: possibly those statements were especially referring to deep model access or high-touch support. But then they don't explain the labs' lack of more basic sharing.]

Replies from: lcmgcd

↑ comment by lemonhope (lcmgcd) · 2024-05-26T19:37:47.410Z · LW(p) · GW(p)

Some Anthropic statements have suggested that sharing is hard in general.

If they said that then they are speaking nonsense IMO. Once you have your stuff set up it's a button you click. You have to trust that the evaluator won't leak info or soil your reputation without good cause though.

Replies from: WayZ

↑ comment by simeon_c (WayZ) · 2024-05-29T19:19:12.622Z · LW(p) · GW(p)

Jack Clark: “Pre-deployment testing is a nice idea but very difficult to implement,” from https://www.politico.eu/article/rishi-sunak-ai-testing-tech-ai-safety-institute/

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-05-29T19:21:36.732Z · LW(p) · GW(p)

Possibly he didn’t just mean technically difficult. And possibly Politico took this out of context. But I agree this quote seems bad and clarification would be nice.

comment by simeon_c (WayZ) · 2024-05-24T19:50:48.189Z · LW(p) · GW(p)

Very important point that wasn't on my radar. Thanks a lot for sharing.

comment by Mary Phuong (mary-phuong-2) · 2024-05-26T10:49:41.658Z · LW(p) · GW(p)

The latest Gemini tech report has some more info on GDM external safety testing: https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf
(section 9.6, p. 71)

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-05-26T17:31:58.621Z · LW(p) · GW(p)

external testing groups . . . . had the ability to turn down or turn off safety filters.

Yay DeepMind! I apologize for missing this. I will edit the post.

Longer quote for reference:

9.6. External Safety Testing
As outlined in the Gemini 1.0 Technical Report (Gemini-Team et al., 2023), we began working with a small set of independent external groups to help identify areas for improvement in our model safety work by undertaking structured evaluations, qualitative probing, and unstructured red teaming.
For Gemini 1.5 Pro, our external testing groups were given black-box testing access to a February 2024 Gemini 1.5 Pro API model checkpoint for a number of weeks. They had access to a chat interface and a programmatic API, and had the ability to turn down or turn off safety filters. Groups selected for participation regularly checked in with internal teams to present their work and receive feedback on future directions for evaluations.
These groups were selected based on their expertise across a range of domain areas, such as societal, cyber, and chemical, biological, radiological and nuclear risks, and included academia, civil society, and commercial organizations. The groups testing the February 2024 Gemini 1.5 Pro API model checkpoint were compensated for their time.
External groups designed their own methodology to test topics within a particular domain area. The time dedicated to testing also varied per group, with some groups working full-time on executing testing processes, while others dedicated one to three days per week. Some groups pursued manual red-teaming and reported on qualitative findings from their exploration of model behavior, while others developed bespoke automatic testing strategies and produced quantitative reports of their results.
[The report goes on to discuss some results from external testing.]

comment by phgubbins · 2024-06-03T07:32:00.033Z · LW(p) · GW(p)

I had an opportunity to ask an individual from one of the mentioned labs about plans to use external evaluators and they said something along the lines of:

“External evaluators are very slow - we are just far better at eliciting capabilities from our models.”

They earlier said something much to the same effect when I asked if they’d been surprised by anything people had used deployed LLMs for so far, ‘in the wild’. Essentially, no, not really, maybe even a bit underwhelmed.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-05-25T03:38:00.732Z · LW(p) · GW(p)

My org, SecureBio, would like to perform pre-deployment evaluations of Biorisk Capabilities of models. I think it would be great if the US gov mandated that orgs like us be allowed to do pre-deployment evaluations. Otherwise, I suppose it is up to the individual labs to arrange this with us.

Replies from: lcmgcd

↑ comment by lemonhope (lcmgcd) · 2024-05-26T18:52:50.891Z · LW(p) · GW(p)

Be allowed? You're not allowed?

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-05-26T18:54:08.621Z · LW(p) · GW(p)

Nathan wants labs to be required to share access with SecureBio.

Replies from: lcmgcd

↑ comment by lemonhope (lcmgcd) · 2024-05-26T19:10:12.965Z · LW(p) · GW(p)

Example #999 that I cannot read

comment by Sebastian Schmidt · 2024-05-28T08:42:26.449Z · LW(p) · GW(p)

Thanks for this - I wasn't aware. This also makes me more disappointed with the voluntary commitments at the recent AI Safety Summit. As far as I can tell (based on your blog post [LW · GW]), the language used around external evaluations was also quite soft:
"They should also consider results from internal and external evaluations as appropriate, such as by independent third-party evaluators, their home governments^[3], and other bodies their governments deem appropriate."

I wonder if it'd be possible to have much stronger language around this at next year's Summit.

The EU AI Act, is the only regulatory rules that isn't voluntary (i.e., companies have to follow them) and even that doesn't require external evaluations (they use the wording Conformity Assessment) for all high-risk systems.

Any takes on what our best bet is for making high-quality external evals mandatory for frontier models at this stage?

comment by Review Bot · 2024-05-24T20:26:32.651Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

comment by lemonhope (lcmgcd) · 2024-05-26T19:32:18.428Z · LW(p) · GW(p)

comment by Mary Phuong (mary-phuong-2) · 2024-05-26T10:50:25.404Z · LW(p) · GW(p)

AI companies aren't really using external evaluators

Contents

15 comments