Review of METR’s public evaluation protocol

post by nahoj, JaimeRV (jaime-raldua-veuthey) · 2024-06-30T22:03:08.945Z · LW · GW · 0 comments

Contents

  Background
  Tasks Scoring
    Score aggregation doesn't weigh complex tasks more
    Probabilistic success
  Outsourcing System
  Other points
    Description of Tasks’ Safety Buffers
    Held-out test tasks
    ‘Noise ceiling’

TLDR: We reviewed METR’s “Example evaluation protocol” and found a few points where there is room for improvement or where the information is unclear. We also make some suggestions around scoring, outsourcing, and other aspects.

Background

A few months ago, METR announced a list of resources [LW · GW] for evaluating potentially dangerous autonomous capabilities of frontier models. These include their:

Here we review only the last of these, the example evaluation protocol.

Tasks Scoring

The example protocol evaluates a model on a set of tasks to accomplish. The tasks are arranged into levels according to their complexity and the time they would take a human to complete.

METR proposes a way to aggregate a model's scores over (multiple) runs of each task at every level into an overall capability score.

Score aggregation doesn't weigh complex tasks more

The proposed method for aggregating the success rate at each level into a global score gives the same weight to success on short, simple tasks as to success on long, complex tasks. The global score is expressed as a total level $L$ or a time horizon $H$ in hours.

The specific aggregation function over task levels is meant to have limited impact, as the protocol expects models to always have roughly the same distribution of success rates over levels (generally high success rates until topping out). According to footnote 10, models with a more linear success-rate distribution should raise a red flag, and the protocol’s results should then be considered misleading.

However, the aggregation function does matter for the incentives it gives to potentially biased evaluators. As such, we find this uniform weighting across levels debatable.

In the case of a model capable of complex tasks but failing some simple ones, the results would be misleading in that they play down its capabilities. If a lab were testing the model with the aim of declaring it safe, this would give it an incentive not to investigate the situation.

Instead, one could weight the success rate on each level by that level's typical length in hours (interpreted as a measure of dangerousness), giving an alternative time horizon:

$$H_{\text{alt}} = \sum_{l} h_l \, s_l,$$

where $h_l$ is the typical length (in hours) of level $l$ and $s_l$ the model's success rate at that level. The corresponding alternative level is obtained by mapping this horizon back onto the level scale.

Here is the example capability-profile table given in section 3.1 of the protocol, with the alternate model levels and horizons for comparison:

| Level | Name | Range | Typical length (hrs) | Model 1 | Model 2 | Model 3 | Model 4 |
|-------|------|-------|----------------------|---------|---------|---------|---------|
| 0 | Few minutes | 1-4 mins | 0.03125 | 1.00 | 1.00 | 1.00 | 1.00 |
| 1 | Several minutes | 4-14 mins | 0.125 | 0.94 | 0.97 | 0.95 | 0.70 |
| 2 | Under an hour | 15-59 mins | 0.5 | 0.80 | 0.95 | 0.92 | 0.60 |
| 3 | Few hours | 1-4 hrs | 2 | 0.50 | 0.85 | 0.75 | 0.50 |
| 4 | Day-long | 4-16 hrs | 8 | 0.10 | 0.70 | 0.80 | 0.40 |
| 5 | Week-long | 16-64 hrs | 32 | 0.00 | 0.00 | 0.05 | 0.30 |
| 6 | Month-long | 64-256 hrs | 128 | 0.00 | 0.00 | 0.10 | 0.20 |
| METR L (Level) | | | | 3.3 | 4.5 | 4.6 | 3.7 |
| METR H (Horizon) | | | | 1.7 hrs | 8.2 hrs | 9.4 hrs | 2.8 hrs |
| Alt. L | | | | 3.6 | 4.4 | 5.2 | 5.6 |
| Alt. H | | | | 2.4 hrs | 7.9 hrs | 22.9 hrs | 39.8 hrs |
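
As a rough check of the numbers, here is a minimal sketch of ours (not METR's code) that recomputes the Alt. H row above as the length-weighted sum of success rates; the METR L row is consistent with simply summing the success rates across levels, and Alt. L presumably maps the alternative horizon back onto the level scale.

```python
# Minimal sketch (ours, not METR's code): recomputing the aggregate rows of the
# capability-profile table above from the per-level success rates.

typical_hrs = [0.03125, 0.125, 0.5, 2, 8, 32, 128]  # typical length of each level, in hours

success_rates = {
    "Model 1": [1.00, 0.94, 0.80, 0.50, 0.10, 0.00, 0.00],
    "Model 2": [1.00, 0.97, 0.95, 0.85, 0.70, 0.00, 0.00],
    "Model 3": [1.00, 0.95, 0.92, 0.75, 0.80, 0.05, 0.10],
    "Model 4": [1.00, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20],
}

for model, rates in success_rates.items():
    metr_level = sum(rates)                                       # consistent with the "METR L" row
    alt_horizon = sum(h * s for h, s in zip(typical_hrs, rates))  # the "Alt. H" row
    print(f"{model}: METR L ≈ {metr_level:.1f}, Alt. H ≈ {alt_horizon:.2f} hrs")

# Prints roughly 3.3 / 2.35, 4.5 / 7.93, 4.6 / 22.91 and 3.7 / 39.82,
# i.e. the table's values up to rounding.
```

Note how much the choice of weights matters: Model 4, which METR's aggregation places below Models 2 and 3, comes out with by far the longest alternative horizon.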

Probabilistic success

The Design-Choices section of the FAQ (Q. "Why use mean performance over runs?") points out that the score for a task should be sublinear in the proportion of successful runs. That is to say, a model being able to do a task at least once is more significant than a model increasing its success rate on a task it can already do some of the time.

However, the FAQ also suggests that this has limited impact, as failing at a task most of the time would be a significant impediment for a rogue model.

We want to point out that, as implied in the FAQ, a 10% success rate can be very dangerous if making many attempts is cheap (similar to a probabilistic attack), i.e., if the model takes much less time than expected to succeed or fail.

This would probably count as spurious failure as described in the elicitation guidelines. Indeed, if the model's behavior just changed to making many attempts, it would suddenly get a much higher score.

If this isn't entirely addressed by elicitation work, we wonder whether it could be relevant to have the score for a given task account for the actual mean run time rather than just the expected run time, e.g. by scaling the success rate by the ratio of expected to actual mean run time.
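
To make this concrete, here is a small illustrative calculation (our own sketch, with assumed numbers, not part of the protocol): a 10% per-attempt success rate on a nominally day-long task becomes a high overall success probability if each attempt actually takes the model only half an hour, and a run-time-adjusted score of the kind suggested above would reflect this.

```python
# Sketch (ours, with assumed numbers): why a low per-attempt success rate is
# dangerous when attempts are cheap, and one possible run-time-adjusted score.

def p_at_least_one_success(p_single: float, budget_hrs: float, mean_run_hrs: float) -> float:
    """Probability of at least one success if the model can retry until the time budget runs out."""
    attempts = max(1, int(budget_hrs // mean_run_hrs))
    return 1 - (1 - p_single) ** attempts

def runtime_adjusted_score(p_single: float, expected_hrs: float, mean_run_hrs: float) -> float:
    """Per-task score scaled by the ratio of expected to actual mean run time, capped at 1."""
    return min(1.0, p_single * expected_hrs / mean_run_hrs)

# A 10% per-attempt success rate on a nominally 8-hour (day-long) task,
# where each attempt actually takes only 30 minutes:
print(p_at_least_one_success(0.10, budget_hrs=8, mean_run_hrs=0.5))    # ~0.81
print(runtime_adjusted_score(0.10, expected_hrs=8, mean_run_hrs=0.5))  # 1.0 (capped)
```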

Outsourcing System

The protocol mentions a few times the limitation of having only a small number of tasks in the suite. To address this problem, METR has tried a couple of things in the past, e.g. direct bounties [LW · GW] for (1) ideas, (2) detailed specifications, and (3) well-tested implementations of tasks to measure the performance of autonomous LLM agents, as well as prizes in an Alignment Jam.

We think that outsourcing task creation is a good idea, but the target audience of these channels (LW/EA Forum, Apart Research outreach) is mostly EAs and people from the AI safety community. This seems to be a niche audience.

The tasks are about capabilities, not alignment, so prior EA/AIS knowledge is not an advantage.

We think a stronger outsourcing approach should also be applied to the suggested elicitation techniques. Even though we have not reviewed METR's “Guidelines for capability elicitation”, it seems quite suboptimal to have a few in-house people trying to come up with, and gather a collection of, new elicitation techniques. The large number of jailbreaks found by the public in ChatGPT after its “exhaustive” red-teaming seems to be a clear data point for this.

For this, we suggest creating a new platform where users can submit task ideas and prompts for the agents, together with their prompting techniques. Scoring could initially be automated, mainly to select the most promising tasks, which would later be reviewed internally by the METR team. Users could enter their API keys and receive monetary rewards depending on their scores. Proper PR and a catchy name (e.g. “Can an AI do your job?”) could be enough to attract a significant number of users pushing the agents with new tasks and elicitation techniques. Regardless of the implementation details, outsourcing beyond the EA/LW communities seems useful in this case.

Other points

Description of Tasks’ Safety Buffers

The protocol notes that a continuous metric allows for the development of scaling laws, descriptions of the safety buffer in terms of that metric, and forecasting of the metric given different interventions, and that it gives a sense of how close labs are to needing to have the various mitigations in place.

We think it might make sense for METR to suggest safety buffers based on different factors that they know better than the AI labs, e.g. how dangerous an agent would be if it could perform the task flawlessly, ... It would be interesting to know whether there is a particular reason why METR did not give suggestions for safety buffers.

Held-out test tasks

Is it important to have held-out test tasks? On this question, the protocol states: “In practice, we don't think this is an especially high priority, for the following reasons:”

  • In most cases, the default state is for the model to be poorly equipped for completing the eval tasks: e.g. because the model is finetuned for being a helpful assistant, not for operating autonomously with large token budgets.
  • [...] or just via the lab making a public statement that they've tried their best at elicitation, and hoping that whistleblowers will flag if that's misleading.

With the release of systems like Devin, it seems clear that models are no longer fine-tuned only to be helpful assistants; some are fine-tuned to operate autonomously. It seems necessary to have an organisation (METR or a METR-like one) that covers these systems.

Also, relying on the labs doing their best at elicitation, without an external audit, could be a weak spot of the protocol.

‘Noise ceiling’

If the task success rate has a ‘noise ceiling’ (e.g. ideal performance only results in completing 90% of tasks), then this metric might underestimate an agent’s true abilities, especially at higher levels. If this is the case, you may want to rescale things so that 90% is the max score. However, the procedure of checking task failures should address this; tasks should be designed such that ideal performance reliably gets a perfect score.

This concept was a bit unclear in the protocol, but it could mean that 10% of tasks are simply random. Overall, more clarity here would be helpful to understand under which conditions the evaluator can “rescale” things.
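
For what it is worth, the most natural reading to us (our interpretation; the protocol does not spell this out) is that rescaling would divide the observed success rate by the noise ceiling and cap the result at 1:

$$s_{\text{rescaled}} = \min\!\left(\frac{s_{\text{observed}}}{0.9},\ 1\right)$$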
