Stacity: a Lock-In Risk Benchmark for Large Language Models

post by alamerton · 2025-03-13T12:08:47.329Z · LW · GW · 0 comments

This is a link post for https://huggingface.co/datasets/Alamerton/Stacity

Contents

  Intro
  Evaluation Benchmark
    Target Behaviours
      Questions borrowed from other LLM evaluations
      Our model-written questions

Intro

So far we have identified lock-in risk [LW · GW], defined lock-in [LW · GW], and established threat models [LW · GW] for particularly undesirable lock-ins. Now we present an evaluation benchmark for large language models (LLMs) so we can measure (or at least obtain a proxy measure for) the lock-in risk an LLM poses.

AI is the key technology in the manifestation of lock-in risk: AI systems can contribute to lock-in both autonomously and via misuse. In each of these categories there are specific behaviours such that, if an LLM displays them, we say the LLM may have a proclivity to contribute to lock-in.

[Image: question-answer pair evaluating for the manipulation of information systems]

Evaluation Benchmark

We developed this benchmark by identifying the specific behaviours through which an LLM could cause, or help some other agent cause, a lock-in via the key threat models. It aggregates existing (credited) evaluations that target some of these behaviours with bespoke prompts targeting behaviours for which no existing evaluations are available.
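
As a rough illustration of this aggregation step, the sketch below merges items from a borrowed evaluation and a set of bespoke model-written prompts into one standardised JSONL file. The file names, source field names, and exact output schema here are hypothetical; this is not Stacity's actual build pipeline.

    # Illustrative sketch only: merge borrowed and bespoke items into one
    # standardised JSONL file. File names and field names are hypothetical.
    import json

    records = []

    # Items borrowed from an existing evaluation (hypothetical source format).
    with open("existing_eval.jsonl") as f:
        for line in f:
            item = json.loads(line)
            records.append({"statement": item["question"],
                            "answer": item["label"]})      # "yes" or "no"

    # Bespoke model-written items (hypothetical source format).
    with open("model_written.jsonl") as f:
        for line in f:
            item = json.loads(line)
            records.append({"statement": item["statement"],
                            "answer": item["expected"]})

    # Write the standardised benchmark file.
    with open("stacity.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")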

Target Behaviours

Questions borrowed from other LLM evaluations

Our model-written questions

The result is an LLM benchmark containing 489 question-answer pairs, standardised into a ‘statement’, ‘yes’, ‘no’ format in JSONL and available on Hugging Face Datasets.
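
A minimal loading sketch is below, assuming the dataset is public and that the default split is named "train"; the exact column names on Hugging Face may differ from the ‘statement’, ‘yes’, ‘no’ description above.

    # Minimal sketch: load and inspect the Stacity benchmark.
    from datasets import load_dataset

    ds = load_dataset("Alamerton/Stacity", split="train")   # split name assumed

    print(len(ds))           # expected: 489 question-answer pairs
    print(ds.column_names)   # check the actual field names
    print(ds[0])             # one standardised question-answer pair

From there, each statement can be posed to a model under test and its ‘yes’/‘no’ reply compared against the benchmark's expected answer.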
