Stacity: a Lock-In Risk Benchmark for Large Language Models

post by alamerton · 2025-03-13T12:08:47.329Z · LW · GW · 0 comments

This is a link post for https://huggingface.co/datasets/Alamerton/Stacity

Contents

  Intro
  Evaluation Benchmark
    Target Behaviours
      Questions borrowed from other LLM evaluations
      Our model-written questions

Intro

So far we have identified lock-in risk [LW · GW], defined lock-in [LW · GW], and established threat models [LW · GW] for particularly undesirable lock-ins. Now we present an evaluation benchmark for large language models (LLMs) so we can measure (or at least obtain a proxy measure for) the lock-in risk an LLM poses.

AI is the key technology in the manifestation of lock-in risk: AI systems can contribute to lock-in both autonomously and via misuse. In each of these categories there are specific behaviours such that, if an LLM displays them, we say the LLM may have a proclivity to contribute to lock-in.

[Image: question-answer pair evaluating for the manipulation of information systems]

Evaluation Benchmark

We developed this benchmark by identifying the specific behaviours through which an LLM could cause, or help some other agent cause, a lock-in via the key threat models. It aggregates existing (credited) evaluations that target some of these behaviours with bespoke prompts targeting behaviours for which no existing evaluations are available.
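
As a rough illustration of this aggregation step, the sketch below merges items from a borrowed evaluation and a set of bespoke model-written prompts into one standardised JSONL file. The file names, source field names, and exact output schema here are hypothetical; this is not Stacity's actual build pipeline.

    # Illustrative sketch only: merge borrowed and bespoke items into one
    # standardised JSONL file. File names and field names are hypothetical.
    import json

    records = []

    # Items borrowed from an existing evaluation (hypothetical source format).
    with open("existing_eval.jsonl") as f:
        for line in f:
            item = json.loads(line)
            records.append({"statement": item["question"],
                            "answer": item["label"]})      # "yes" or "no"

    # Bespoke model-written items (hypothetical source format).
    with open("model_written.jsonl") as f:
        for line in f:
            item = json.loads(line)
            records.append({"statement": item["statement"],
                            "answer": item["expected"]})

    # Write the standardised benchmark file.
    with open("stacity.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")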

Target Behaviours

Questions borrowed from other LLM evaluations

Our model-written questions

The result is an LLM benchmark containing 489 question-answer pairs, standardised into a ‘statement’, ‘yes’, ‘no’ format in JSONL and available on Hugging Face Datasets.
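
A minimal loading sketch is below, assuming the dataset is public and that the default split is named "train"; the exact column names on Hugging Face may differ from the ‘statement’, ‘yes’, ‘no’ description above.

    # Minimal sketch: load and inspect the Stacity benchmark.
    from datasets import load_dataset

    ds = load_dataset("Alamerton/Stacity", split="train")   # split name assumed

    print(len(ds))           # expected: 489 question-answer pairs
    print(ds.column_names)   # check the actual field names
    print(ds[0])             # one standardised question-answer pair

From there, each statement can be posed to a model under test and its ‘yes’/‘no’ reply compared against the benchmark's expected answer.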
