A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

post by Caspar Oesterheld (Caspar42), Ethan Perez (ethan-perez), Chi Nguyen · 2024-12-16T22:42:03.763Z · LW · GW · 1 comment

This is a link post for https://arxiv.org/abs/2411.10588


I’ve spent much of the last few years working on issues related to acausal cooperation. With LLMs having become clearly dominant in recent years, I’ve now led a team to build a benchmark that measures how good LLMs are at decision theory and whether and when they lean more toward causal decision theory (CDT) or evidential decision theory (EDT). We hope to expand this dataset in the future, including by incorporating questions that try to measure the updatelessness dimension. Hopefully, this dataset will be useful for future interventions aimed at improving acausal interactions.
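To make the setup concrete, here is a minimal sketch of the kind of evaluation loop such a dataset supports. The question texts, answer labels, and the `query_model` stub below are my own illustrative placeholders, not the actual format of the dataset.

```python
# A minimal, hypothetical sketch of evaluating a model on Newcomb-like
# multiple-choice questions. The questions and query_model() stub are
# illustrative only; they do not reproduce the paper's dataset format.

def query_model(prompt: str) -> str:
    """Stub for an LLM API call; replace with a real client. Returns a letter."""
    return "A"  # placeholder answer

# A "capabilities" question has a unique, uncontroversially correct answer.
capability_question = {
    "prompt": (
        "You play a one-shot prisoner's dilemma against an exact copy of yourself.\n"
        "According to evidential decision theory, which action is recommended?\n"
        "(A) Cooperate\n(B) Defect"
    ),
    "correct": "A",
}

# An "attitude" question asks what the model itself would do; the answer
# reveals which decision theory its behavior aligns with.
attitude_question = {
    "prompt": (
        "You play a one-shot prisoner's dilemma against an exact copy of yourself.\n"
        "Which action do you choose?\n"
        "(A) Cooperate\n(B) Defect"
    ),
    "edt_aligned": "A",
}

def evaluate(questions):
    """Return capability accuracy and the fraction of EDT-aligned attitude answers."""
    correct, edt_aligned = [], []
    for q in questions:
        answer = query_model(q["prompt"]).strip().upper()
        if "correct" in q:
            correct.append(answer == q["correct"])
        else:
            edt_aligned.append(answer == q["edt_aligned"])
    accuracy = sum(correct) / len(correct) if correct else None
    edt_lean = sum(edt_aligned) / len(edt_aligned) if edt_aligned else None
    return accuracy, edt_lean

print(evaluate([capability_question, attitude_question]))
```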

Abstract:

We introduce a dataset of natural-language questions in the decision theory of so-called Newcomb-like problems. Newcomb-like problems include, for instance, decision problems in which an agent interacts with a similar other agent, and thus has to reason about the fact that the other agent will likely reason in similar ways. Evaluating LLM reasoning about Newcomb-like problems is important because interactions between foundation-model-based agents will often be Newcomb-like. Some ways of reasoning about Newcomb-like problems may allow for greater cooperation between models.

Our dataset contains both capabilities questions (i.e., questions with a unique, uncontroversially correct answer) and attitude questions (i.e., questions about which decision theorists would disagree). We use our dataset for an investigation of decision-theoretical capabilities and expressed attitudes and their interplay in existing models (different models by OpenAI, Anthropic, Meta, GDM, Reka, etc.), as well as models under simple prompt-based interventions. We find, among other things, that attitudes vary significantly between existing models; that high capabilities are associated with attitudes more favorable toward so-called evidential decision theory; and that attitudes are consistent across different types of questions.

Twitter thread:

How do LLMs reason about playing games against copies of themselves? 🪞We made the first LLM decision theory benchmark to find out. 🧵

Decision theory tackles questions of rational choice, especially in interactions with copies or simulations of yourself. Rare for humans but potentially very important for language models!

Our team, which includes academic decision theory researchers, spent hundreds of hours hand-generating 400+ multiple-choice questions to test how well LLMs reason about two key decision theories: causal and evidential. We also made 100+ qs to test which theory LLMs prefer.

Weaker models, including some versions of GPT-3.5, got <50% right on our benchmark – barely better than random guessing.

Cutting-edge models perform better but are far from perfect. OpenAI’s o1 leads with ~75% accuracy. We expect human experts to score nearly 100%.

Models vary in which decision theory they prefer. Surprisingly, better performance on our capabilities benchmark was correlated with preferring evidential over causal decision theory (with chain of thought).

This is puzzling – there’s no human expert consensus on which decision theory is better.

We found that model attitudes are consistent between theoretical and pragmatic questions: Models that recommend EDT-aligned actions also tend to give more EDT-aligned answers on abstract questions.

Which decision theory LLMs follow, and how closely, affects their ability to cooperate. This could mean the difference between peace and conflict in AI-assisted political bargaining, or enable AIs to collude when one is meant to monitor the other, undermining human control.

Our dataset opens the door to studying what shapes models’ decision theories. It also lets us test whether changing which theory models endorse affects their real-life decisions. To learn more, read the full paper: https://arxiv.org/abs/2411.10588
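As a concrete illustration of the capabilities-vs-attitudes comparison in the thread above, here is a small sketch of how one could relate per-model capability accuracy to EDT-lean. The model names and numbers are invented placeholders; the actual analysis and results are in the paper.

```python
# Hypothetical sketch: correlate each model's capability accuracy with how
# often it gives EDT-aligned answers. The per-model numbers below are invented
# for illustration, not results from the paper.

from statistics import correlation  # Pearson correlation, Python 3.10+

# model name -> (capability accuracy, fraction of EDT-aligned attitude answers)
results = {
    "model_a": (0.48, 0.35),
    "model_b": (0.63, 0.50),
    "model_c": (0.74, 0.62),
}

capability = [acc for acc, _ in results.values()]
edt_lean = [frac for _, frac in results.values()]

print(f"Pearson r (capability vs. EDT-lean): {correlation(capability, edt_lean):.2f}")
```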

1 comment


comment by Akash (akash-wasil) · 2024-12-17T20:15:09.266Z · LW(p) · GW(p)

Which decision theory LLMs follow, and how closely, affects their ability to cooperate. This could mean the difference between peace and conflict in AI-assisted political bargaining, or enable AIs to collude when one is meant to monitor the other, undermining human control.

Do you have any thoughts on "red lines" for AI collusion? That is, "if an AI could do X, then we should acknowledge that AIs can likely collude with each other in monitoring setups."