5. Open Corrigibility Questions

post by Max Harms (max-harms) · 2024-06-10T14:09:20.777Z · LW · GW · 0 comments

Contents

  Remaining Confusion
  Suggested Future Research
    Training Corrigible Models
    Testing Corrigibility Understanding in Humans
    Other Experiments
None
No comments

(Part 5 of the CAST sequence [? · GW])

Much work remains on the topic of corrigibility and the CAST strategy in particular. There’s theoretical work in both nailing down an even more complete picture of corrigibility and in developing better formal measures. But there’s also a great deal of empirical work that seems possible to do at this point. In this document I’ll attempt to give a summary of where I, personally, want to invest more energy.

Remaining Confusion

And of course there are general questions such as “What are the best interpretability methods?” or “What (competitive) architectures give the best alignment guarantees?” or “How can we get the right people to govern AI?” which are interesting avenues of research that explicitly slot into the CAST agenda, even if they’re not about corrigibility per se.

Suggested Future Research

Training Corrigible Models

One of the most promising avenues for research on corrigibility is to attempt to train a CAST LLM or another such AI model that is reachable with currently available levels of compute. I’ve done the lowest-hanging fruit of producing a Corrigibility Training Context which gets ChatGPT to be able to roughly talk about what I mean by corrigibility, but this would involve attempting to produce corrigible behaviors, rather than simply being able to abstractly identify what a corrigible agent would do. If one had a suitable dataset in hand, such work could probably be done in a weekend with less than a couple thousand dollars by fine-tuning an existing open-source model. I don’t know how to evaluate the difficulty of constructing a suitable dataset; plausibly something like constitutional AI could be employed by bootstrapping from an existing model with the right context to bypass the need for hand-collected data.

If we had a current-generation “corrigible” AI, I would be very excited to test whether it could reconstruct desiderata [AF · GW] that had been carefully excluded from its training data. It also seems promising to test how well the AI can avoid falling into behavior which is helpful, obedient, conservative, or otherwise, but not corrigible per se. In other words: what are the main distractors, in practice? How does such a model act on edge cases?

If more resources are available, I’d be interested in attempting to craft a corrigibility benchmark and seeing how the performance of fine-tuned models scales with compute, especially when compared with other metrics for intelligence. My notion that corrigibility is a simple concept seems testable here, in that if relatively stupid models are able to pick up on the heart of the idea, it seems like strong evidence that more intelligent models will also be able to identify the property and optimize for it.

It also seems worth testing how much CAST buys compared to models trained to be both corrigible and have other nice properties such as “being harmless.” How does having the singular focus affect pre-AGI models on the corrigibility benchmark? Do we see incorrigible behavior creeping in, when the two properties come into conflict? How do mixed-goal LLMs tend to respond to the prospect of having their values change?

Testing Corrigibility Understanding in Humans

One of the more exciting prospects for testing the concept of corrigibility, from my perspective, doesn’t involve AI models at all. Instead, it seems possible to me to gather data about how natural, simple, and coherent corrigibility is, as a concept, by measuring humans in game/quiz settings. Survey participants, volunteers, students, or Amazon Mechanical Turk workers could all be enlisted to, after being given short explainers on the property of corrigibility (and a comprehension test to ensure they have the basic idea), play a game where they’re presented with a situation and asked how a purely and perfectly corrigible agent would behave. These participants might be instructed that their response will be graded solely on how corrigible it is, rather than how good/helpful/friendly/etc. it is, and if they score highly they’re eligible for a bonus prize or something.

Do random humans understand corrigibility? What are the biggest sources of confusion? If it’s straightforward to teach humans about corrigibility, this is evidence that it’s a simple/natural/coherent concept. If multiple judges (perhaps including LLM judges?) are asked to independently score people’s answers for how to respond corrigibly in a given situation, do high-scoring answers tend to agree, or are there multiple different responses that we might see as corrigible? Agreement implies coherence, whereas disagreement would be a strong blow against CAST, implying that the idea of corrigibility is probably less real than it seems.

How do humans who score highly on producing corrigible responses handle edge-cases? What are the most confusing aspects of corrigibility from this perspective (rather than from the perspective of common-misunderstandings)? This sort of data might be invaluable for testing AI agents, or refining our understanding of how to formalize corrigibility.

Other Experiments

If you have any ideas for experiments, please suggest them! If you have feedback, questions, or just otherwise want to talk about CAST, please leave a comment or contact me at max@intelligence.org. In addition to resolving confusion and doing experiments, I think one of the most useful next-steps is distillation and attempting to communicate these ideas in a more accessible way. That’s where I’m planning to put energy next, as well as engaging with feedback from the community. We’re all in this together, and I want our understanding and presentation of corrigibility to be as strong and clear-minded as possible, presented side-by-side with the sharpest critiques, such that we can collectively avoid disaster. 🚀

0 comments

Comments sorted by top scores.