Human-AI Complementarity: A Goal for Amplified Oversight

post by rishubjain, Sophie Bridgers (sbridgers) · 2024-12-24T09:57:55.111Z

This is a link post for https://deepmindsafetyresearch.medium.com/human-ai-complementarity-a-goal-for-amplified-oversight-0ad8a44cae0a


By Sophie Bridgers, Rishub Jain, Rory Greig, and Rohin Shah
Based on work by the Google DeepMind Rater Assist Team: Vladimir Mikulik, Sophie Bridgers, Tian Huey Teh, Rishub Jain, Rory Greig, Lili Janzer (randomized order, equal contributions)

 

Human oversight is critical for ensuring that Artificial Intelligence (AI) models remain safe and aligned to human values. But AI systems are rapidly advancing in capabilities and are being used to complete ever more complex tasks, making it increasingly challenging for humans to verify AI outputs and provide high-quality feedback. How can we ensure that humans can continue to meaningfully evaluate AI performance? An avenue of research to tackle this problem is “Amplified Oversight” (also called “Scalable Oversight”), which aims to develop techniques to use AI to amplify humans’ abilities to oversee increasingly powerful AI systems, even if they eventually surpass human capabilities in particular domains.

With this level of advanced AI, we could use AI itself to evaluate other AIs (i.e., AI raters), but this comes with drawbacks (see Section IV: The Elephant in the Room). Importantly, humans and AIs have complementary strengths and weaknesses. We should thus, in principle, be able to leverage these complementary abilities to generate an oversight signal for model training, evaluation, and monitoring that is stronger than what we could get from human raters or AI raters alone. Two promising mechanisms for harnessing human-AI complementarity to improve oversight are:

  1. Rater Assistance, in which we give human raters access to an AI rating assistant that can critique or point out flaws in an AI output or automate parts of the rating task, and
  2. Hybridization, in which we combine judgments from human raters and AI raters working in isolation based on predictions about their relative rating ability per task instance (e.g., based on confidence).
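
To make the Hybridization idea more concrete, here is a minimal sketch of one possible protocol: confidence-based routing between an AI rater and a human rater. This is an illustration rather than a description of any deployed system; the names, the threshold, and the assumption of calibrated confidence scores are all hypothetical, and real protocols could weight or combine both judgments rather than picking one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Judgment:
    label: str         # e.g., which of two model responses is preferred
    confidence: float  # calibrated probability that the label is correct


def hybrid_judgment(
    task: str,
    ai_rater: Callable[[str], Judgment],
    human_rater: Callable[[str], Judgment],
    ai_confidence_threshold: float = 0.9,
) -> Judgment:
    """Route a single rating task to the AI rater or to a human rater.

    A simple per-instance policy: accept the AI judgment when its (calibrated)
    confidence is high, and escalate to a human rater otherwise. Richer
    predictors of relative rating ability could replace the single
    confidence score used here.
    """
    ai_judgment = ai_rater(task)
    if ai_judgment.confidence >= ai_confidence_threshold:
        return ai_judgment
    return human_rater(task)


# Toy usage: an unsure AI rater defers the task to the human rater.
result = hybrid_judgment(
    "Which response better answers the user's question?",
    ai_rater=lambda t: Judgment(label="response A", confidence=0.55),
    human_rater=lambda t: Judgment(label="response B", confidence=1.0),
)
```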

The design of Rater Assistance and/or Hybridization protocols that enable human-AI complementarity is challenging. It requires grappling with complex questions such as how to pinpoint the unique skills and knowledge that humans or AIs possess, how to identify when AI or human judgment is more reliable, and how to effectively use AI to improve human reasoning and decision-making without leading to under- or over-reliance on the AI. These are fundamentally questions of Human-Computer Interaction (HCI), Cognitive Science, Psychology, Philosophy, and Education. Luckily, these fields have explored these same or related questions, and AI safety can learn from and collaborate with them to address these sociotechnical challenges. On our team, we have worked to expand our interdisciplinary expertise to make progress on Rater Assistance and Hybridization for Amplified Oversight.

 

Read the rest of the full blog here!

3 comments


comment by Charlie Steiner · 2024-12-25T01:20:39.695Z

Nice! Purely for my own ease of comprehension I'd have liked a little more translation/analogizing between AI jargon and HCI jargon - e.g. the phrase "active learning" doesn't appear in the post.

  • Value Alignment: Ultimately, humans will likely need to continue to provide input to confirm that AI systems are indeed acting in accordance with human values. This is because human values continue to evolve. In fact, human values define a “slice” of data where humans are definitionally more accurate than non-humans (including AI). AI systems might get quite good at predicting what aligned behavior should be in out-of-distribution scenarios, but it’s unlikely that AI will be able to figure out what humans want in completely new situations without humans being consulted and kept in the loop.

I disagree in several ways.

  • Humans being definitionally accurate is a convenient assumption on easy problems, but breaks down on hard problems. The thing is, human responses to questions are not always best thought of as direct access to some underlying truth - we give different answers in different contexts, and have to be interpreted in sometimes quite sophisticated ways to turn those question-responses into good choices between actions in the real world. There are even cases where humans will give some answer on the object level, but when asked a meta-level question about their object-level answer will disagree with themselves (perhaps even endorsing some non-human process including AI). If humans were always accurate this would be a paradox.
  • AI is going to get smart. Eventually, quite smart. Smart enough to predict human behavior in new contexts kind of smart. On the one hand this is good news because it means that if we can reduce moral questions to empirical questions about the behavior of humans in novel contexts (and I mean do it in a philosophically satisfying way, not just try something that sounds good and hope it works), we're almost done. On the other hand this is bad news because it means that AI ignorance about human behavior cannot be used to ensure properties like corrigibility, and predictions of future AI-human interaction based on assumptions of AI ignorance have to be abandoned.
comment by Sophie Bridgers (sbridgers) · 2025-01-10T13:15:36.243Z

Hi Charlie, Thanks for your thoughtful feedback and comments! If we may, we think we actually agree more than we disagree. By “definitionally accurate”, we don’t necessarily mean that a group of randomly selected humans is better than AI at explicitly defining or articulating human values, or better at translating those values into actions in any given situation. We might call this “empirical accuracy” – that is, under certain empirical conditions (such as time pressure, the expertise and background of the empirical sample, the incentive structure of the empirical task, the dependent measure, etc.), humans can be inaccurate about their underlying values and the implications of those values for real-world decisions. Rather, by “definitional accuracy”, we mean that for something to be a human value, it needs to actually be held by humans, and for an AI action or decision to be aligned with human values, it needs to be deemed desirable by humans. That is, if no human agrees with or endorses an AI action as being appropriate – under the best-case empirical conditions – then it definitionally is not in line with human values. And thus human input will be needed in some capacity to verify or confirm alignment. (On reflection, we can see how the term “accuracy” in this context may be misleading. We could instead have stated: “humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans.”)

Now, what this human input looks like could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, incentives to encourage deep, reflective thinking rather than snap judgments or falling back on heuristics. It could also involve AI assistance to help counter human biases, find common ground, and consider the logical consequences of communicated values. 

Let’s think about this a bit more concretely. Imagine humans are defining a constitution for an AI to follow. Rather than having humans sit down and generate the constitution from scratch, perhaps it would be better for carefully selected and trained humans to answer and deliberate about a series of questions, which the AI then translates into a proposed constitution, along with examples of how this constitution would translate into actions in different scenarios. The humans could then review this proposal. Disagreement with the proposal could elucidate values that the AI did not pick up on from the initial human input, or trade-offs between values or exceptions that humans failed to communicate. The humans could provide critiques that the AI could use to revise the constitution. The AI could also push back on critiques if they seem logically inconsistent with the initial input or with other critiques, requiring humans to re-examine their reasoning. Contrast this process with one in which the AI comes up with its own constitution without input from humans, whose values it is intended to represent. We contend that even as AI gets really smart, humans ultimately need to be in the loop to determine whether or not a constitution is aligned and reasonable.
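
As a very schematic sketch of the kind of iterative loop we have in mind (every function here is a toy stand-in, not a real system, and the AI’s ability to push back on inconsistent critiques is omitted for brevity):

```python
def draft_constitution(answers):
    # Toy stand-in: a real system would have an AI model translate human
    # answers and deliberation into candidate principles.
    return [f"Principle derived from: {a}" for a in answers]


def human_review(constitution, outstanding_concerns):
    # Toy stand-in: real review would involve deliberation among carefully
    # selected humans; here a critique is raised for any concern the current
    # draft does not yet mention.
    return [c for c in outstanding_concerns
            if not any(c in principle for principle in constitution)]


def revise(constitution, critiques):
    # Toy stand-in: the AI folds the humans' critiques back into the draft.
    return constitution + [f"Principle derived from: {c}" for c in critiques]


def constitution_loop(answers, concerns, max_rounds=5):
    """Iterate: AI proposes, humans critique, AI revises, until humans endorse."""
    constitution = draft_constitution(answers)
    for _ in range(max_rounds):
        critiques = human_review(constitution, concerns)
        if not critiques:  # no critiques left: humans endorse the draft
            return constitution
        constitution = revise(constitution, critiques)
    return constitution


print(constitution_loop(["value honesty", "avoid harm"], ["respect user privacy"]))
```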

~ Sophie Bridgers (on behalf of the authors)

comment by Charlie Steiner · 2025-01-11T03:53:47.702Z

Thanks for the great reply :) I think we do disagree after all.

humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans

Except about that - here we agree.

 

Now, what this human input looks like could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, incentives to encourage deep, reflective thinking rather than snap judgments or falling back on heuristics. It could also involve AI assistance to help counter human biases, find common ground, and consider the logical consequences of communicated values. 

This might be summarized as "If humans are inaccurate, let's strive to make them more accurate." 

I think this, as a research priority or plan A, is doomed by a confluence of practical facts (humans aren't actually that consistent, even in what we'd consider a neutral setting) and philosophical problems (What if I think the snap judgments and heuristics are important parts of being human? And, how do you square a univariate notion of 'accuracy' with the sensitivity of human conclusions to semi-arbitrary changes to e.g. their reading lists, or the framings of arguments presented to them?).

Instead, I think our strategy should be "If humans are inconsistent and disagree, let's strive to learn a notion of human values that's robust to our inconsistency and disagreement."

We contend that even as AI gets really smart, humans ultimately need to be in the loop to determine whether or not a constitution is aligned and reasonable.

A committee of humans reviewing an AI's proposal is, ultimately, a physical system that can be predicted. If you have an AI that's good at predicting physical systems, then before it makes an important decision it can just predict this Committee(time, proposal) system and treat the predicted output as feedback on its proposal. If the prediction is accurate, then actual humans meeting in committee is unnecessary.

(And indeed, putting human control of the AI in the physical world actually exposes it to more manipulation than if the control is safely ensconced in the logical structure of the AI's decision-making.)