Procedurally evaluating factual accuracy: a request for research

post by Jacob_Hilton · 2022-03-30T16:37:37.675Z · LW · GW · 2 comments

Contents

  Problem statement
  Motivation
  Research directions
  Who to align to
  Working on this problem
2 comments

I am grateful to Daniel Kokotajlo, Beth Barnes and John Schulman for feedback on this post.

The purpose of this post is to request research on the design of precise procedures for evaluating how factually accurate pieces of text are. This stands out to me as an area that is potentially valuable for reducing risks from advanced AI, while not requiring detailed knowledge of ML.

Problem statement

Suppose that you are given:

The problem is to define a procedure that takes in a context and a piece of text from the given distribution, and outputs a numeric score for factual accuracy.

The procedure should have the following properties:

Note that the problem statement doesn't mention ML models (except as examples). The problem is of course motivated by ML models, but I think it can be studied relatively independently of ML, at least initially.
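To make the shape of the problem concrete, here is a minimal sketch of the interface such a procedure would expose if it were written down as code. Everything in it is an assumption made for illustration: the names `AccuracyProcedure`, `Example`, `mean_score` and `placeholder_procedure` are hypothetical, and the placeholder is deliberately not a real accuracy check. In practice the procedure would likely be carried out by humans following written instructions rather than by a function.

```python
from dataclasses import dataclass
from typing import Protocol


class AccuracyProcedure(Protocol):
    """Hypothetical interface: maps (context, text) to a numeric score."""

    def __call__(self, context: str, text: str) -> float: ...


@dataclass
class Example:
    context: str  # e.g. a question posed to a model
    text: str     # e.g. the answer whose factual accuracy is evaluated


def mean_score(procedure: AccuracyProcedure, examples: list[Example]) -> float:
    """Average the procedure's score over a sample from the distribution."""
    if not examples:
        raise ValueError("need at least one example")
    return sum(procedure(e.context, e.text) for e in examples) / len(examples)


def placeholder_procedure(context: str, text: str) -> float:
    """Stand-in only, NOT an accuracy check: 1.0 for non-empty text, else 0.0."""
    return 1.0 if text.strip() else 0.0
```

The point of the sketch is only that the inputs (a context and a piece of text) and the output (a single number) are fixed by the problem statement, while the body of the procedure is the open research question.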

Motivation

The main motivation for this request is that it is a problem that arises very naturally when attempting to train truthful LMs [AF · GW]. The most straightforward way to optimize the factual accuracy of a language model is to have humans evaluate the factual accuracy of model outputs, and to then optimize those evaluations using techniques like reinforcement learning. The procedure needs to be unambiguous because label noise hurts both ML training and labeler monitoring (not to mention other benefits). Subject to this constraint, the main criterion for the procedure should be that it produces good outcomes in the given real-world setting.
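One rough way to check whether a procedure is unambiguous in this sense is to have several labelers score the same items and measure how much they disagree. The sketch below is an illustration under assumptions, not a recommended metric: the function name `mean_pairwise_disagreement` and the choice of mean absolute difference between pairs of labelers are mine, and other agreement statistics may be more appropriate.

```python
from itertools import combinations
from statistics import mean


def mean_pairwise_disagreement(scores_by_item: list[list[float]]) -> float:
    """Average absolute score difference between pairs of labelers.

    scores_by_item holds one inner list per item, containing the scores
    that different labelers assigned to that item. Items with fewer than
    two scores are skipped. Lower values suggest less label noise.
    """
    per_item = [
        mean(abs(a - b) for a, b in combinations(scores, 2))
        for scores in scores_by_item
        if len(scores) >= 2
    ]
    if not per_item:
        raise ValueError("need at least one item with two or more scores")
    return mean(per_item)
```

A value near zero would indicate that labelers following the procedure mostly reach the same score. A large value would suggest that the procedure leaves too much room for judgment calls.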

My main reasons for thinking that this research could be important for reducing risks from advanced AI are:

I do think that this research is a gamble, in the sense that the details of evaluating factual accuracy may not end up mattering very much, perhaps because a wide range of procedures are good enough to avoid the very worst outcomes, and the important bottlenecks are elsewhere. That being said, I think we'll be in a better position to evaluate those arguments once we've given the research more of a try.

In some sense, the research can be thought of as a very special case of trying to specify more precisely what humans value. However, compared to more general research on that question, I think the specific research has a number of advantages:

Research directions

Here are two important examples of existing research that begin to tackle this problem from different ends of a spectrum:

  1. Truthful AI (theoretical). An important concept introduced by this research is that of negligent falsehoods: statements that are unacceptably likely to be false, and where it should have been feasible for an AI system to understand this. In Section 2.2, a high-level procedure for evaluating whether a statement is a negligent falsehood is proposed. However, the procedure would need to be made much more precise in order to be used in practice.
  2. WebGPT (practical). This research essentially proposes a solution to the problem for the specific setting of an AI system that browses the web to answer questions, taking contexts from the ELI5 dataset. The full procedure is somewhat involved, and is described in great detail in this Google doc. It involves cross-referencing the answer with sources found during browsing. However, the research does not seek to provide much justification for this procedure.

I think that it could be productive to push harder from either end of this spectrum. On the theoretical side:

On the practical side:

An instructive exercise is to browse some of WebGPT's answers, and to consider how one might evaluate their factual accuracy without the given sources (but potentially collecting new sources as part of the procedure). Even for factual topics, there can often be vague, subjective or holistic claims, which can be very hard to evaluate without either relevant expertise or a direct confirmation/refutation from a reliable source.

Another very relevant line of existing research is the exploration [AF · GW] of [AF · GW] debate using human judges and debaters, with a view to having AI systems play the roles of the debaters. Current AI systems are not yet capable enough for these schemes to be practical, but it is good to be thinking ahead, and there could also be shorter-term takeaways.

There is probably a lot more research in philosophy and the social sciences that is also relevant. Wikipedia's verifiability policy seems closely related, and is well-studied. There is even an entire field of applied epistemology. However, most of this research has not yet been made accessible to ML researchers working on factual accuracy. There could therefore be some low-hanging fruit in digesting some of this work appropriately.

Who to align to

A closely related question that often comes up is "who to align to": specifically, if there is some ambiguity in the procedure, who should be asked to make those judgment calls? For example, people of different political persuasions will often evaluate politically-sensitive statements differently.

I expect this question to eventually become an important part of the problem, but I'd be inclined to begin by focusing on procedure design, for a few reasons:

That being said, I'd still be excited to see work on this part of the problem, since it's a thorny issue that seems closely tied to risks from AI persuasion.

It's tempting to be pessimistic that we'll be able to design procedures that people of different political persuasions can have trust in, because of the current state of political discourse. But I think it might be feasible to design procedures that are much more broadly trusted than current institutions, for a couple of reasons:

There are of course a number of obstacles in getting from procedures that are broadly trusted to working AI systems that are trusted to follow those procedures, but I think they are surmountable with enough effort. And even if it turns out to be impossible to design procedures that are universally trusted, there could still be significant benefits from improving procedures on the margin.

Working on this problem

It's hard to convey exactly what kind of research I'd find most compelling in this area, and I'd be happy to chat to people who are considering working on this topic. Feel free to reach out to me at jhilton@openai.com.


2 comments

Comments sorted by top scores.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-03-30T17:14:04.765Z · LW(p) · GW(p)

Thanks for writing this, I'm excited to see more work on this subject!

One minor musing: I think the problem is a bit more dire than the framing "who to align to" suggests. Humans are biased, including us, including me. A system which replicates those biases and tells us/me what we would have concluded if we investigated in our usual biased way... is "aligned" in some sense, but in a very important sense is unaligned.* To use Ajeya's metaphor, it's a sycophant, not a saint. Rather than assisting us to find the truth, it'll assist us in becoming more unreasonably overconfident and self-assured in the ideology we already endorsed.

One reason I'm excited about research in this area is that hopefully we'll be able to collect data from a wide range of different political perspectives and diverse kinds of people, so that we can make political affiliation one of the variables the user can choose -- that way users can see how the bot's answers differ depending on which bias it has. I expect this to be pretty helpful in a variety of ways.

*A provocative way of putting it that I nevertheless tentatively endorse: It's aligned to your current ideology, not to you.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2022-03-30T19:10:01.600Z · LW(p) · GW(p)

I have a friend who has been working on a team doing automatic factual responses to search queries. I'll send him the link to this article and maybe he'll have some thoughts...