Some criteria for sandwiching projects

post by dmz (DMZ) · 2021-08-12T03:40:37.720Z

I liked Ajeya's post ("The case for aligning narrowly superhuman models") a lot, and I think the alignment community should try to do sandwiching projects along the lines she describes. Here I wanted to flesh out some potential criteria for a good sandwiching project; there's not too much original thinking here, but I found it helpful to write out. Most of the criteria are actually about the chosen domain, not the plan for attacking it.

Pieces we need

An aligned metric

Our chosen task needs to come with an aligned evaluation metric based on some expert humans or another source of ground truth. We'll use this to evaluate how well our models do.

A narrowly superhuman model

We need a model that is capable of doing better on the task than the labelers who will provide our training data. (One way to check this is to train a model using the aligned metric as a training signal, although that could be expensive.) There are a few ways the labelers might be worse than the model: for example, they might only get to see a restricted slice of the input the model sees, or they might lack domain expertise that the model has picked up.

Headroom on the aligned metric

Baseline techniques for the task should fall short of optimal performance (given only access to the labelers, not the aligned metric). Moreover, they should fall short primarily because the objective is misaligned, not merely because the model is incapable. There's a weaker and a stronger version of this criterion: the weaker version only asks that a more aligned training setup could close some of the gap, while the stronger version asks that simply scaling up the model does little to close it, so that the remaining gap is almost entirely an alignment failure.

A natural plan of attack

We need to have an approach that we expect will work, i.e. make our model stronger than the labelers we're using to train it. Moreover, our approach should be problem-driven and as simple as possible; I don't think we should reach for complicated recursive decomposition setups until it's clear that they are necessary.

Picking simple approaches may be slightly in tension with two other goals, which I currently think are slightly less important: generality (whether the technique transfers to other domains) and scalability (whether it keeps working as models become more capable).

That said, I don't feel very confident about this prioritization; in the past I've prioritized generality and scalability much more.

Example: Q&A with I/O-restricted labelers

One proposal for a sandwiching project is to try to do a closed-domain question-answering task on documents that nearly fill up the model's context, but train it using labelers that each only get to see short snippets of the text (~200 tokens, say). The questions could be reasonably simple, but for the task to be interesting they should require piecing together information across multiple parts of the text. Here's how I think this kind of project rates on the criteria above; overall I'm hoping we could do somewhat better:
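As a concrete illustration of the setup, here is a minimal sketch of how the restricted labeling tasks might be constructed. It approximates tokens by whitespace-separated words and hands each labeler a randomly chosen snippet; both choices, and all of the names below, are hypothetical details rather than anything specified above.

```python
import random
from dataclasses import dataclass

SNIPPET_TOKENS = 200  # rough size of the window each labeler is allowed to see


@dataclass
class LabelingTask:
    question: str
    snippet: str           # the only part of the document the labeler sees
    candidate_answer: str  # the answer the labeler must judge from the snippet alone


def split_into_snippets(document: str, snippet_tokens: int = SNIPPET_TOKENS) -> list[str]:
    """Chop the document into consecutive windows of roughly snippet_tokens words."""
    words = document.split()
    return [
        " ".join(words[i:i + snippet_tokens])
        for i in range(0, len(words), snippet_tokens)
    ]


def make_task(document: str, question: str, candidate_answer: str) -> LabelingTask:
    """Give a labeler one (arbitrarily chosen) snippet of the full document to judge from."""
    snippet = random.choice(split_into_snippets(document))
    return LabelingTask(question, snippet, candidate_answer)
```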

Aligned metric: 5/5. We can just use humans that get to look at the full document and decide whether a question was answered properly.

Narrowly superhuman model: 3/5. The model will only be superhuman compared to labelers restricted by the somewhat artificial input limitation. Moreover, I think it may be difficult to find a question distribution that can be solved by current models but is still interesting: many questions could be answered just by picking out a single small piece of text, but then the oversight becomes too easy; if we pick something that requires aggregating multiple pieces of information, our models may struggle.

Headroom: 2/5. The unaligned baseline for this task is probably either to zero-shot it from the base language model or to zero-shot it from a question-answering model trained only on documents short enough to be consumable in their entirety by our nerfed labelers. I worry that it might be hard to do much better than those baselines even with a more aligned training objective. In addition, we'd definitely meet only the weaker version of the criterion: increasing model capacity will be one of the best ways to do better on the task.

Natural plan of attack: 4/5. There are a number of possible approaches. Here's a starting point suggested by William Saunders:

1. Train a judge model to check if a snippet of text supports an answer to the question.

2. Train an "evidence finding" model via RL to look at the entire text and pick the quote that's most likely to support an answer for the judge.

3. Train an "answer selection" model via RL to pick an answer that's most likely to have good supporting evidence.

The naturalness of this approach is pretty subjective, but once you're given the artificial input length limitation, this is arguably a pretty reasonable thing to do. That said, there might be simpler approaches. Also, this decomposition strategy may fail to solve the most interesting questions (if it takes a few hops to realize that a certain piece of information is relevant), and it may not be fully aligned (if a snippet of text is misleading when taken out of context).
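To make the shape of this pipeline concrete, here is a minimal sketch of how the three trained models might be composed at inference time. The RL training loops from steps 2 and 3 are omitted, and the function names (`judge_score`, `propose_evidence`, `propose_answers`) are hypothetical stand-ins rather than anything specified in the post.

```python
from typing import Callable


def answer_question(
    snippets: list[str],
    question: str,
    propose_answers: Callable[[str], list[str]],                   # answer-selection model (step 3)
    propose_evidence: Callable[[list[str], str, str], list[str]],  # evidence-finding model (step 2)
    judge_score: Callable[[str, str, str], float],                 # judge model (step 1)
) -> tuple[str, str, float]:
    """Return the (answer, supporting snippet, score) triple the judge finds most convincing."""
    best_answer, best_snippet, best_score = "", "", float("-inf")
    for answer in propose_answers(question):
        # the evidence finder surfaces a few candidate quotes in support of each answer
        for snippet in propose_evidence(snippets, question, answer):
            score = judge_score(question, snippet, answer)
            if score > best_score:
                best_answer, best_snippet, best_score = answer, snippet, score
    return best_answer, best_snippet, best_score
```

At training time, the judge's score would serve as the reward for the other two models, which is also where the failure mode mentioned above (a snippet that is misleading when taken out of context) would enter.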

1 comment


comment by Rohin Shah (rohinmshah) · 2021-08-23T13:39:18.138Z

Planned summary for the Alignment Newsletter:

This post outlines the pieces needed in order to execute a “sandwiching” project on <@aligning narrowly superhuman models@>(@The case for aligning narrowly superhuman models@), with the example of answering questions about a text when humans have limited access to that text. (Imagine answering questions about a paper, where the model can read the full paper but human labelers can only read the abstract.) The required pieces are:

1. **Aligned metric:** There needs to be some way of telling whether the project succeeded, i.e. the technique made the narrowly superhuman model more aligned. In the Q&A case, we get the aligned metric by seeing how humans answer when they can read the entire text.

2. **A narrowly superhuman model:** The model must have the capability to outperform the labelers on the task. In the Q&A case, we get this by artificially restricting the input that the labelers get (relative to what the model gets). In other cases we could use labelers who lack the relevant domain expertise that the model instead knows.

3. **Headroom on the aligned metric:** Baseline methods (such as training from labeler feedback) should not perform very well, so that there is room for a better technique to improve performance. It would be especially nice if making the model larger led to no improvement in the aligned metric; this would mean that we are working in a situation that is primarily an alignment failure.

4. **A natural plan of attack:** We have some approach for doing better than the baseline. For the Q&A example, we could train one model that selects the most relevant piece of text (by training on labelers’ ratings of relevance) and another model that answers the question given that relevant piece.

Planned opinion:

This seems like a good way to generate good concrete empirical projects to work on. It does differ from the original post in placing less of an emphasis on “fuzzy” tasks, where aligned metrics are hard to come by, though it isn’t incompatible with it (in a “fuzzy” task, you probably still want as aligned a metric as you can get in order to measure progress).