comment by nostalgebraist ·
2020-09-09T08:20:42.136Z · LW(p) · GW(p)
Various thoughts -- focused on critique because I find that most interesting to write down. (I didn't have a strong negative or positive reaction to the paper.)
IMO there are two almost unrelated ideas going on in OpenAI's preference learning work (this paper and the original one).
- First, the idea of collecting binary preference annotations on LM samples, and (in some way) tuning the LM so its samples are better aligned with the preferences.
- Second, a specific method for tuning the sampling behavior of LMs to maximize an (arbitrary) score function defined over entire samples.
As their first step, they do supervised learning on the data from the first idea to produce a "reward model." (In this paper, this happens roughly once, with little active learning of the reward model over successive batches of annotation.)
This model assigns a score to an entire sample of N tokens, but for LM finetuning, you want something that tells you how good each token is individually. The second idea is the way they choose to bridge the gap, with a specific RL technique.
The overall results look good, but it's not clear how to attribute that across the two ideas, and OpenAI's discussion tends to blur the two together. They can perhaps learn high-quality reward models from preference data (first idea), but it's less clear they are using these models to tune sampling in a good way (gwern said the same thing after trying it).
On the flipside, their RL approach to sampling treats the reward as a black box, so it has nothing to do with preference data per se; you could apply it with any score function.
As far as I can tell, their final "human evaluation" was done by the same labelers who provided the preference annotations. This makes me concerned about a variant of "evaluating on training data." It's not surprising that a model tuned on someone's annotations agrees with that person more than a model which wasn't.
For example, in Fig. 3, it looks like the "supervised" baseline tuned on tl;dr was rated about as highly as true examples from tl;dr itself (!), but not as well as the final model. I think the authors are arguing that tl;dr and any other supervised dataset will have flaws, and preference data lets you get closer to what people actually want.
This seems true, but is a familiar observation from supervised learning, motivating e.g. active learning. It would be nice to see how much the difference can be mitigated by just augmenting tl;dr with annotations but otherwise doing supervised learning, vs. using their RL approach.
Compared to tl;dr, the story for CNN/DM is more complicated, but again the models they outperform have not seen any data from their labelers, so maybe it is no surprise they have flaws according to those same labelers.
The importance of annotation quality, close relationships with annotators, clear guidelines, etc. will be familiar to anyone with experience in annotation for ML. It's good that OpenAI is doing the right things here, but this is not a new result -- rather, other researchers resort to MTurk and similar due to time/money constraints, while OpenAI has the freedom to do the right things everyone else wants to do. (That includes building their own internal annotation platform for contracted annotators, which is costly but better in the long term than relying on a janky 3rd party product.)