[Preprint] Pretraining Language Models with Human Preferences
post by Giulio (thesofakillers) · 2023-02-21T11:44:27.423Z · LW · GW
This is a link post for https://arxiv.org/abs/2302.08582
Surprised no one has posted about this paper from Anthropic, NYU, and the University of Sussex yet:
- Instead of fine-tuning on human preferences, they incorporate human feedback directly in the pre-training phase, conditioning the model on <good> or <bad> feedback tokens placed at the beginning of training sequences (see the sketch after this list).
- They find this to be Pareto-optimal among the five pre-training objectives they consider, greatly reducing the rate of undesired outputs while retaining standard LM pre-training downstream performance AND outperforming RLHF fine-tuning in terms of preference satisfaction.
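To make the conditioning concrete, here is a minimal sketch of the idea using a Hugging Face-style tokenizer and model. The token strings, the sequence-level labelling, and the choice of GPT-2 are illustrative assumptions, not the paper's exact implementation (the paper labels segments using a learned scorer).

```python
# Minimal sketch of conditional pretraining on feedback tokens.
# Assumes Hugging Face transformers; <good>/<bad> strings are illustrative.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the two control tokens and grow the embedding matrix accordingly.
tokenizer.add_special_tokens({"additional_special_tokens": ["<good>", "<bad>"]})
model.resize_token_embeddings(len(tokenizer))

def make_conditional_example(text: str, is_preferred: bool) -> str:
    """Prepend a feedback token so the LM learns p(text | feedback)."""
    tag = "<good>" if is_preferred else "<bad>"
    return f"{tag}{text}"

# During pretraining, every sequence is labelled by some scorer
# (e.g. a toxicity classifier); here is_preferred is assumed given.
example = make_conditional_example("The weather today is pleasant.", is_preferred=True)
inputs = tokenizer(example, return_tensors="pt")
loss = model(**inputs, labels=inputs["input_ids"]).loss  # standard LM loss

# At inference time, condition on <good> to steer generation toward preferred behaviour.
prompt = tokenizer("<good>Once upon a time", return_tensors="pt")
generated = model.generate(**prompt, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=False))
```

The key point is that the training objective stays ordinary next-token prediction; only the data is augmented with a control prefix, which is why downstream LM performance is largely preserved.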
This conditioning is very reminiscent of the Decision Transformer, where scalar reward tokens are prepended to the input. I believe CICERO also does something similar, conditioning on Elo scores during dialogue generation training.
In a discussion with James Chua [LW · GW] on AISS's Slack, we noted similarities between this work and Charlie Steiner [LW · GW]'s Take 13: RLHF bad, conditioning good [LW · GW]. James is developing a library ("conditionme") specifically for rating-conditioned language modelling and was looking for feedback, which is what prompted the discussion. We figured a natural piece of future work is extending the conditioning from the discrete <good> vs. <bad> tokens to scalar rewards; James pointed out that this requires some care with the tokenizer, which he hopes to address in part with conditionme.
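To illustrate the tokenizer issue: naively prepending a scalar like "reward: 0.73" lets the tokenizer split the number into an unpredictable number of subword pieces. One way around this (a hedged sketch; conditionme may well take a different approach) is to bucket the scalar into a fixed vocabulary of dedicated reward tokens, so the conditioning signal is always exactly one token long.

```python
# Sketch: bucket a scalar reward in [0, 1] into one of a fixed set of
# control tokens, avoiding multi-token numeric strings. Illustrative only.

def reward_to_token(reward: float, num_buckets: int = 10) -> str:
    """Map a reward in [0, 1] to one of num_buckets dedicated control tokens."""
    reward = min(max(reward, 0.0), 1.0)
    bucket = min(int(reward * num_buckets), num_buckets - 1)
    return f"<reward_{bucket}>"

# Each bucket token would be registered as a single special token, e.g.:
# tokenizer.add_special_tokens({"additional_special_tokens":
#     [f"<reward_{i}>" for i in range(10)]})
# model.resize_token_embeddings(len(tokenizer))

print(reward_to_token(0.73))  # "<reward_7>", prepended to the training sequence
```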