[Preprint] Pretraining Language Models with Human Preferences

post by Giulio (thesofakillers) · 2023-02-21T11:44:27.423Z · LW · GW · 0 comments

This is a link post for https://arxiv.org/abs/2302.08582


Surprised no one has posted about this paper from Anthropic, NYU, and the University of Sussex yet.

The paper's conditional-training objective, which prepends a <|good|> or <|bad|> control token based on a reward model's score, is very reminiscent of the Decision Transformer, where scalar return-to-go tokens are prepended to the input. I believe CICERO also does something similar, conditioning on player Elo ratings when training its dialogue generation model.
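To make the analogy concrete, here is a minimal sketch of conditional training in the paper's <|good|>/<|bad|> style. The reward_fn, threshold, and base model below are toy placeholders rather than the paper's actual reward model or setup, and the paper applies control tokens per segment rather than per document; this is only meant to show the shape of the objective.

```python
# Minimal sketch of conditional training: prepend a control token chosen by a
# reward threshold, then optimise the ordinary next-token language-modelling loss.
# reward_fn and the 0.5 threshold are toy placeholders for illustration only.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the control tokens and grow the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|good|>", "<|bad|>"]})
model.resize_token_embeddings(len(tokenizer))

def reward_fn(text: str) -> float:
    """Stand-in for a learned reward model or rule-based scorer."""
    return 1.0 if "please" in text else 0.0

def make_example(text: str, threshold: float = 0.5):
    # The only change relative to ordinary pretraining is the reward-dependent prefix.
    control = "<|good|>" if reward_fn(text) >= threshold else "<|bad|>"
    enc = tokenizer(control + text, return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()  # standard LM objective on the full sequence
    return enc

batch = make_example("Could you please pass the salt?")
loss = model(**batch).loss  # train on this loss as usual
loss.backward()

# At inference time, condition on the desired behaviour by prefixing <|good|>.
prompt = tokenizer("<|good|>The weather today", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```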

From a discussion with James Chua [LW · GW] on AISS's Slack, we noted similarities between this work and Charlie Steiner [LW · GW]'s Take 13: RLHF bad, conditioning good [LW · GW]. James is developing a library ("conditionme") specifically for rating-conditioned language modelling and was looking for feedback, which is what prompted the discussion. We figured a natural piece of future work is extending the conditioning from the discrete <good> vs <bad> tokens to scalar rewards. James pointed out that this requires some care with the tokenizer, which he hopes to address in part with conditionme.
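To illustrate the tokenizer caution James mentioned: a scalar rating written out as text is split into a varying number of subword pieces by a standard BPE tokenizer, so the model never sees the rating as a single, consistent unit. The snippet below just demonstrates that problem and one possible workaround (bucketing the scalar into a small set of dedicated tokens); it is an assumption-laden sketch, not conditionme's actual interface.

```python
# Why scalar conditioning needs tokenizer care: naively writing the reward as
# text yields a variable number of subword tokens per rating.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for r in (0.7, 0.73, 0.731):
    prefix = f"reward: {r} "
    pieces = tok.convert_ids_to_tokens(tok(prefix)["input_ids"])
    print(f"{prefix!r} -> {pieces}")

# One possible workaround (a sketch, not conditionme's implementation): bucket the
# scalar into a fixed vocabulary of dedicated tokens, so every rating level is
# exactly one token at a known position.
buckets = [f"<|reward_{i}|>" for i in range(11)]   # <|reward_0|> ... <|reward_10|>
tok.add_special_tokens({"additional_special_tokens": buckets})

def reward_token(r: float) -> str:
    r = max(0.0, min(1.0, r))                      # clamp to [0, 1]
    return buckets[round(r * 10)]

print(reward_token(0.731))                         # -> <|reward_7|>
```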
