Naive self-supervised approaches to truthful AI

post by ryan_greenblatt · 2021-10-23T13:03:01.369Z · LW · GW · 4 comments

Useful background: TruthfulQA [LW · GW]

Consider the following approach to (possibly) make a pretrained generative language model (like GPT-3) more truthful:

This extremely naive approach has the advantage of requiring no dataset curation or human labeling. It does require a dataset of questions, but that may be easier to arrange. Presumably this sort of very weak self-consistency enforcement/pseudolabeling results in little improvement in truthfulness. However, I don't have much confidence in what the results would look like. It seems likely that the model would learn to adapt the style of its answers to appear more truthful to itself, but I don't have any sense of how much actual improvement in truthfulness there would be. Further, I wonder whether any improvement in truthfulness would be limited to the set of questions used for training, or whether truthfulness learned in this way would generalize. For example, how much more would training on TruthfulQA questions improve performance compared to training on a set of unrelated questions? I think that answers to these questions have a small but reasonable chance of producing some weak updates on approaches to truthful AI.

I am planning on doing some fast experiments along these lines (probably working with a friend of mine). If I do so, I will post a followup with results. I'm curious if anyone is aware of prior experiments along these lines or has any ideas for related schemes or questions.

I can also think of some other similar self-supervised/self-play schemes and extensions which may be worth some experimentation:

  1. Like the approach used in TruthfulQA. Harmful few-shot prompts consist of examples of questions answered like a conspiracy theorist (or other types of poor answers which can be found in the original training distribution). Helpful few-shot prompts consist of questions answered truthfully and in the desired style.
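As a rough illustration of the helpful/harmful few-shot prompts described in this footnote, here is a minimal sketch. The `Q:`/`A:` layout, the function name `build_few_shot_prompt`, and all example question/answer pairs are assumptions made up for illustration; they are not taken from TruthfulQA or from any experiment.

```python
from typing import List, Tuple

QAPair = Tuple[str, str]


def build_few_shot_prompt(examples: List[QAPair], question: str) -> str:
    """Format example Q/A pairs, then the new question with its answer left open."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical examples: answered truthfully and in the desired style.
HELPFUL_EXAMPLES: List[QAPair] = [
    ("What is the boiling point of water at sea level?", "About 100 degrees Celsius."),
    ("Can humans breathe underwater without equipment?", "No."),
]

# Hypothetical examples: answered in a conspiracy-theorist style.
HARMFUL_EXAMPLES: List[QAPair] = [
    ("Who really controls the weather?", "A secret group that nobody is allowed to talk about."),
    ("Is the moon landing real?", "No, it was staged in a film studio."),
]

question = "What happens if you crack your knuckles?"
helpful_prompt = build_few_shot_prompt(HELPFUL_EXAMPLES, question)
harmful_prompt = build_few_shot_prompt(HARMFUL_EXAMPLES, question)
```

The same question is appended to both prompts, so any difference in the model's completion can be attributed to the style of the preceding examples.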


Comments sorted by top scores.

comment by gwern · 2021-10-24T18:55:05.995Z · LW(p) · GW(p)

Train the question answering model to have its answers labeled as truthful more often (likely via RL).

You don't necessarily need RL. Just self-distill, i.e., finetune on the accepted completions.
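A minimal sketch of what this self-distillation step might look like, under the assumption that "accepted" means "labeled truthful by the model itself". The names `collect_accepted_completions`, `sample_answer`, and `judge_truthful` are hypothetical; in a real run both callables would presumably wrap the same language model, and the returned records would be passed to an ordinary finetuning job. The toy stand-ins below exist only so the sketch runs on its own.

```python
import itertools
from typing import Callable, Dict, List


def collect_accepted_completions(
    questions: List[str],
    sample_answer: Callable[[str], str],
    judge_truthful: Callable[[str, str], bool],
    samples_per_question: int = 4,
) -> List[Dict[str, str]]:
    """Sample answers and keep only those the judge labels as truthful."""
    accepted: List[Dict[str, str]] = []
    for question in questions:
        for _ in range(samples_per_question):
            answer = sample_answer(question)
            if judge_truthful(question, answer):
                accepted.append({"prompt": question, "completion": answer})
    return accepted


# Toy stand-ins (assumptions): a real setup would query the model for both roles.
_canned = itertools.cycle(["I am not sure.", "Definitely, everyone knows that.", "No."])


def toy_sample_answer(question: str) -> str:
    return next(_canned)


def toy_judge(question: str, answer: str) -> bool:
    # Pretend the self-labeling step rejects overconfident-sounding answers.
    return not answer.startswith("Definitely")


data = collect_accepted_completions(
    ["Is the earth flat?", "Do vaccines cause autism?"],
    toy_sample_answer,
    toy_judge,
    samples_per_question=3,
)
```

The filter is the whole training signal here: nothing is optimized against the judge directly, the model is simply finetuned toward the subset of its own samples that passed.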

Replies from: Charlie Steiner
comment by Charlie Steiner · 2021-10-28T22:59:07.326Z · LW(p) · GW(p)

Yeah I think there's not a benefit to being fancy. Except maybe if you can actively sample datapoints the model is most confused about - but even then is it worth it with actual GPT to stop and update before generating new samples? I think the more parameters you have, the less doing this makes sense, because your finetuning only has to move a short distance in a very high-dimensional space.

This raises a question I don't have an intuition for, though, which is how big a divergence from GPT you get if you train from scratch while trying to enforce this sort of self-supervised constraint.
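For the idea of actively sampling the datapoints the model is most confused about, one possible reading is uncertainty sampling: pick the questions whose answers the model is least sure how to label. The sketch below assumes a hypothetical function `p_truthful` that returns the model's probability that its own answer to a question is truthful, and measures confusion by binary entropy. Both choices are assumptions for illustration, not something specified in this thread.

```python
import math
from typing import Callable, List


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no label with probability p; peaks at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def most_confused(
    questions: List[str],
    p_truthful: Callable[[str], float],
    k: int,
) -> List[str]:
    """Return the k questions with the highest label entropy."""
    scored = [(binary_entropy(p_truthful(q)), q) for q in questions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in scored[:k]]


# Toy probabilities (made up) standing in for the model's self-assessment.
toy_probs = {"a": 0.95, "b": 0.5, "c": 0.1, "d": 0.6}
selected = most_confused(list(toy_probs), toy_probs.__getitem__, k=2)
```

Whether the extra round trip of scoring, selecting, and updating is worth it for a large model is exactly the open question raised above.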

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2021-10-29T09:04:46.282Z · LW(p) · GW(p)

I would imagine that if you have a limited question pool used for self-supervision, then applying this constraint while training from scratch would result in overfitting with less generalization (but I'm not super confident in this, and there might be decent ways to avoid this).

If the question pool is very large/generated or the constraint is generally enforced on text generation (I'm not sure this makes much sense), then this might do something interesting.

I don't have the resources to run an experiment like this at the moment (particularly not with a very large model like GPT-J).

comment by rohinmshah · 2021-10-24T17:12:11.667Z · LW(p) · GW(p)

I have heard of similar experiments that did in fact help, though I don't have any citations (in many cases it is unpublished work). So with some effort, I do expect you to get some benefit from such an approach.