post by [deleted]

This is a link post.


Comments sorted by top scores.

comment by Rachel Freedman (rachelAF) · 2022-12-11T03:58:41.869Z

Where does the reward in step 1 come from? Is it assigned by H? Is it determined by an outside observer? Is the reward function somehow hardcoded into the context?

comment by Past Account (zachary-robertson) · 2022-12-11T04:36:04.806Z

The reward comes from the user, who ranks candidate responses from ChatGPT. This is discussed more in OpenAI's announcement. I edited the post to clarify this.
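For concreteness, here is a minimal sketch of how pairwise rankings can be turned into a scalar reward signal via a Bradley-Terry style loss. The toy linear reward model, embedding dimension, and random data below are illustrative assumptions, not OpenAI's actual setup:

```python
import torch
import torch.nn as nn

# Toy reward model: maps a response embedding to a scalar reward.
# In the real system the reward model is a fine-tuned language model;
# the linear layer and 16-dim embeddings here are placeholders.
reward_model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def ranking_loss(preferred_emb, rejected_emb):
    """Bradley-Terry style loss: the preferred response should score higher."""
    r_pref = reward_model(preferred_emb)
    r_rej = reward_model(rejected_emb)
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()

# One update step from a single human comparison (random toy data).
preferred = torch.randn(1, 16)  # embedding of the response ranked higher
rejected = torch.randn(1, 16)   # embedding of the response ranked lower
optimizer.zero_grad()
loss = ranking_loss(preferred, rejected)
loss.backward()
optimizer.step()
```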

comment by Rachel Freedman (rachelAF) · 2022-12-11T19:28:37.587Z

Thanks for the clarification! From OpenAI's announcement, it looks like this ranking only occurs during the finetuning portion of training (Step 2). But the user doesn't have the opportunity to provide this feedback after deployment. So are you suggesting that ChatGPT gets aligned to the values of the human contractor(s) that provide data during finetuning, and then carries these values forward when interacting with users? I'm asking because one of the key benefits of CIRL games (also called "assistance games") is that they allow the AI to continuously update towards the user's values, without freezing for deployment, and I don't fully understand the connection here.
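For concreteness, a toy sketch of the kind of continuous updating an assistance game permits: the AI keeps a posterior over candidate reward functions and updates it after every observed human choice. The discrete hypothesis space and Boltzmann-rational choice model below are illustrative assumptions, not the formal CIRL solution:

```python
import numpy as np

# Toy assistance-game update: the AI maintains a belief over which of a few
# candidate reward functions the human has, and revises it after each
# observed human choice between two outcomes (A = column 0, B = column 1).
candidate_rewards = np.array([
    [1.0, 0.0],  # theta_0: human only values outcome A
    [0.0, 1.0],  # theta_1: human only values outcome B
    [0.5, 0.5],  # theta_2: human values both equally
])
belief = np.ones(3) / 3  # uniform prior over the three hypotheses

def update_belief(belief, chosen_outcome, beta=2.0):
    """Bayes update assuming a Boltzmann-rational human choice model."""
    utilities = candidate_rewards[:, chosen_outcome]  # U(choice | theta)
    likelihoods = np.exp(beta * utilities)
    likelihoods /= np.exp(beta * candidate_rewards).sum(axis=1)  # P(choice | theta)
    posterior = belief * likelihoods
    return posterior / posterior.sum()

# The human picks outcome A twice; belief shifts toward theta_0.
for choice in [0, 0]:
    belief = update_belief(belief, choice)
print(belief)  # highest mass on theta_0
```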

comment by Past Account (zachary-robertson) · 2022-12-11T19:56:35.927Z

So are you suggesting that ChatGPT gets aligned to the values of the human contractor(s) that provide data during finetuning, and then carries these values forward when interacting with users?

You are correct that this appears to stand in contrast to one of the key benefits of CIRL games, namely that they allow the AI to continuously update towards the user's values. The argument I present is that ChatGPT can still learn something about the preferences of the user it is interacting with through in-context value learning. During deployment, ChatGPT can then update towards the user's values continuously, as in a CIRL game, even though its weights are frozen.
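A rough sketch of what this looks like mechanically: the model's weights stay frozen, but user feedback accumulates in the prompt, so later responses are conditioned on the user's stated preferences. The `query_model` function below is a hypothetical stand-in for a call to a deployed model, not a real API:

```python
# Toy illustration of in-context value learning with a frozen model:
# the weights never change, but each user correction is appended to the
# context, so later outputs are conditioned on the user's preferences.

def query_model(context: str) -> str:
    """Placeholder for a call to a frozen, deployed language model."""
    return f"<response conditioned on {len(context)} chars of context>"

context = "User: Summarize this article for me.\n"
response = query_model(context)

# The user's feedback becomes part of the context, not a weight update.
context += f"Assistant: {response}\n"
context += "User: Too formal. I prefer short, casual summaries.\n"

# The next response can reflect the stated preference purely in-context.
response = query_model(context)
```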