Ben Amitay's Shortform

post by Ben Amitay (unicode-70) · 2023-07-15T11:17:06.697Z · 2 comments



comment by Ben Amitay (unicode-70) · 2023-07-15T11:17:06.785Z

I had an idea for fighting goal misgeneralization. It doesn't seem very promising to me, but it does feel close to something interesting. I'd like to read your thoughts:

  1. Use IRL to learn which values are consistent with the actor's behavior.
  2. When training the model to maximize the actual reward, regularize it to score lower according to the values learned by IRL. That way, the agent is incentivized to signal that it has no other values (and is somewhat incentivized against power-seeking). A rough sketch of the combined objective is below.
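Here is a minimal sketch of what the objective in step 2 might look like, not a worked-out implementation: the true return on a trajectory minus a penalty for scoring well under the IRL-inferred reward functions. The names `true_reward`, `irl_rewards`, and `lambda_reg` are hypothetical placeholders for whatever the IRL step and training setup actually produce.

```python
def regularized_return(trajectory, true_reward, irl_rewards, lambda_reg=0.1):
    """Score a trajectory: true return minus a penalty for doing well
    under the IRL-learned reward functions (a stand-in for 'other values')."""
    true_return = sum(true_reward(s, a) for s, a in trajectory)
    # Penalty: average return of the trajectory under each IRL-inferred reward.
    irl_returns = [sum(r(s, a) for s, a in trajectory) for r in irl_rewards]
    penalty = sum(irl_returns) / len(irl_returns) if irl_returns else 0.0
    return true_return - lambda_reg * penalty

# Toy usage: states/actions are integers, rewards are simple functions.
trajectory = [(0, 1), (1, 1), (2, 0)]
true_reward = lambda s, a: float(a == 1)      # the intended objective
irl_rewards = [lambda s, a: float(s >= 1)]    # a spurious value IRL might pick up
print(regularized_return(trajectory, true_reward, irl_rewards))  # 2.0 - 0.1*2.0 = 1.8
```

The policy would then be trained to maximize this regularized return instead of the raw true return, so pursuing goals other than the intended one costs it score.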
comment by Ben Amitay (unicode-70) · 2023-07-15T11:27:17.038Z

I probably don't understand the shortform format, but it seems like others can't create top-level comments. So you can comment here :)