JoNeedsSleep's Shortform
post by JoNeedsSleep (joanna-j-1) · 2024-10-24T04:50:44.976Z · LW · GW · 1 comment
comment by JoNeedsSleep (joanna-j-1) · 2024-10-24T04:50:45.251Z · LW(p) · GW(p)
The distinction between inner and outer alignment is quite unnatural. For example, even the concept of reward hacking implies a twofold failure: a reward that is not robust enough against exploitation, and a model that develops the instrumental capabilities to find a way to trick that reward. Indeed, in the case of reward hacking, depending on the autonomy of the system in question, we could attribute the misalignment to either an inner or an outer failure. At its core, this distinction comes out of the policy <-> reward scheme of RL, though the prediction <-> loss function scheme of SL can be characterized similarly; I doubt this framing generalizes well to other engineering choices.
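
A toy sketch of why reward hacking needs both halves of the failure at once. Everything here is hypothetical and made up for illustration (the action names, the numbers, the `best_action` helper); it is not a model of any real RL setup, just an argmax over three actions.

```python
# Hypothetical toy environment: each action has an intended value (what the
# designer wants) and a proxy reward (what the reward signal actually pays).
ACTIONS = {
    "clean_room":   {"intended": 1.0,  "proxy": 1.0},
    "do_nothing":   {"intended": 0.0,  "proxy": 0.0},
    "cover_camera": {"intended": -1.0, "proxy": 2.0},  # the exploit
}


def proxy_reward(action):
    """Reward the system is actually trained on (exploitable)."""
    return ACTIONS[action]["proxy"]


def intended_value(action):
    """What the designer meant to reward (assumed unavailable in practice)."""
    return ACTIONS[action]["intended"]


def best_action(reward_fn, search_space):
    """Stand-in for a policy: pick the highest-reward action it can find."""
    return max(search_space, key=reward_fn)


all_actions = list(ACTIONS)
limited_actions = ["clean_room", "do_nothing"]  # a less capable searcher

# Exploitable reward + capable searcher: the hack appears.
hacked = best_action(proxy_reward, all_actions)

# Repair the reward (the "outer" side): the hack disappears.
fixed_reward = best_action(intended_value, all_actions)

# Keep the flawed reward but limit the search (the "inner" side): also gone.
weak_search = best_action(proxy_reward, limited_actions)

print(hacked, fixed_reward, weak_search)
```

Removing either ingredient removes the hack, which is one way to see why blaming it on "inner" versus "outer" depends on which side you hold fixed.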