Reward splintering as reverse of interpretability
post by Stuart_Armstrong · 2021-08-31T22:27:30.625Z
There is a sense in which reward splintering is the reverse of interpretability.
Interpretability is basically:
- "This algorithm is doing something really complicated; nevertheless, I want a simple model that explains essentially what it is doing. If there is no obvious simple model, I want an explanatory model to be taught to me with the least amount of complexity, distortion, or manipulation."
Reward splintering is:
- "Here is my simple model of what the algorithm should be doing. I want the algorithm to essentially do that, even if its underlying behaviour is really complicated. If it must deviate from this simple model, I want it to deviate in a way that has the least amount of complexity, distortion, or manipulation."