Reward splintering as reverse of interpretability

post by Stuart_Armstrong · 2021-08-31T22:27:30.625Z

There is a sense in which reward splintering is the reverse of interpretability.

Interpretability is basically:

Reward splintering is:
