Reward splintering as reverse of interpretability

post by Stuart_Armstrong · 2021-08-31T22:27:30.625Z

There is a sense in which reward splintering is the reverse of interpretability.

Interpretability is basically:

Reward splintering is:
