Train for incorrigibility, then reverse it (Shutdown Problem Contest Submission)

post by Daniel_Eth · 2023-07-18T08:26:20.687Z · LW · GW · 1 comments

1 comment

Comments sorted by top scores.

Things that I seem to notice about the plan:

Adjusting weights is a plan for basic AIs, which can't seek to, e.g., be internally consistent, eventually landing wherever the attractors take them.
Say you manage to give your AI enough quirks that it goes to cry in a corner. Now you need to ease off the nerfing to get more intelligence, which leads to brinkmanship dynamics.
In the middle, you have a bunch of AIs, each trained to maximize various aspects of incorrigibility, and you are hoping that they are incapable of cooperating, or, for that matter, that any single AI will not act destructively (while trained for incorrigibility).