Train for incorrigibility, then reverse it (Shutdown Problem Contest Submission)
post by Daniel_Eth · 2023-07-18T08:26:20.687Z · LW · GW · 1 comment
1 comment
comment by Maybe_a · 2023-07-19T16:11:17.481Z · LW(p) · GW(p)
Things that I seem to notice about the plan:
- Adjusting weights is a plan for basic AIs, which can't seek to, e.g., be internally consistent and eventually land wherever the attractors take them.
- Say you manage to give your AI enough quirks that it goes to cry in a corner. Now you need to dial back the nerfing to get more intelligence, which leads to brinkmanship dynamics.
- In the middle, you have a bunch of AIs, each trained to maximize various aspects of incorrigibility, and you are hoping either that they are incapable of cooperating or that no single AI will act destructively (while trained for incorrigibility).