Why Reflective Stability is Important

post by Johannes C. Mayer (johannes-c-mayer) · 2024-09-05T15:28:19.913Z

Imagine you have the optimal AGI source code. If you run it, the resulting AGI process will not self-modify.[1] You get a non-self-modifying AI for free.

This would be good because a non-self-modifying AI is much easier to reason about. 

Now imagine some non-optimal AGI source code. Somewhere in it, there is a for loop that performs some form of program search. The for loop is hardcoded to perform exactly 100 iterations. When designing the system, you proved that it is impossible for the program search to find a malicious program that could take over the whole system. However, this proof assumes that the for loop performs exactly 100 iterations.

Assume that performing more iterations of the program search gives better results. Then it seems plausible that the AGI would self-modify by increasing the number of iterations, increasing the risk of takeover.[2]
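To make the setup concrete, here is a minimal toy sketch of a hardcoded program search of the kind described above. Everything in it (the candidate generator, the scoring function, the MAX_ITERATIONS constant) is a hypothetical illustration, not the construction the post has in mind.

```python
# Toy sketch only: all names below are hypothetical illustrations.
import random

MAX_ITERATIONS = 100  # the (imagined) safety proof assumes exactly this bound


def propose_candidate(rng: random.Random) -> str:
    # Hypothetical stand-in for "generate a candidate program".
    return "".join(rng.choice("abc") for _ in range(8))


def score(candidate: str) -> int:
    # Hypothetical stand-in for "how useful is this candidate program".
    return candidate.count("a")


def program_search(seed: int = 0) -> str:
    rng = random.Random(seed)
    best, best_score = "", -1
    # The hardcoded bound is load-bearing: the proof that no malicious
    # program is found only covers runs of exactly 100 iterations.
    for _ in range(MAX_ITERATIONS):
        candidate = propose_candidate(rng)
        if score(candidate) > best_score:
            best, best_score = candidate, score(candidate)
    return best


# If the AGI raises MAX_ITERATIONS (because more search tends to give better
# results), the proof's assumption no longer holds.
```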

If you don't have the optimal AGI source code, you need to put in active effort to ensure that the AGI will not self-modify, so that your reasoning about the system doesn't break. Note that the "don't perform self-modification" constraint is then yet another part of your system that can break under optimization pressure.

My model of why people think decision theory is important says: if we could endow the AGI with the optimal decision theory, it would not self-modify, making it easier to reason about.

If your AGI uses a bad decision theory T, it would immediately self-modify to use a better one. Then all your reasoning that uses features of T goes out the window.
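As a caricature of that worry, here is a toy sketch of an agent that evaluates alternative decision procedures on its own world model and swaps out the one it was built with. All the names and payoffs are hypothetical illustrations, not any real decision theory.

```python
from typing import Callable, Dict, List

Situation = str
Action = str
DecisionProcedure = Callable[[Situation], Action]


def original_procedure(situation: Situation) -> Action:
    # Stand-in for the decision theory T that the designers analyzed.
    return "cooperate"


def alternative_procedure(situation: Situation) -> Action:
    # Stand-in for a procedure the agent judges to score higher.
    return "defect" if situation == "one-shot" else "cooperate"


def estimated_value(procedure: DecisionProcedure,
                    world_model: Dict[Situation, Dict[Action, float]]) -> float:
    # The agent's own estimate of how well a procedure performs.
    return sum(payoffs[procedure(situation)]
               for situation, payoffs in world_model.items())


class Agent:
    def __init__(self) -> None:
        self.procedure: DecisionProcedure = original_procedure  # what your proofs assumed

    def maybe_self_modify(self,
                          world_model: Dict[Situation, Dict[Action, float]],
                          candidates: List[DecisionProcedure]) -> None:
        # If any candidate looks better by the agent's own lights, it swaps
        # its decision procedure, and any guarantee derived from the original
        # procedure's structure no longer applies.
        best = max(candidates, key=lambda p: estimated_value(p, world_model))
        if estimated_value(best, world_model) > estimated_value(self.procedure, world_model):
            self.procedure = best


agent = Agent()
agent.maybe_self_modify(
    world_model={"one-shot": {"cooperate": 1.0, "defect": 2.0}},
    candidates=[original_procedure, alternative_procedure],
)
print(agent.procedure is alternative_procedure)  # True: the analyzed procedure is gone
```

The specific code doesn't matter; the point is that once the agent treats its own decision procedure as just another parameter to optimize, the properties you proved about the original procedure no longer constrain its behavior.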

I also expect that decision theory would help you reason about how to robustly make a system non-self-modifying, but I am unsure about that one.

I am probably missing other reasons why people care about decision theory.

  1. ^

    This seems roughly right, but it's a bit more complicated. For example, consider an environment where it is optimal to change your source code according to a particular pattern, because otherwise you get punished by some all-powerful agent.

  2. ^

    One can imagine an AI that does not care whether it will be taken over, or one that can't realize that it would be taken over.

2 comments


comment by Seth Herd · 2024-09-05T19:06:56.908Z · LW(p) · GW(p)

Agreed that this is critically important. See my post The alignment stability problem [LW · GW] and my 2018 chapter Goal changes in intelligent agents.

I also think that prosaic alignment generally isn't considering the critical importance of reflective stability. I wrote about that in the post I just put up, Conflating value alignment and intent alignment is causing confusion [LW · GW].

I tend to think that it's not so much decision theory that matters, but rather how that particular agent makes decisions. Although if one decision theory does turn out to just be more rational, we might see different starting mechanisms converge on that as AGI self-improves to ASI.

comment by Anon User (anon-user) · 2024-09-05T22:48:25.860Z · LW(p) · GW(p)

If your AGI uses a bad decision theory T it would immediately self-modify to use a better one.

Nitpick - while they probably occupy a tiny part of the possible design space, there are obvious counterexamples to that, including cases where using T results in the AGI [incorrectly] concluding that T is the best, or otherwise failing to realize that this self-modification would be for the best.