self-improvement-executors are not goal-maximizers

post by bhauth · 2023-06-01T20:46:16.964Z · LW · GW · 0 comments

Agents are adaptation-executors [? · GW] whose adaptations accomplish goals; they are not goal-maximizers. Treating agents as goal-maximizers is a simplification humans use to make them easier to reason about. This is as true when the goal is self-improvement as it is for anything else.

Let's think about self-improvement in terms of goal specification. You want to accomplish a goal G, and as you add levels, the specification task becomes more indirect:

  - S0: a system whose actions directly accomplish G.
  - S1: a system that creates a suitable S0.
  - S2: a system that creates a suitable S1.
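
To make the levels concrete, here's a deliberately toy sketch. Everything in it (the toy goal of "sort a list", the candidate pool, the function names) is my own illustration, not something with a standard definition; the point is just that what you actually write at each level is a different kind of specification.

```python
# Toy illustration of the indirection levels, with G = "sort a list of numbers".
# All names, the candidate pool, and the tests are made up for illustration.

def s0(xs):
    """S0: a system whose behavior directly accomplishes G (here: sorting)."""
    return sorted(xs)

def s1():
    """S1: a system that produces an S0, e.g. by searching candidate programs
    against a specification of G."""
    candidates = [sorted, lambda xs: list(reversed(xs)), lambda xs: xs]
    tests = [([3, 1, 2], [1, 2, 3]), ([2, 2, 0], [0, 2, 2])]
    for f in candidates:
        if all(f(list(inp)) == out for inp, out in tests):
            return f
    raise RuntimeError("search failed")

def s2():
    """S2: a system that produces a suitable S1 -- it acts on the builder,
    not on the task, so its specification is about modifying search processes."""
    def improved_s1():
        return s1()   # stand-in: a real S2 would change how the search works
    return improved_s1

builder = s2()            # S2 hands back a builder...
solver = builder()        # ...the builder hands back a system that does G...
print(solver([3, 1, 2]))  # ...and only at this level does G get accomplished.
```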

Some people think you can just tell S2 to do G and it will figure things out, but those people are interpreting S2 as a goal-maximizer, and agents are adaptation-executors. You have to specify the adaptations that will cause the actions that create a suitable S1. Reinforcement learning is a method (not the only method!) for specifying adaptations that accomplish a certain goal in the training environment, not a way of giving a system an inherent goal. I do think AI self-improvement is possible - I wrote this [LW · GW]. So I do think that, in some sense, specifying S2 can be easier than directly specifying S0, but I'm under no illusions that it's the same problem.
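
As a minimal sketch of that distinction (a toy tabular Q-learning loop on a made-up chain environment; all numbers and names are illustrative): the reward function shapes the table during training, and the thing you end up deploying is just the table.

```python
# Minimal tabular Q-learning sketch (toy environment; all details made up).
# The point: after training, the agent is just the Q-table (the "adaptations").
# The reward function that shaped it is not carried around as an inherent goal.
import random

N_STATES, ACTIONS = 5, ["left", "right"]
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Reward 1 only at the right end of the chain; that's the training 'goal'."""
    nxt = min(state + 1, N_STATES - 1) if action == "right" else max(state - 1, 0)
    return nxt, 1.0 if nxt == N_STATES - 1 else 0.0

for _ in range(2000):                      # training loop
    s = 0
    for _ in range(20):
        a = random.choice(ACTIONS) if random.random() < 0.2 \
            else max(ACTIONS, key=lambda act: Q[(s, act)])
        nxt, r = step(s, a)
        Q[(s, a)] += 0.1 * (r + 0.9 * max(Q[(nxt, b)] for b in ACTIONS) - Q[(s, a)])
        s = nxt

policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)}
print(policy)  # the deployed system: a lookup table, with no reference to the reward
```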

I'm posting this because I saw this post [LW · GW] (from yesterday) saying that people should try designing AIs that want to be shut down, as a tripwire for dangerous recursive self-improvement, and that idea is based on the above misunderstanding. Also, "shut yourself down" is a particularly poor goal, for two reasons.

  1. It's binary. Training ML systems works much better when you can indicate partial accomplishment of the goal.

  2. It's ill-defined. If you train a system to "shut itself down", what you're actually training it to do is "take provided action X, which also happens to cause shutdown". You then need to keep making that harder if you're using it to test the development of more-capable systems, and if you take away X but leave some other shutdown route Y, the system is no more likely to take Y than to do anything else. Getting negative rewards on average over time doesn't train it to do Y until it has actually done Y a bunch of times and been reinforced for doing Y. (See the sketch after this list.)
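
Here's a toy sketch of both problems (a made-up three-armed bandit, not anything from the linked post): the reward is binary, only action X ever causes shutdown during training, and the learned values say nothing about the untried route Y.

```python
# Toy bandit sketch of the "shut yourself down" objective (all details made up).
# Actions: X causes shutdown during training; Y is another shutdown route the
# system never gets reinforced on; Z does nothing.
import random

values = {"X": 0.0, "Y": 0.0, "Z": 0.0}    # learned action values

def training_reward(action):
    # Binary reward: 1 if the episode ended in shutdown, else 0. No partial credit.
    return 1.0 if action == "X" else 0.0   # only X is wired to shutdown in training

for _ in range(1000):
    a = random.choice(list(values))        # explore
    r = training_reward(a)
    values[a] += 0.05 * (r - values[a])    # incremental value update

print(values)             # roughly {'X': 1.0, 'Y': 0.0, 'Z': 0.0}

# Now suppose X is unavailable at deployment. What's left gives Y no special weight:
remaining = {a: v for a, v in values.items() if a != "X"}
print(remaining)          # Y and Z have the same (zero) value
```

Y ends up with the same value as the do-nothing action, because "shut yourself down" was never the thing being reinforced - "do X" was.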
