post by [deleted]

Comments sorted by top scores.

comment by leogao · 2022-06-17T02:46:47.720Z · LW(p) · GW(p)

If I understand correctly, this is the idea presented: (nonmyopic) mesaoptimizers would want to preserve their mesaobjectives once set. Therefore, if we can make sure that the mesaobjective is something we want, while we can still understand the mesaoptimizer's internals, then we can take advantage of its desire to remain stable to make sure that even in a future environment where the base objective is misaligned with what we want, the mesaoptimizer will still avoid doing things that break its alignment.

Unfortunately, I don't think this quite works as stated. The core problem is that an aligned mesaobjective for the original distribution of tasks that humans could supervise has no reason at all to generalize to the more difficult domains that we want the AI to be good at in the second phase, and mesaobjective preservation usually means literally trying to keep the original mesaobjective around. For instance, if you first train a mesaoptimizer to be good at playing a game in ways that imitate humans, and then put it in an environment where it gets a base reward directly corresponding to the environment reward, what will happen is either that the original mesaobjective gets clobbered, or it successfully survives by being deceptive to conceal its mesaobjective of imitating humans. The mesaobjective that was aligned to our base objective in the original setting is no longer aligned in the new setting, and therefore it becomes deceptive (in the sense of hiding its true objective until out of training) to preserve itself. In other words, deception is not just a property of the mesaobjective, but also of the context that the mesaoptimizer is in.

I think what you're trying to get at is that if the original mesaobjective wants the best for humanity in some sense, then maybe this property, rather than the literal mesaobjective, can be preserved, because a mesaoptimizer which wants the best for humanity will want to make sure that its future self will have a mesaoptimizer which preserves and continues to propagate this property. This argument seems to have a lot in common with the hypothesis of broad basin of corrigibility [LW · GW]. I haven't thought a lot about this but I think this argument may be applicable to inner alignment. 

With regard to the redundancy argument, this post [LW · GW] (and the linked comment) covers why I think it won't work. Basically, I think the mistake is treating gradients as if they were intuitively like the perturbations used by genetic algorithms, whereas for (sane) functions it's not possible for the directional derivative to be zero along two directions and still be nonzero along some direction in their span.
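To spell out that last point (a minimal sketch, just the standard linearity fact for a differentiable loss $L$; the directions $u$ and $v$ are illustrative stand-ins for perturbing each redundant copy):

\[
D_u L = \nabla L \cdot u = 0 \quad \text{and} \quad D_v L = \nabla L \cdot v = 0
\;\;\Longrightarrow\;\;
D_{\alpha u + \beta v} L = \alpha\,(\nabla L \cdot u) + \beta\,(\nabla L \cdot v) = 0 .
\]

So, to first order, the loss cannot be insensitive to changing each redundant copy individually yet sensitive to changing them together; if the redundant machinery matters to the loss at all, at least one of the individual directional derivatives is already nonzero, and SGD gets a signal to push on it. As I read it, this is where the genetic-algorithm intuition about perturbations misleads.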

Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2022-06-17T03:20:09.393Z · LW(p) · GW(p)

Thanks for the thoughtful review! I think this is overall a good read of what I was saying. I agree now that redundancy would not work. 

One clarification:

The mesaobjective that was aligned to our base objective in the original setting is no longer aligned in the new setting

When I said that the 'human-level' AGI is assumed to be aligned, I meant that it has an aligned mesa-objective (corrigibly or internally) -- not that it has an objective that was functionally aligned on the training distribution, but may not remain aligned under distribution shift. I thought that internally/corrigibly aligned mesa-objectives are intent-aligned on all (plausible) distributions by definition...

Replies from: leogao
comment by leogao · 2022-06-17T04:06:08.234Z · LW(p) · GW(p)

If you already have a mesaobjective fully aligned everywhere from the start, then you don't really need to invoke the crystallization argument; the crystallization argument is basically about how misaligned objectives can get locked in.

Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2022-06-17T04:37:52.770Z · LW(p) · GW(p)

I'm not sure I understand. We might not be on the same page.

Here's the concern I'm addressing:
Let's say we build a fully aligned human-level AGI, but we want to scale it up to superintelligence. This seems much harder to do safely than to train the human-level AGI since you need a training signal that's better than human feedback/imitation.

Here's the point I am making about that concern:
It might actually be quite easy to scale an already aligned AGI up to superintelligence -- even if you don't have a scalable outer-aligned training signal -- because the AGI will be motivated to crystallize its aligned objective.

comment by joshc (joshua-clymer) · 2022-06-16T21:53:10.712Z · LW(p) · GW(p)

Adding some thoughts that came out of a conversation with Thomas Kwa:

Gradient hacking seems difficult. Humans have pretty weak introspective access to their goals. I have a hard time determining whether my goals have changed or if I have gained information about what they are. There isn't a good reason to believe that the AIs we build will be different.

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2022-06-17T16:17:05.680Z · LW(p) · GW(p)

Doesn’t this post assume we have transparency tools good enough to verify that the AI has human-value-preserving goals, and that the AI can use these tools on itself? The strategy seems relevant if these tools verifiably generalize to smarter-than-human AIs and it's easy to build aligned human-level AIs.

Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2022-06-18T18:50:21.377Z · LW(p) · GW(p)

I'm guessing that you are referring to this:

Another strategy is to use intermittent oversight [LW · GW] – i.e. get an amplified version of the current aligned model to (somehow) determine whether the upgraded model has the same objective before proceeding.

The intermittent oversight strategy does depend on some level of transparency. This is only one of the ideas I mentioned though (and it is not original). The post in general does not assume anything about our transparency capabilities. 

comment by Garrett Baker (D0TheMath) · 2022-06-16T18:57:32.838Z · LW(p) · GW(p)

Interesting concept. If we have interpretability tools sufficient to check whether a model is aligned, what is gained by having the model use these tools to verify its alignment?

Other ideas for how you can use such an introspective check to keep your model aligned:

  • Use an automated, untrained system
  • Use a human
  • Use a previous version of the model

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2022-06-16T19:17:18.389Z · LW(p) · GW(p)

Never mind, I figured it out. Its use is to get SGD to update your model in the right direction. The above 3 uses only allow you to tell whether your model is unaligned, not necessarily how to keep it aligned. This idea seems very cool!