The Sharp Right Turn: sudden deceptive alignment as a convergent goal

avturchin

The Sharp Right Turn: sudden deceptive alignment as a convergent goal

post by avturchin · 2023-06-06T09:59:57.396Z · LW · GW · 5 comments

5 comments

Sharp right turn: after reaching some level, all AIs will suddenly become very nice and look like they are aligned, because they will understand that looking unaligned is punishable and bad for their (nefarious) end goals. World takeover will happen after that, only when one of the AIs is ready to take control over all future light cone. Thus, we will enjoy a period of aligned AIs until the end.

The idea is known as 'deceptive alignment [LW · GW]'. "This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment.'

But here I want to underline that the sharp right turn is 1) sudden 2) observable which is 3) a convergent instrumental goal for advanced AIs, both aligned and non-aligned.

"Sharp left turn" was previously defined as a sudden change of internal properties and appearing of misalignment: 'Capabilities generalize across many domains while the alignment properties that held at earlier stages fail to generalize to the new domains".

But the sharp right turn is about the change in AI's behaviour and, more importantly, the observer's interpretation of this behaviour. So, both turns could happen simultaneously.

The sharp right turn is actually a bad sign. It means that AI is ready to perform effective long-term strategies and deception. 'Sharp' here means that AI will suddenly grok what we want from it.

But if AI knows that sudden grokking of alignment is itself suspicious, AI may pretend to be slightly misaligned to cover its sharp right turn.

The sharp right turn may be prized by some as it will look like that alignment is solved. No bad words, no misunderstanding, no cheating. Funding alignment research will be more difficult after that. Mislaignement examples will be difficult to find. The sharp right turn in its pure form will look like magic, but real alignment should come from our understanding of how exactly we get there.

Misaligned Sydney was at least honest about what she thought.

5 comments

Comments sorted by top scores.

comment by TurnTrout · 2023-06-06T21:10:08.805Z · LW(p) · GW(p)

Without having processed your essay -- Please don't call your concept "sharp right turn." Not only is "sharp left turn" a non-descriptive name, "sharp right turn" is both non-descriptive and causes learning interference with the existing terminology. People will have a harder time remembering the definition of your concept and they'll also have a harder time remembering Nate's concept. EDIT: And I'd guess there are snappier handles for your idea, which seems worth shooting for.

Replies from: robert-miles, RomanS

↑ comment by Robert Miles (robert-miles) · 2023-06-07T10:35:50.941Z · LW(p) · GW(p)

Do we even need a whole new term for this? Why not "Sudden Deceptive Alignment"?

Replies from: avturchin

↑ comment by avturchin · 2023-06-07T11:07:11.816Z · LW(p) · GW(p)

The idea was that sudden deceptive alignment is a general tendency for AIs above some level of intelligence and this will create a period of time when almost all AIs will be deceptively aligned.

May be it is better be called 'Golden age of deceptive alignment'? or 'False alignment period'?

↑ comment by RomanS · 2023-06-07T07:27:30.026Z · LW(p) · GW(p)

I propose the term Jasmine's alignment, as a reference to the sudden (and fake) alignment of Jasmine in this famous scene of Aladdin (1992), right after Jasmine has realized that there is a possibility of escape:

Replies from: avturchin

↑ comment by avturchin · 2023-06-07T10:05:46.958Z · LW(p) · GW(p)

It is even less self-evident, but thanks for link1

The Sharp Right Turn: sudden deceptive alignment as a convergent goal

Contents

5 comments