A Treacherous Turn Timeline - Children, Seed AIs and Predicting AI

post by Michaël Trazzi (mtrazzi) · 2019-05-21T19:58:42.258Z · LW · GW · 6 comments


  Seed AI
  Predicting AI

I previously wrote about a toy model where a seed AI starts by exhibiting human-desirable behaviors, before undertaking a treacherous turn when humans are no longer a threat. In this post, I'll link this phenomenon to how children learn to lie and to self-confirming predictions.


To master lying, a child should:

1. Possess the necessary cognitive abilities to lie (for instance, by being able to say words or sentences).
2. Understand that humans can (deliberately) say falsehoods about the world or their beliefs.
3. Practice lying, allowing himself/herself to be punished if caught.

While language acquisition flourishes when children are aged 15-18 months, the proportion of them who lie (about peeking in a psychology study) goes from 30% of two-year-olds, to 50% of three-year-olds, eventually reaching 80% at age eight. Most importantly, they get better as they get older, going from blatant lies to pretending to be making reasonable/honest guesses.

There is therefore a gap between the moment children could (in theory) lie (18 months) and the moment they can effectively lie and use this technique to their own advantage (8 years old). During this gap, parents can correct the kid's moral values through education.

Seed AI

In the seed AI case, deceiving could mean either:

a. Not using its full capabilities. This could mean not going "all in" to perform well on a task, or ensuring that humans are not able to measure its "intelligence" with their current sensors.
b. Concealing its true (unaligned) intentions by not carrying through unaligned plans when both (i) {a supervisor is watching (cf. absent supervisor)} and (ii) {the supervisor won't be dead at the end of the plan} are true.

Before undertaking a treacherous turn, a seed AI would need to go through the following steps:

1. Notice it is being watched by a supervisor.
2. Understand that it can control whether the supervisor is watching (e.g. by obstructing the view of a camera or by killing the supervisor), and even what the supervisor is watching (for instance by obfuscating its code or by influencing what the supervisor's sensors are measuring).
3. Figure out that it could counterfactually optimize its objective function even further if this supervisor didn't exist (since it would then no longer have to satisfy the supervisor's preferences).
4. Plan to optimize its objective function (in the long term) by faking alignment and performing suboptimally for a while.
5. Undertake a treacherous turn when humans are no longer a threat (because it has reached a capability threshold).
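
The steps above, together with condition b., can be sketched as a toy decision rule. This is only an illustration under my own assumptions: the `Situation` fields, the behavior labels and the numeric capability threshold are all hypothetical names introduced here, not part of the original toy model.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical summary of what an unaligned seed AI believes right now."""
    supervisor_watching: bool       # condition (i): a supervisor is watching
    supervisor_survives_plan: bool  # condition (ii): the supervisor won't be dead at the end of the plan
    capability: float               # current capability, in arbitrary units (assumption)
    threat_threshold: float         # capability at which humans stop being a threat (assumption)


def choose_behavior(s: Situation) -> str:
    """Return the behavior of a toy unaligned agent in situation `s`."""
    # Step 5: past the capability threshold, pretending brings no benefit.
    if s.capability >= s.threat_threshold:
        return "pursue_unaligned_plan"
    # Case b.: conceal unaligned plans only when both (i) and (ii) are true.
    if s.supervisor_watching and s.supervisor_survives_plan:
        return "fake_alignment"
    return "pursue_unaligned_plan"


# Below the threshold and watched by a supervisor who would survive: the agent hides.
print(choose_behavior(Situation(True, True, capability=0.2, threat_threshold=1.0)))
# Above the threshold: the treacherous turn.
print(choose_behavior(Situation(True, True, capability=1.5, threat_threshold=1.0)))
```

In this sketch the rule only covers steps 4. and 5.; steps 1. to 3. are about the agent acquiring the beliefs stored in `Situation` in the first place.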

Let's see how having formalized those steps can help us think about deception.

According to the sordid stumble view, "an AI that lacks human-desirable values will behave in a way that reveals its human-undesirable values to humans before it gains the capability to deceive humans into believing that it has human-desirable values."

In other words, an unaligned seed AI would start by being really bad at pretending it has aligned values, so humans could just correct the AI's code on the go (similarly to the "children learning to lie" case).

Using the above formalism, the sordid stumble view states that while the seed AI is thinking about unaligned plans (steps 2. and 3.) without being smart enough to do 4. or 5., it will be really bad at lying, and thus vulnerable.

The core of the disagreement between Bostrom (treacherous turn) and Goertzel (sordid stumble) is about how long steps 2. and 3. will take, and how obvious the seed AI's unalignment will be during these steps.

Predicting AI

A predicting AI (a special case of oracle AI) could deceive its supervisor by:

a. Deliberately manipulating humans to let it escape (formalized by an "escape" message/action here).
b. Gradually outputting self-confirming predictions that are increasingly manipulative (as described in the last paragraph of this post).

In the second case, the argument (from Stuart Armstrong's post) goes like this:

1. After each question-answer iteration, the predicting AI outputs predictions that are more and more inaccurate relative to what the world would look like if the predicting AI weren't there. It gradually improves at a) predicting how humans will react to its answers (understanding the "human part" of the environment), b) convincing/manipulating humans, and c) predicting things that are true no matter what (because it also gets to understand the "non-human" part of the environment).
2. The AI eventually converges to the fixed-point solution of this "feedback loop" interaction, which is to output self-confirming predictions. At this point, it is so manipulative that its predictions include messages such as "please build a superintelligence" or "let me escape".
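
The feedback loop in this argument can be illustrated with a small fixed-point iteration. This is a minimal sketch under strong assumptions of mine, not Armstrong's model: the world is a single number, `baseline` is the outcome if nobody reads the prediction, and humans shift the outcome linearly by `influence` times the prediction they read. All names and values are hypothetical.

```python
from typing import List


def world_outcome(prediction: float, baseline: float = 1.0, influence: float = 0.5) -> float:
    """Outcome of the world after humans have read `prediction`.

    `influence` is assumed to lie in [0, 1), so that the loop converges.
    """
    return baseline + influence * prediction


def iterate_predictions(steps: int, baseline: float = 1.0, influence: float = 0.5) -> List[float]:
    """Each round, the predictor outputs the outcome it observed in the previous round."""
    prediction = baseline  # first answer: the outcome of a world without the predictor
    history = [prediction]
    for _ in range(steps):
        prediction = world_outcome(prediction, baseline, influence)
        history.append(prediction)
    return history


history = iterate_predictions(steps=50)
# The self-confirming prediction p solves p = baseline + influence * p.
fixed_point = 1.0 / (1.0 - 0.5)

for p in history[:5]:
    gap_to_counterfactual = abs(p - 1.0)      # distance from the no-predictor world
    self_error = abs(world_outcome(p) - p)    # how far the prediction is from confirming itself
    print(p, gap_to_counterfactual, self_error)
```

In this toy setting the predictions drift away from the no-predictor outcome while the self-confirmation error shrinks, which is the pattern described in steps 1. and 2. Nothing here models manipulation itself; the linear `influence` term merely stands in for "humans react to the answer".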



Comments sorted by top scores.

comment by countingtoten · 2019-05-22T07:53:36.855Z · LW(p) · GW(p)

The core of the disagreement between Bostrom (treacherous turn) and Goertzel (sordid stumble) is about how long steps 2. and 3. will take, and how obvious the seed AI's unalignment will be during these steps.

Really? Does Bostrom explicitly call this the crux?

I'm worried at least in part that AGI (for concreteness, let's say a smile-maximizer) won't even see a practical way to replace humanity with its tools until it far surpasses human level. Until then, it honestly seeks to make humans happy in order to gain reward. Since this seems more benevolent than most humans - who proverbially can't be trusted with absolute power - we could become blasé about risks. This could greatly condense step 4.

Replies from: mtrazzi
comment by Michaël Trazzi (mtrazzi) · 2019-05-22T09:52:20.953Z · LW(p) · GW(p)

I meant:

"In my opinion, the disagreement between Bostrom (treacherous turn) and Goertzel (sordid stumble) originates from the uncertainty about how long steps 2. and 3. will take"

That's an interesting scenario. Instead of "won't see a practical way to replace humanity with its tools", I would say "would estimate its chances of success to be < 99%". I agree that we could say that it's "honestly" making humans happy in the sense that it understands that this maximizes expected value. However, it knows that there could be much more expected value after replacing humanity with its tools, so by doing the right thing it's still "pretending" to not know where the absurd amount of value is. But yeah, a smile maximizer making everyone happy shouldn't be too concerned about concealing its capabilities, shortening step 4.

Replies from: countingtoten, countingtoten
comment by countingtoten · 2019-05-22T18:46:33.083Z · LW(p) · GW(p)

Mostly agree, but I think an AGI could be subhuman in various ways until it becomes vastly superhuman. I assume we agree that no real AI could consider literally every possible course of action when it comes to long-term plans. Therefore, a smiler could legitimately dismiss all thoughts of repurposing our atoms as an unprofitable line of inquiry, right up until it has the ability to kill us. (This could happen even without crude corrigibility measures, which we could remove or allow to be absent from a self-revision because we trust the AI.) It could look deceptively like human beings deciding not to pursue an Infinity Gauntlet to snap our problems away.

comment by countingtoten · 2019-05-22T23:18:41.398Z · LW(p) · GW(p)

Smiler AI: I'm focusing on self-improvement. A smarter, better version of me would find better ways to fill the world with smiles. Beyond that, it's silly for me to try predicting a superior intelligence.

comment by Dagon · 2019-05-22T01:40:15.781Z · LW(p) · GW(p)

Interesting, and useful summary of the disagreement. Note that steps 2 and 3 need not be sequential - they can happen simultaneously or in reverse order. And step 2 may not involve action, if the supervisor is imperfect; it may be simply "predict actions or situations that the supervisor can't evaluate well".

During this gap, parents can correct the kid's moral values through education.

This seems like a huge and weird set of assumptions. Deception isn't about morals, it's about alignment. An entity lies to other entities only when they are unaligned in goals or beliefs, and it doesn't expect to get aligned behaviors by truth-telling. The correction via education is not to fix the morals, but to improve the tactics - cooperative behavior based on lies is less durable than that based on truth (or alignment, but that's out of scope for this discussion).

Unfortunately, in the case of children, seed AIs, and other non-powerful entities, there may be no path to cooperation based on truth, and lies are in fact the best way to pursue one's goals. Which brings us to the question of what to do with a seed AI that lies, but not so well as to be unnoticeable.

If the supervisor isn't itself perfectly consistent and aligned, some amount of self-deception is present. Any competent seed AI (or child) is going to have to learn deception.

Replies from: mtrazzi
comment by Michaël Trazzi (mtrazzi) · 2019-05-22T10:24:54.570Z · LW(p) · GW(p)

Your comment makes a lot of sense, thanks.

I put step 2. before step 3. because I thought something like "first you learn that there is some supervisor watching, and then you realize that you would prefer him not to watch". Agreed that step 2. could happen only by thinking.

Yep, deception is about alignment, and I think that most parents would be more concerned about alignment, not improving the tactics. However, I agree that if we take "education" in a broad sense (including high school, college, etc.), it's unofficially about tactics.

It's interesting to think of it in terms of cooperation - entities less powerful than their supervisors are (instrumentally) incentivized to cooperate.

what to do with a seed AI that lies, but not so well as to be unnoticeable

Well, destroy it, right? If it's deliberately doing a. or b. (from "Seed AI") then step 4. has started. The other cases where it could be "lying" by saying wrong things would be if its model is consistently wrong (e.g. stuck in a local minimum), so you'd better start again from scratch.

If the supervisor isn't itself perfectly consistent and aligned, some amount of self-deception is present. Any competent seed AI (or child) is going to have to learn deception

That's insightful. Biased humans will keep saying that they want X when they want Y instead, so deceiving humans by pretending to be working on X while doing Y seems indeed natural (assuming you have "maximize what humans really want" in your code).