Stories of Continuous Deception

post by Michaël Trazzi (mtrazzi) · 2019-05-31T14:31:47.486Z · LW · GW · 5 comments


  The Unbiased Newsfeed is Biased Towards You
  Dumbing Down Complex Plans
  It Was Your Idea

In my recent [LW · GW] posts [LW · GW], I considered scenarios where an AI realizes that it would be instrumentally useful to deceive humans (about its alignment or capabilities) when weak, then undertake a treacherous turn when humans are no longer a threat. Those scenarios have the following (implicit) assumptions:

This discontinuity could be followed by a moment of vulnerability where it isn't really good at concealing its intentions (humans could detect its misalignment). Thus, according to the sordid stumble view, it would "behave in a way that reveals its human-undesirable values to humans before it gains the capability to deceive humans into believing that it has human-desirable values".

In this post, I'll present gradual deception stories where, even without assumptions i) and ii), the AI continuously learns to deceive human, hence constituting counterexamples for the sordid stumble view.

The Unbiased Newsfeed is Biased Towards You

Humans are biased towards story closer to their beliefs, as they estimate that those are more likely to be true. Now, let's imagine a Machine Learning model with the goal of "aggregating stories into an unbiased newsfeed for a human H", the human providing a bias score for each story.

By doing so, the human is unfortunately specifying "try to sound unbiased, taking into account my prejudice against stories with high inferential distance".

At the beginning, the AI doesn't really know what constitutes an unbiased newsfeed, so its bias score is high. At some point, it stumbles upon a story that appears unbiased but is actually biased toward H, and updates its parameters in this direction.

After multiple steps, the model mostly outputs news that look unbiased but are biased toward H. It is now deceiving the human into the trustworthiness of its newsfeed, without even realizing it.

At this point, another phenomenon happens: because H is constantly receiving news from the model, its beliefs start shifting toward the average opinion of the news produced by the model.

If the same newsfeed model is now serving millions of people, and for each individual it tries to minimize a bias score, then it is likely to stumble upon humans that are more susceptible to be deceived. Finally, if it generalizes correctly, it will end up deceiving most people.

Dumbing Down Complex Plans

Let's consider an AI built to explain plans. Similarly, the model is rewarded by a numerical human feedback, estimate of how much the human understood from the explanation.

Now, let's also assume that AIs that turn out to be smart too fast are under more surveillance. In that case, the AI might end up dumbing out its explanations because either:

Note that the AI is not pretending to be dumb. We're talking about mostly simple machines anyway. The "dumb" is our external judgment of what it's doing.

It Was Your Idea

Deception can happen when:

For instance, deception naturally happens when a couple is tired of arguing: one partner starts presenting his/her decisions as being an original idea from the other.

More generally, an oracle AI aimed at shaping human decision making would be incentivized to present the decisions as being "close to what the human querying the oracle would guess".

Indeed, even if at the beginning the AI tries to output the most accurate answers, it will end up (after human feedback) finding that the answers that give the most reward are the one that make the human believe "it was close to my original guess after all".


Comments sorted by top scores.

comment by rohinmshah · 2019-06-01T03:21:01.409Z · LW(p) · GW(p)

These seem reasonable as ways in which machine learning can fail, but how do any of them lead to a treacherous turn that kills all humans?

Replies from: Pattern
comment by Pattern · 2019-06-01T21:15:48.990Z · LW(p) · GW(p)

They're giving examples of deception being learned which don't meet their starting assumptions:

i) We're considering a seed AI able to recursively self-improve without human intervention.
ii) There is some discontinuity at the conception of deception, i.e. when it first thinks of its treacherous turn plan.

I think this is being presented because a treacherous turn requires deception. (This may be a necessary condition, but not a sufficient one.)

Replies from: rohinmshah, countingtoten
comment by rohinmshah · 2019-06-02T17:09:10.679Z · LW(p) · GW(p)
I think this is being presented because a treacherous turn requires deception.

Right; my claim is that deception learned in this way will not lead to a treacherous turn, because the agent here is learning a deceptive policy, as opposed to learning the concept of deception, which is what you would typically need for a treacherous turn.

Replies from: mtrazzi
comment by Michaël Trazzi (mtrazzi) · 2019-06-03T14:01:21.388Z · LW(p) · GW(p)

I agree that these stories won't (naturally) lead to a treacherous turn. Continuously learning to deceive (a ML failure in this case, as you mentioned) is a different result. The story/learning should be substantially different to lead to "learning the concept of deception" (for reaching an AGI-level ability to reason about such abstract concepts), but maybe there's a way to learn those concepts with only narrow AI.

comment by countingtoten · 2019-06-02T07:29:26.844Z · LW(p) · GW(p)

I think this is being presented because a treacherous turn requires deception.

As I've mentioned before [LW(p) · GW(p)], that is technically false (unless you want a gerrymandered definition).