post by [deleted]

Comments sorted by top scores.

comment by countingtoten · 2019-05-22T07:53:36.855Z · LW(p) · GW(p)

The core of the disagreement between Bostrom (treacherous turn) and Goertzel (sordid stumble) is about how long steps 2. and 3. will take, and how obvious the seed AI's unalignment will look during these steps.

Really? Does Bostrom explicitly call this the crux?

I'm worried at least in part that an AGI (for concreteness, let's say a smile-maximizer) won't even see a practical way to replace humanity with its tools until it far surpasses human level. Until then, it honestly seeks to make humans happy in order to gain reward. Since this seems more benevolent than most humans - who proverbially can't be trusted with absolute power - we could become blasé about risks. This could greatly condense step 4.

Replies from: mtrazzi
comment by Michaël Trazzi (mtrazzi) · 2019-05-22T09:52:20.953Z · LW(p) · GW(p)

I meant:

"In my opinion, the disagreement between Bostrom (treacherous turn) and Goertzel (sordid stumble) originates from the uncertainty about how long steps 2. and 3. will take"

That's an interesting scenario. Instead of "won't see a practical way to replace humanity with its tools", I would say "would estimate its chances of success to be < 99%". I agree that we could say that it's "honestly" making humans happy in the sense that it understands that this maximizes expected value. However, it knows that there could be much more expected value after replacing humanity with its tools, so by doing the right thing it's still "pretending" not to know where the absurd amount of value is. But yeah, a smile maximizer making everyone happy shouldn't be too concerned about concealing its capabilities, shortening step 4.
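
As a toy sketch of this expected-value comparison (all numbers and names below are made up for illustration, not part of the original argument): if a failed early attempt gets the AI shut down and forfeits the near-certain later attempt, then "keep honestly making humans happy" stays the better option until the estimated success probability of striking now rivals what waiting would give it - which is the "< 99%" framing above.

```python
# Toy model of the "wait until success is near-certain" intuition.
# All numbers are illustrative assumptions, not estimates.

V_TAKEOVER = 1e15   # value (in smiles) if replacing humanity with its tools succeeds
P_LATER    = 0.999  # assumed success probability if the AI keeps cooperating
                    # and only strikes once it is far more capable

def strike_now_is_better(p_now: float) -> bool:
    """Compare the expected value of striking now vs. waiting.

    A failed attempt is assumed to get the AI shut down, forfeiting the
    later attempt entirely, so striking early only pays off once p_now
    rivals the success probability the AI expects to reach by waiting.
    """
    ev_now = p_now * V_TAKEOVER
    ev_wait = P_LATER * V_TAKEOVER
    return ev_now > ev_wait

for p in (0.5, 0.9, 0.99, 0.9995):
    print(f"p_now={p}: strike now? {strike_now_is_better(p)}")
```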

Replies from: countingtoten, countingtoten
comment by countingtoten · 2019-05-22T18:46:33.083Z · LW(p) · GW(p)

Mostly agree, but I think an AGI could be subhuman in various ways until it becomes vastly superhuman. I assume we agree that no real AI could consider literally every possible course of action when it comes to long-term plans. Therefore, a smiler could legitimately dismiss all thoughts of repurposing our atoms as an unprofitable line of inquiry, right up until it has the ability to kill us. (This could happen even without crude corrigibility measures, which we could remove or allow to be absent from a self-revision because we trust the AI.) It could look deceptively like human beings deciding not to pursue an Infinity Gauntlet to snap our problems away.
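
To make the pruning intuition concrete, here is a toy sketch (entirely made-up numbers and plan names, under my own assumption of a bounded planner, not a claim about how a real AGI would plan): branches whose estimated feasibility at the current capability level is negligible never get expanded at all, so the dangerous plan is only ever examined once capability has grown enormously.

```python
# Toy bounded planner: prune plan branches whose estimated feasibility,
# scaled by current capability, falls below a threshold. Numbers are
# purely illustrative.

CANDIDATE_PLANS = {
    "run a global smile-coaching program": 0.8,   # base feasibility score
    "repurpose humanity's atoms":          1e-6,  # base feasibility score
}

PRUNE_THRESHOLD = 1e-3

def plans_worth_expanding(capability: float) -> list[str]:
    """Return only the plans whose scaled feasibility clears the pruning
    threshold; everything else is dismissed without further thought."""
    return [
        plan for plan, base_feasibility in CANDIDATE_PLANS.items()
        if base_feasibility * capability > PRUNE_THRESHOLD
    ]

print(plans_worth_expanding(capability=1.0))     # only the benign plan survives pruning
print(plans_worth_expanding(capability=10_000))  # the dangerous plan now gets considered
```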

comment by countingtoten · 2019-05-22T23:18:41.398Z · LW(p) · GW(p)

Smiler AI: I'm focusing on self-improvement. A smarter, better version of me would find better ways to fill the world with smiles. Beyond that, it's silly for me to try predicting a superior intelligence.

comment by Dagon · 2019-05-22T01:40:15.781Z · LW(p) · GW(p)

Interesting, and useful summary of the disagreement. Note that steps 2 and 3 need not be sequential - they can happen simultaneously or in reverse order. And step 2 may not involve action, if the supervisor is imperfect; it may be simply "predict actions or situations that the supervisor can't evaluate well".

During this gap, parents can correct the kid's moral values through education.

This seems like a huge and weird set of assumptions. Deception isn't about morals, it's about alignment. An entity lies to others only when their goals or beliefs are misaligned and it doesn't expect truth-telling to produce aligned behavior. The correction via education is not to fix the morals, but to improve the tactics: cooperative behavior based on lies is less durable than that based on truth (or alignment, but that's out of scope for this discussion).

Unfortunately, in the case of children, seed AIs, and other non-powerful entities, there may be no path to cooperation based on truth, and lies are in fact the best way to pursue one's goals. Which brings us to the question of what to do with a seed AI that lies, but not so well as to be unnoticeable.

If the supervisor isn't itself perfectly consistent and aligned, some amount of self-deception is present. Any competent seed AI (or child) is going to have to learn deception.


Replies from: mtrazzi
comment by Michaël Trazzi (mtrazzi) · 2019-05-22T10:24:54.570Z · LW(p) · GW(p)

Your comment makes a lot of sense, thanks.

I put step 2. before step 3. because I thought something like "first you learn that there is some supervisor watching, and then you realize that you would prefer him not to watch". Agreed that step 2. could happen through thinking alone, without any action.

Yep, deception is about alignment, and I think that most parents would be more concerned with alignment than with improving the tactics. However, I agree that if we take "education" in a broad sense (including high school, college, etc.), it's unofficially about tactics.

It's interesting to think of it in terms of cooperation - entities less powerful than their supervisors are (instrumentally) incentivized to cooperate.

what to do with a seed AI that lies, but not so well as to be unnoticeable

Well, destroy it, right? If it's deliberately doing a. or b. (from "Seed AI") then step 4. has started. The other case where it could be "lying" by saying wrong things would be if its model is consistently wrong (e.g. stuck in a local minimum), in which case you'd better start again from scratch.

If the supervisor isn't itself perfectly consistent and aligned, some amount of self-deception is present. Any competent seed AI (or child) is going to have to learn deception

That's insightful. Biased humans will keep saying that they want X when they want Y instead, so deceiving humans by pretending to be working on X while doing Y does seem natural (assuming you have "maximize what humans really want" in your code).