How could a friendly AI deal with humans trying to sabotage it? (like how present day internet trolls introduce such problems)

post by M. Y. Zuo · 2021-11-27T22:07:15.960Z · LW · GW · 10 comments

This is a question post.

After closely observing the behaviour of humans, online and in person, for a few years, I have become ever more convinced that a significant fraction (>1%) of the regular adult population would intentionally sabotage any future friendly AI. (I am speaking of Canada and the US, though this probably applies to most if not all countries.) Some wouldn’t even be particularly against friendly AI but would still behave destructively for kicks, such as sadists.

And that’s not to mention career criminals, or the really deranged who probably would be even less inhibited. I would bet even many teenagers would do so just for curiosity’s sake.

The phenomenon, I think, would closely mirror the behaviour of present internet trolls when given an opportunity and an anonymizing screen to shield them. Probably a significant fraction would be the same people. I assume everyone reading this has already encountered such individuals, as they appear in most large online communities including LW, so I will spare the examples.

The most straightforward solution I can think of is for the ‘friendly AI’ to treat these people adversarially, as if they were really trying to destroy or confound it, even if they might not actually want to. (This excludes AIs that are completely isolated from uncontrolled interaction.) But of course this introduces the problem of the supposedly ‘friendly’ AI needing to ‘combat’, ‘doubt’, ‘interrogate’, ‘oppress’, etc., some percentage of humans in order to carry out its normal functions.

Also, there seems to be a slippery slope: once any such ‘friendly’ AI regularly carries out such operations, it seems highly unlikely that it will be able to resist applying the same methods to merely annoying humans who are not quite as dangerous, such as those who reduce trust by breaking written promises, confidences, rules, etc., due to carelessness, temptation, and so on. Then onwards to compulsive alcoholics, gamblers, drug addicts, cultists, and others who negatively affect communities. Then perhaps even seemingly average people who nonetheless ruin the conversations they participate in, such as by invoking Godwin’s law or other specious comparisons.

Admittedly, this slippery slope may take decades or centuries to slide down, as there are defensible Schelling points and other factors along the way, well discussed on LW, that would counteract this phenomenon.


answer by Charlie Steiner · 2021-11-28T12:16:12.225Z · LW(p) · GW(p)

With sufficient predictive capability, nobody needs to be oppressed, at least not in the usual sense. They will just find themselves nudged down the path that lets them be satisfied without much harming others.

I'm probably more comfortable with this future than most. I think that it's an interesting question, how we should relate emotionally to being predicted and optimized for.

And of course for AI that isn't able to model humans well enough to notice and ignore/correct bad behavior, yes obviously we shouldn't give a bunch of unstable high-leverage power to lots of randos. But it still could be reasonable to give relatively stable, well-aggregated power to the masses.

comment by M. Y. Zuo · 2021-12-01T01:19:22.093Z · LW(p) · GW(p)

So it seems that an incipient AI needs a protected environment in which to develop into one capable of reliably carrying out such activities, much as children are raised in protected environments before adulthood.


Comments sorted by top scores.

comment by Dagon · 2021-11-28T02:47:42.866Z · LW(p) · GW(p)

The question is whether the AI has any better mechanisms for dealing with this than current human members of society do.  We seem to be heading down that slippery slope pretty quickly in the bigger cities on the US West Coast.  

What "alignment" means remains under-defined. Most people assume it's the more pleasant part of the distribution of human values, not a perfect representation of each existing human. That may mean the solution is for the AI to determine the selectorate, or subset of humans who'll be actively judging and correcting it, and find ways to make them happy. Presuming these people are squeamish, that probably doesn't mean elimination of the disruptive, but it might include containment and minimization of interaction.

Replies from: conor-sullivan
comment by Conor Sullivan (conor-sullivan) · 2021-11-29T06:00:03.867Z · LW(p) · GW(p)

Could the selectorate be all living (adult) humans? 

Replies from: Dagon
comment by Dagon · 2021-11-29T14:28:57.541Z · LW(p) · GW(p)

Well, no, as that includes the trolls and other destructive people.  

Replies from: conor-sullivan
comment by Conor Sullivan (conor-sullivan) · 2021-11-30T01:32:36.897Z · LW(p) · GW(p)

I mean something like direct democracy. Trolls wouldn't be able to shape the ASI's behavior unless they are 50%+1 of the human population. Something like that.

comment by avturchin · 2021-11-28T11:52:26.310Z · LW(p) · GW(p)

An AI may find ways to satisfy such people without causing harm to society. It could give them the option to hunt robots, and even let them think that this is actually harming the ruling ASI.

Replies from: conor-sullivan
comment by Conor Sullivan (conor-sullivan) · 2021-11-29T06:05:06.496Z · LW(p) · GW(p)

It seems to me that most people would prefer that there be some way for the majority of people to "stop" the ASI (whatever that means) if it were tyrannical. Meaning, it should be possible for anarcho-primitivists and religious traditionalists to advocate for shutting down the machines and reversing the singularity somehow, and if 50%+1 of the human population agreed, we would somehow be returned to that earlier state. I don't think that's realistic, but I think that's what people mean when they say "I don't want to live in a future where AI takes over the world."

comment by Big Tony · 2021-11-29T18:55:00.185Z · LW(p) · GW(p)

The Metamorphosis of Prime Intellect covered this — the AI treated human sabotage like a kindly parent would treat an angry child: tolerance for the sabotage attempts, in the knowledge that it would be entirely futile.

I guess it depends on exactly how friendly the AI is, how much it wants to avoid non-existence, and how vulnerable it is.

Replies from: M. Y. Zuo
comment by M. Y. Zuo · 2021-12-01T01:16:35.138Z · LW(p) · GW(p)

That seems to be a plausible course of action if the AI(s) were in an unchallengeable position. But how would they get there without first resolving this question?