Using lying to detect human values
post by Stuart_Armstrong · 2018-03-15T11:37:05.408Z · LW · GW · 6 comments
In my current research, I've often been re-discovering things that are trivial and obvious, but that suddenly become mysterious on closer inspection. For instance, it's blindingly obvious that the anchoring bias is a bias, and almost everyone agrees on this. But this becomes puzzling when we realise that there is no principled way of deducing the rationality and reward of an irrational agent.
Here's another puzzle. Have you ever seen someone try to claim that they have certain values that they manifestly don't have? Seen their facial expressions, their grimaces, their hesitation, and so on?
There's an immediate and trivial explanation: they're lying, and they're doing it badly (which is why we can actually detect the lying). But remember that there is no way of detecting the preferences of an irrational agent. How can someone lie about something that is essentially non-existent, their values? Even if someone knew their own values, why would the tell-tale signs of lying surface, since there's no way that anyone else could ever check their values, even in principle?
But here evolution is helping us. Humans have a self-model of their own values; indeed, this is what we use to define what those values are. And evolution, being lazy, re-uses the self-model to interpret others. Since these self-models are broadly similar from person to person, people tend to agree about the rationality and values of other humans.
So, because of these self-models, our own values "feel" like facts. And because evolution is lazy, lying and telling the truth about our own values triggers the same responses as lying or telling the truth about facts.
This suggests another way of accessing the self-model of human values: train an AI to detect human lying and misdirection on factual matters, then feed that AI a whole corpus of human moral/value/preference statements. Given the normative assumption that lying about facts resembles lying about values, this is another avenue by which AIs can learn human values.
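As an illustration only, here is a minimal sketch of that pipeline, assuming a hypothetical labelled corpus of factual statements (marked truthful or deceptive) and an unlabelled corpus of value statements. The text-only features and the simple classifier are stand-ins for whatever lie-detection model one would actually train, which might also use tone of voice, facial expression, and so on.

```python
# Minimal sketch, not the post's actual proposal in code: train a "lie detector"
# on factual statements, then apply it unchanged to value statements.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: factual statements labelled for honesty.
factual_statements = [
    "I was at home all evening.",
    "I handed in the report on Friday.",
    "I have never seen that document.",
    "I definitely locked the door when I left.",
]
labels = [1, 1, 0, 0]  # 1 = truthful, 0 = deceptive (invented labels)

# Train the detector on factual matters only.
lie_detector = make_pipeline(TfidfVectorizer(), LogisticRegression())
lie_detector.fit(factual_statements, labels)

# Apply the same detector to value statements, relying on the normative
# assumption that lying about values looks like lying about facts.
value_statements = [
    "I value animals more than economic growth.",
    "Honesty matters more to me than my career.",
]
honesty_scores = lie_detector.predict_proba(value_statements)[:, 1]
for statement, score in zip(value_statements, honesty_scores):
    print(f"estimated honesty {score:.2f}: {statement}")
```

The interesting design choice is that the detector never sees labelled value statements at all; any signal it gives on them comes entirely from the assumed transfer between factual and value-laden lying.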
So far, I've been assuming that human values are a single, definite object. In my next post, I'll look at the messy reality of under-defined and contradictory human values.
6 comments
comment by Gordon Seidoh Worley (gworley) · 2018-03-15T17:57:19.856Z · LW(p) · GW(p)
But remember that there is no way of detecting the preferences of a rational agent.
I'm suspicious. Is there a reference for this claim? It seems for this to be true we need to at least be very precise about what we mean by "preferences".
↑ comment by Rohin Shah (rohin-shah) · 2018-03-15T23:24:19.489Z · LW(p) · GW(p)
Pretty sure that he meant to say "an irrational agent" instead of "a rational agent", see https://arxiv.org/abs/1712.05812
↑ comment by Stuart_Armstrong · 2018-03-16T03:51:32.550Z · LW(p) · GW(p)
Indeed! I've now corrected that error.
comment by avturchin · 2018-07-05T20:33:48.462Z · LW(p) · GW(p)
If someone claims to have a value that he obviously doesn't have, it doesn't mean that he is lying, that is, consciously presenting wrong information. In most cases of such behaviour that I have observed, they truly believed that they were kind, animal-loving, or whatever other positive thing, and non-consciously ignored the instances of their own behaviour that demonstrated a different set of preferences, strikingly obvious to external observers.
↑ comment by Stuart_Armstrong · 2018-07-06T10:43:49.149Z · LW(p) · GW(p)
"I value animals" is pretty worthless; "I value animals more than economic growth" would be more informative.