Comment by jade-bishop on Can coherent extrapolated volition be estimated with Inverse Reinforcement Learning? · 2019-04-15T20:03:13.613Z · score: 1 (1 votes) · LW · GW

Thank you for your feedback! I haven't read this yet, but it comes pretty close to a discussion I had with a friend over this post.

Essentially, her argument started with a simple counterexample: She bought peanut M&Ms when she didn't want to, and didn't realise she was doing it until afterwards. In a similar situation, hungry and in the same place, she had wanted peanut M&Ms to satisfy her hunger; this time she didn't. She knew she didn't want peanut M&Ms, and didn't consciously decide to get them against that want; in this sense, I think a parallel can be drawn with akrasia, where rationality alone isn't enough.

Her point was this: There has to be a line drawn between "intentional conscious action" and "the result of a complex system of interacting parts that puppets the meat sack that holds our brain, sometimes in ways we don't intend." On a base level, this could result in, say, an AI that acts like a normal human but sometimes buys peanut M&Ms against its volition. At the level of individual agents, where an AI is no more or less capable than a human, this isn't much of an issue, and such quirks could make individual AI agents more convincing.

But if you want to make a superintelligent AI to run your ideal utopia, you don't want it to decide to feed everyone peanut M&Ms against their will on a whim.

The biggest issue is that we can't determine the difference between "intentional action" and "unintentional response". If we could, then it would (according to her) be trivial to find out what the CEV of humanity is, no estimation needed.

My largest assumption was that the lowest common denominator of human behaviour is "principled reasoning in pursuit of fixed, though unstated, goals". More realistically, as another friend (and the post you linked) pointed out, the lowest common denominator of human behaviour is going to be "reproduce", which has very unfortunate implications for the Friendliness of this hypothetical agent.

A number of things could be done to ameliorate this, such as not including any means to reproduce or any data supporting reproduction in the trajectories, but they all seem inadequate or ad hoc. I don't want to staple together a bunch of things I barely understand and declare it the Solution To AI (not that I was attempting to do that, anyway), especially when the issue isn't necessarily with the technology and theory. As the peanut-M&M-purchasing friend put it, the technology is sufficient but this post overestimates humans. This wasn't actually where I expected to run into an issue, and it shifts the problem from "improve technology and theories" to... what, "improve humans"? I'm at a loss as to where to go from here; inverse reinforcement learning has a demonstrable use-case and benefits, but the data is... not good. Garbage in gives garbage out. Is it really possible to improve human behaviour (or our analysis/collection of human behaviour) to achieve better results?

Comment by jade-bishop on Conspiracy World is missing. · 2019-04-15T16:29:25.988Z · score: 4 (3 votes) · LW · GW

You can find it backed up on the Wayback Machine: Additionally, links to the same content are available on the LessWrong wiki at

Edit: Additionally, to address your first question, tags were likely lost in the shift to LessWrong 2.0, but I haven't been around long enough to confirm that.

Can coherent extrapolated volition be estimated with Inverse Reinforcement Learning?

2019-04-15T03:23:30.967Z · score: 14 (4 votes)