Can there be an indescribable hellworld?

post by Stuart_Armstrong · 2019-01-29T15:00:54.481Z · score: 16 (6 votes) · LW · GW · 11 comments

Can there be an indescribable hellworld? What about an un-summarisable one?

By hellworld, I mean a world of very low value according to our value scales - maybe one where a large number of simulations are being tortured (aka mind crimes).

A hellworld could look superficially positive, if we don't dig too deep [LW · GW]. It could look irresistibly positive [LW · GW].

Could it be bad in a way that we would find indescribable? It seems that it must be possible. The set of things that can be described to us is finite; the set of things that can be described to us without fundamentally changing our values is much smaller still. If a powerful AI was motivated to build a hellworld such that the hellish parts of it were too complex to be described to us, it would seem that it could. There is no reason to suspect that the set of indescribable worlds contains only good worlds.
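To make the counting point concrete, here is a toy sketch (the numbers and the value-preserving predicate are purely hypothetical): anything that can be described to us in a lifetime is a bounded-length string over a finite alphabet, so the set of candidate descriptions is finite, and the value-preserving ones are just a filtered subset of it.

```python
# Toy sketch with hypothetical numbers; nothing here is a real bound.
ALPHABET_SIZE = 128   # an ASCII-sized symbol set
MAX_LENGTH = 20       # kept tiny so the total is printable; the real bound
                      # (a lifetime of reading) is vastly larger, but still finite

# Number of strings of length 0..MAX_LENGTH over the alphabet (a geometric sum).
total_descriptions = sum(ALPHABET_SIZE ** n for n in range(MAX_LENGTH + 1))

def preserves_values(description: str) -> bool:
    """Hypothetical predicate: True if absorbing `description` would not
    fundamentally change the listener's values. We have no way to compute
    this; it only marks out a subset of the finite set counted above."""
    raise NotImplementedError

print(f"at most {total_descriptions} candidate descriptions of length <= {MAX_LENGTH}")
# The claim in the text is that the value-preserving subset is much smaller still.
```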

Can it always be summarised?

Let's change the setting a bit. We have a world W, and a powerful AI A that is giving us information about W. The AI A is aligned/friendly/corrigible or whatever we need it to be. It's also trustworthy, in that it always speaks to us in a way that increases our understanding.

Then if W is an indescribable hellworld, can A summarise that fact for us?

It seems that it can. In the very trivial sense, it can, by just telling us "it's an indescribable hellworld". But it seems it can do more than that, in a way that's philosophically interesting.

A hellworld is ultimately a world that is against our values. However, our values are underdefined and changeable [LW · GW]. So to have any chance of saying what these values are, we need to either extract key invariant values, synthesise our contradictory values [LW · GW] into some complete whole, or use some extrapolation procedure (eg CEV). In any case, there is a procedure for establishing our values (or else the very concept of "hellworld" makes no sense).
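As a rough illustration of what "a procedure for establishing our values" could look like, here is a minimal sketch; the interface and names are purely illustrative, not a commitment to any particular formalisation.

```python
# Illustrative interface only; none of these routes is actually implementable here.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Value:
    name: str        # e.g. "no large-scale mind crime"
    weight: float    # how strongly it counts when evaluating a world
    provenance: str  # which step of the procedure produced it

ValueProcedure = Callable[[List[Value]], List[Value]]

def extract_invariants(raw: List[Value]) -> List[Value]:
    """Route 1: keep only the key values that stay fixed across framings."""
    raise NotImplementedError

def synthesise(raw: List[Value]) -> List[Value]:
    """Route 2: merge our contradictory values into some complete whole."""
    raise NotImplementedError

def extrapolate(raw: List[Value]) -> List[Value]:
    """Route 3: an extrapolation procedure, e.g. something CEV-like."""
    raise NotImplementedError

def establish_values(raw: List[Value], procedure: ValueProcedure) -> List[Value]:
    # Whichever route is chosen, its output is what "hellworld" is defined against.
    return procedure(raw)
```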

Now, it is possible that our values themselves may be indescribable to us now (especially in the case of extrapolations). But A can at least tell us that W is against our values, and provide some description as to the value it is against, and what part of the procedure ended up giving us that value. This does give us some partial understanding of why the hellworld is bad - a useful summary, if you want.
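To give a flavour of what such a partial summary might contain, here is a toy structure; the fields and the example values are purely illustrative.

```python
# Hypothetical structure for the partial summary A could give us, even when
# the full badness of W cannot be described to us.
from dataclasses import dataclass

@dataclass
class HellworldSummary:
    violated_value: str       # which of our (possibly extrapolated) values W is against
    procedure_step: str       # which part of the value-establishing procedure produced that value
    partial_explanation: str  # as much of the "why" as can be conveyed without changing our values

example = HellworldSummary(
    violated_value="no large-scale mind crime",
    procedure_step="invariant extraction from stated and revealed preferences",
    partial_explanation="W runs vast suffering simulations behind a superficially positive surface",
)
```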

On a more meta level, imagine the contrary - that W was a hellworld, but the superintelligent agent A could not indicate what human values it actually violated, even approximately. Since our values are not some ex nihilo thing floating in space, but are derived from us, it is hard to see how something could be against our values in a way that could never be summarised to us. That seems almost definitionally impossible: if the violation of our values can never be summarised, even at the meta level, how can it be a violation of our values?

Trustworthy debate is FAI-complete

The consequence seems to be that we can avoid hellworlds (and, presumably, aim for heaven) by having a corrigible and trustworthy AI that engages in debate or acts as a devil's advocate [LW · GW]. Now, I'm very sceptical of getting corrigible or trustworthy AIs in general, but it seems that if we can, we've probably solved the FAI problem.

Note that even in the absence of a single given way of formalising our values, the AI could list the plausible formalisations for which W was or wasn't a hellworld.
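Here is a minimal sketch of what that check could look like; every name is hypothetical, and the predicates stand in for the trustworthy AI's actual judgement.

```python
# Devil's-advocate check over several candidate formalisations of our values.
from typing import Callable, Dict


class World:
    """Placeholder for a full description of a proposed world W."""


# True means "hellworld under this formalisation". In the setting above the
# verdict would come from the corrigible, trustworthy AI A, not from a
# predicate we can compute ourselves.
Formalisation = Callable[[World], bool]

def devils_advocate(world: World, formalisations: Dict[str, Formalisation]) -> Dict[str, bool]:
    return {name: is_hellworld(world) for name, is_hellworld in formalisations.items()}

# We would then reject any proposed world flagged as hellish under any
# formalisation we take seriously, even without one agreed formalisation.
```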

11 comments

comment by Kaj_Sotala · 2019-01-29T15:50:58.251Z · score: 5 (None votes) · LW · GW
A hellworld is ultimately a world that is against our values. However, our values are underdefined and changeable [LW · GW]. So to have any chance of saying what these values are, we need to either extract key invariant values, synthesise our contradictory values [LW · GW] into some complete whole, or use some extrapolation procedure (eg CEV). In any case, there is a procedure for establishing our values (or else the very concept of "hellworld" makes no sense).

It feels worth distinguishing between two cases of "hellworld":

1. A world which is not aligned with the values of that world's inhabitants themselves. One could argue that in order to merit the designation "hellworld", the world has to be out of alignment with the values of its inhabitants in such a way as to cause suffering. Assuming that we can come up with a reasonable definition of suffering, then detecting these kinds of worlds seems relatively straightforward: we can check whether they contain immense amounts of suffering.

2. A world whose inhabitants do not suffer, but which we might consider hellish according to our values. For example, something like a Brave New World scenario, where people generally consider themselves happy but where that happiness comes at the cost of suppressing individuality and promoting superficial pleasures.

It's for detecting an instance of the second case that we need to understand our values better. But it's not clear to me that such a world should qualify as a "hellworld", which to me sounds like a world with negative value. While I don't find the notion of being the inhabitant of a Brave New World particularly appealing, a world where most people are happy but only in a superficial way sounds more like "overall low positive value" than "negative value" to me. Assuming that you've internalized its values and norms, existing in a BNW doesn't seem like a fate worse than death; it just sounds like a future that could have gone better.

Of course, there is an argument that even if a BNW would be okay to its inhabitants once we got there, getting there might cause a lot of suffering: for instance, if there were lots of people who were forced against their will to adapt to the system. Since many of us might find the BNW to be a fate worse than death, conditional on us surviving to live in the BNW, it's a hellworld (at least to us). But again this doesn't seem like it requires a thorough understanding of our values to detect: it just requires detecting the fact that if we survive to live in the BNW, we will experience a lot of suffering due to being in a world which is contrary to our values.

comment by Stuart_Armstrong · 2019-01-30T18:10:46.779Z · score: 4 (None votes) · LW · GW

Assuming that we can come up with a reasonable definition of suffering

Checking whether there is a large amount of suffering in a deliberately obfuscated world seems hard, or impossible if a superintelligence has done the obfuscating.

comment by Kaj_Sotala · 2019-01-30T20:39:04.834Z · score: 3 (None votes) · LW · GW

True, not disputing that. Only saying that it seems like an easier problem than solving human values first, and then checking whether those values are satisfied.

comment by John_Maxwell_IV · 2019-01-30T00:05:44.845Z · score: 2 (None votes) · LW · GW

the set of things that can be described to us without fundamentally changing our values is much smaller still

What's the evidence for this set being "much smaller"?

comment by Stuart_Armstrong · 2019-01-30T18:08:22.951Z · score: 2 (None votes) · LW · GW

Can you imagine sitting through a ten-year lecture without your values changing somewhat in reaction to the content?

comment by Kaj_Sotala · 2019-01-30T20:40:08.055Z · score: 5 (None votes) · LW · GW

This seems like it would mainly affect instrumental values rather than terminal ones.

comment by Stuart_Armstrong · 2019-01-30T21:38:56.309Z · score: 2 (None votes) · LW · GW

In many areas, we have no terminal values until the problem is presented to us; then we develop terminal values (often dependent on how the problem was phrased) and stick to them. E.g. the example with Soviet and American journalists visiting each other's countries.

comment by Dagon · 2019-01-29T20:39:29.336Z · score: 2 (None votes) · LW · GW

I think a little more formal definition of "describable" and "summarizable" would help me understand. I start with a belief that any world is its own best model, so I don't think worlds are describable in full. I may be wrong - world-descriptions may compress incredibly well, and it's possible to describe a world IN ANOTHER WORLD - but fully describing a world inside a subset of that world itself cannot be done.

"summarizable" is more interesting. If it just means "a trustworthy valuation", then fine - it's possible, and devolves to "is there any such thing as a trustworthy summarizer". If it means some other subset of a description, then it may or may not be possible.

comment by Dagon · 2019-01-31T21:36:13.514Z · score: 2 (None votes) · LW · GW

Thinking about other domains, a proof is a summary of (an aspect of) a formal system. It provides a different level of information than is contained in the base axioms. Can we model a "summary" of the suffering level/ratio/hellishness of a world in the same terms? It's not about trusting the agent; it's about the agent finding the subset of information about the world that shows us that the result is true.
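To make the analogy concrete, a tiny illustrative example (not specific to the post): the proof term below is a short, independently checkable certificate for a claim about every natural number. We don't need to trust whoever produced it, only to check it - a "summary" of a world's hellishness would ideally play the same role.

```lean
-- Illustrative only: a short certificate for infinitely many cases, which no
-- finite enumeration of the "world" (here, the natural numbers) could give us.
theorem every_nat_below_its_successor (n : Nat) : n ≤ n + 1 :=
  Nat.le_succ n
```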

comment by William_S · 2019-01-31T20:03:23.882Z · score: 1 (None votes) · LW · GW

Do you think you'd agree with a claim of this form applied to corrigibility of plans/policies/actions?

That is: if some plan/policy/action is incorrigible, then A can provide some description of how the action is incorrigible.

comment by Stuart_Armstrong · 2019-02-01T13:41:28.836Z · score: 2 (None votes) · LW · GW

Given some definition of corrigibility, yes.