0 comments

Comments sorted by top scores.

comment by MSRayne · 2022-08-13T12:25:12.182Z · LW(p) · GW(p)

This shows a blind spot in my thinking (or in other people's?) - I've always passively assumed we'd end up turning the whole universe into computronium and living entirely as uploads anyway, so I never would have thought "hey, let's separate the program from the implementation" - I only ever have thought about the program (what virtual world I actually want to get instantiated) and never about the implementation to begin with - nor did I ever consider that people might have values over the state of the physical world itself, rather than over their own subjective experience. (Well, it's obvious bioconservatives would, but I don't expect them to win the universe.)

comment by Vladimir_Nesov · 2022-08-13T14:04:10.479Z · LW(p) · GW(p)

This idea has been around for some time, known as indirect normativity. The variant you describe is also my own preferred formulation [LW(p) · GW(p)] at the time. For a few years it was a major motivation for studying decision theory for me, since this still needs the outer AGI to actually run the program, and ideally also yield control to that program eventually, when the program figures out its values and can slot them in as the values of the outer AGI.

This doesn't work out for several reasons. We don't actually have a way of creating the goal program. The most straightforward thing would be to use an upload, but that probably can't be done [LW(p) · GW(p)] before AGIs.

If we do have a sensible human imitation, then the thing to do with it is to build an HCH [LW(p) · GW(p)] that pushes the goodhart boundary [LW(p) · GW(p)] of that human imitation and allows stronger optimization of the world that doesn't break down our ability to assess its value. This gives the first aligned AGI directly, without turning the world into computronium.

Even if we did make a goal program, it's still unknown how to build an AGI that is motivated to compute it, or to follow the goals it outputs. It's not even known what kind of thing goals are, the type signature of that program that's needed to communicate the goals from the goal program to the outer AGI.

Even if we mostly knew how to build the outer AGI that runs a goal program (though with some confusion around the notion of goals still remaining), it's unclear that there are normative goals for humanity that are goals in a sense similar to a utility function in expected utility maximization, goals for a strong optimizer. We might want to discover such goals with [LW(p) · GW(p)] reflection [LW(p) · GW(p)], but that doesn't necessarily reach a conclusion, as reflection is unbounded [LW(p) · GW(p)].

More likely, there is just a sequence of increasingly accurate proxy goals with increasingly wide goodhart boundaries, instructing a mild optimizer how to act on states of the world it is able to assess. But then the outer AGI must already be a mild optimizer and not a predatory mature optimizer [LW · GW] that ignores all boundaries of what's acceptable [LW(p) · GW(p)] in pursuit of the goal it knows (in this case, the goal program).

This sets up motivation for what I currently see as valuable on the decision theory side: figuring out a principled way of doing mild optimization (there's only quantilization [? · GW] in this space at the moment). It should probably take something like goodhart boundary as a fundamental ingredient of its operation (it seems related to the base distribution of quantilization), the kind of thing that's traditionally missing [LW · GW] from decision theory.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2022-10-24T14:31:08.939Z · LW(p) · GW(p)

Even if we did make a goal program, it's still unknown how to build an AGI that is motivated to compute it, or to follow the goals it outputs.

Actually, it is (to a 0th approximation) known how to build an AGI that is motivated to compute it: use infra-Bayesian physicalism [LW · GW]. The loss function in IBP already has the semantics "which programs should run". Following the goal it outputs is also formalizable within IBP, but even without this step we can just have utopia inside the goal program itself^[1].

We should be careful to prevent the inhabitants of the virtual utopia from creating unaligned AI which eats the utopia. This sounds achievable, assuming the premise that we can actually construct such programs. ↩︎

comment by Chris_Leong · 2023-02-07T03:03:03.994Z · LW(p) · GW(p)

Oh wow, it's fascinating to see someone actually investigating this proposal. (I had a similar idea, but only posted it in the EA meme group).