The formal goal is a pointer

post by Morphism (pi-rogers) · 2024-05-01T00:27:38.856Z · LW · GW · 10 comments

When I introduce people to plans like QACI [LW · GW], they often have objections like "How is an AI going to do all of the simulating necessary to calculate this?" or "If our technology is good enough to calculate this with any level of precision, we can probably just upload some humans." or just "That's not computable."

I think these kinds of objections are missing the point of formal goal alignment [LW · GW] and maybe even outer alignment [? · GW] in general.

To formally align an ASI to human (or your) values, we do not need to actually know those values. We only need to strongly point to them.

AI will figure out our values. Whether it's aligned or not, a recursively self-improving AI will eventually build a very good model of our values, as part of a total world model that is better than ours in every way.

So (outer) alignment is not about telling the AI our values; the AI will already know them. Alignment is about giving the AI a utility function that strongly points to them.

That means that if we have a process, however intractable or uncomputable, that we know will eventually lead to our CEV, then the AI will know that as well, and it will simply figure out our CEV in a much smarter way and maximize it.
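
To make the "pointer" picture concrete, here is a rough schematic (my own gloss, not the actual QACI formalism). Let $Q$ be a fully specified but intractable deliberation process, such as "the utility function the user would endorse after an arbitrarily long reflection", and define the formal goal as

$$U(w) \;=\; \mathbb{E}_{u \sim P(Q\text{'s output})}\big[\,u(w)\,\big],$$

where the distribution over $Q$'s output is taken under the AI's own world model. The AI never runs $Q$; it estimates $Q$'s output the way it estimates any other unobserved quantity, and a smarter AI simply makes a better estimate.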

Say that we have a formally-aligned AI and give it something like QACI as its formal goal. If QACI works, the AI will quickly think "Oh. This utility function mostly just reduces to human values. Time to build utopia!" If it doesn't work, the AI will quickly think "LOL. These idiot humans tried to point to their values but failed! Time to maximize this other thing instead!"

A good illustration of the success scenario is Tammy's narrative of QACI [LW · GW].[1]

There are lots of problems with QACI (and formal alignment in general), and I will probably make posts about those at some point, but "It's not computable" is not one of them.

  1. ^

    Though, in real life, I think the AI would converge on human values much more quickly, without much (or maybe even any) simulation. ↩︎

10 comments

comment by Wei Dai (Wei_Dai) · 2024-05-01T04:02:47.534Z · LW(p) · GW(p)

Did SBF or Mao Zedong not have a pointer to the right values, or did they have a right pointer but make mistakes due to computational issues (i.e., would they have avoided causing the disasters that they did if they had been smarter and/or had more time to think)? Both seem possible to me, so I'd like to understand how the QACI approach would solve (or rule out) both of these potential problems:

  1. If many humans don't have pointers to the right values, how do we make sure QACI gets its pointer from humans who do?
  2. How do we make sure the AI won't make some catastrophic mistake while it's not yet smart enough to fully understand the values we give it, yet remains confident enough in its short-term guesses to do useful things?

Moral uncertainty is an area in philosophy with ongoing research, and assuming that AI will handle it correctly by default seems unsafe, similar to assuming that AI will have the right decision theory by default.

I see that Tamsin Leake also pointed out (2) above as a potential problem, but I don't see anything that looks like a potential solution in the QACI table of contents [LW · GW].

Replies from: pi-rogers, quetzal_rainbow
comment by Morphism (pi-rogers) · 2024-05-02T02:42:32.202Z · LW(p) · GW(p)

I'm 60% confident that SBF and Mao Zedong (and just about everyone) would converge to nearly the same values (which we call "human values") if they were rational enough and had good enough decision theory.

If I'm wrong, (1) is a huge problem, and the only surefire way to solve it is to actually be the human whose values get extrapolated. Luckily, the de facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.

I think (2) is a very human problem. Due to very weird selection pressure, humans ended up really smart but also really irrational. I think most human evil is caused by a combination of overconfidence wrt our own values and lack of knowledge of things like the unilateralist's curse [? · GW]. An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to destroy everything we value than an empowered human. (Also 60% confident. I would not want to stake the fate of the universe on this claim.)

I agree that moral uncertainty is a very hard problem, but I don't think we humans can do any better on it than an ASI. As long as we give it the right pointer, I think it will handle the rest much better than any human could. Decision theory is a bit different, since you have to put that into the utility function. Dealing with moral uncertainty is just part of expected utility maximization.

To solve (2), I think we should try to adapt something like the Hippocratic principle [LW(p) · GW(p)] to work for QACI, without requiring direct reference to a human's values and beliefs (the sidestepping of which is QACI's big advantage over PreDCA). I wonder if Tammy has thought about this.

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2024-05-02T03:26:35.007Z · LW(p) · GW(p)

Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.

But we could have said the same thing of SBF, before the disaster happened.

Due to very weird selection pressure, humans ended up really smart but also really irrational. [...] An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to destroy everything we value than an empowered human.

Please explain your thinking behind this?

Dealing with moral uncertainty is just part of expected utility maximization.

It's not, because some moral theories are not compatible with EU maximization, and of the ones that are, it's still unclear [LW · GW] how to handle uncertainty between them.

Replies from: pi-rogers
comment by Morphism (pi-rogers) · 2024-05-02T04:01:34.809Z · LW(p) · GW(p)

But we could have said the same thing of SBF, before the disaster happened.

I would honestly be pretty comfortable with maximizing SBF's CEV.

Please explain your thinking behind this?

TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don't incentivize irrationality (like ours did).

Sorry if I was unclear there.

It's not, because some moral theories are not compatible with EU maximization.

I'm pretty confident that my values satisfy the VNM axioms, so those moral theories are almost definitely wrong.

And I think this uncertainty problem can be solved by forcing utility bounds.
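
For concreteness, one standard way to cash that out (a sketch, not a full proposal) is to rescale each candidate utility function to $[0,1]$ and take a credence-weighted mixture:

$$U(w) \;=\; \sum_i p_i \cdot \frac{u_i(w) - \inf u_i}{\sup u_i - \inf u_i},$$

where $p_i$ is the credence assigned to moral theory $i$ and each $u_i$ is bounded. Bounding prevents any single theory from dominating via arbitrarily large stakes, though the choice of normalization is itself a further judgment call.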

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2024-05-02T05:09:40.370Z · LW(p) · GW(p)

I would honestly be pretty comfortable with maximizing SBF’s CEV.

Yikes, I'm not even comfortable maximizing my own CEV. One crux may be that I think a human's values may be context-dependent. In other words, current me-living-in-a-normal-society may have different values from me-given-keys-to-the-universe and should not necessarily trust that version of myself. (Similar to how earlier idealistic Mao shouldn't have trusted his future self.)

My own thinking around this is that we need to advance metaphilosophy and social epistemology, engineer better discussion rules/norms/mechanisms and so on, and design a social process that most people can justifiably trust (i.e., one that is likely to converge to moral truth, or to actual representative human values, or something like that), and then give the AI a pointer to that, not to any individual human's reflection process, which may be mistaken or selfish or skewed.

TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don’t incentivize irrationality (like ours did).

Where is the longer version of this? I do want to read it. :) Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities?

Also, how does RL fit into QACI? Can you point me to where this is discussed?

Replies from: pi-rogers
comment by Morphism (pi-rogers) · 2024-05-02T08:03:53.011Z · LW(p) · GW(p)

Yikes, I'm not even comfortable maximizing my own CEV.

What do you think of this post by Tammy?

Where is the longer version of this? I do want to read it. :)

Well perhaps I should write it :)

Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities?

Mostly that thing where we had a lying vs lie-detecting arms race and the liars mostly won by believing their own lies and that's how we have things like overconfidence bias and self-serving bias and a whole bunch of other biases. I think Yudkowsky and/or Hanson has written about this.

Unless we do a very stupid thing like reading the AI's thoughts and RL-punishing wrongthink, this seems very unlikely to happen.

If we give the AI no reason to self-deceive, the natural instrumentally convergent incentive is to not self-deceive, so it won't self-deceive.

Again, though, I'm not super confident in this. Deep deception [LW · GW] or similar could really screw us over.

Also, how does RL fit into QACI? Can you point me to where this is discussed?

I have no idea how Tammy plans to "train" the inner-aligned singleton on which QACI is implemented, but I think it will be closer to RL than to supervised learning in the ways that matter here.

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2024-05-03T03:27:57.512Z · LW(p) · GW(p)

What do you think of this post by Tammy?

It seems like someone could definitely be wrong about what they want (unless normative anti-realism [LW · GW] is true and such a sentence has no meaning). For example, consider someone who thinks it's really important to be faithful to God, goes to church every Sunday to maintain their faith, and would use a superintelligent religious AI assistant to help keep the faith if they could. Or maybe they're just overconfident about their philosophical abilities and would fail to take various precautions that I think are important in a high-stakes reflective process.

Mostly that thing where we had a lying vs lie-detecting arms race and the liars mostly won by believing their own lies and that’s how we have things like overconfidence bias and self-serving bias and a whole bunch of other biases.

Are you imagining that the RL environment for AIs will be single-player, with no social interactions? If yes, how will they learn social skills? If no, why wouldn't the same thing happen to them?

Unless we do a very stupid thing like reading the AI's thoughts and RL-punishing wrongthink, this seems very unlikely to happen.

We already RL-punish AIs for saying things that we don't like (via RLHF), and in the future will probably punish them for thinking things we don't like (via things like interpretability). Not sure how to avoid this (given current political realities), so safety plans have to somehow take this into account.

comment by quetzal_rainbow · 2024-05-01T19:43:32.449Z · LW(p) · GW(p)

I think the endorsed answer is "QACI as a self-contained field of research is seeking which goal is safe, not how to get an AI to pursue this goal in a robust way". Also, if you can create an AI which makes correct guesses about galaxy-brained universe simulations, you can also create an AI which makes correct guesses about nanotech design, which is kinda exfohazardous.

comment by cubefox · 2024-05-01T07:21:16.310Z · LW(p) · GW(p)

This closely relates to the internalist/description theory of meaning in philosophy. The theory says that if we refer to something, we do so via a mental representation ("meanings are in the head"), which is something we can verbalize as a description. A few decades ago, some philosophers objected that we are often able to refer to things we cannot define, seemingly refuting the internalist theory in favor of an externalist theory ("meanings are not in the head"). For example, we can refer to gold even if we aren't able to define it via its atomic number.

However, the internalist/description theory only requires that there is some description that identifies gold for us, which doesn't necessarily mean we can directly define what gold is. For example, "the yellow metal that was highly valued throughout history and which chemists call 'gold' in English" would be sufficient to identify gold with a description. Another example: you don't know at all what's in the box in front of you, but you can still refer to its contents as "the contents of the box I see in front of me". Referring to things only requires that we can describe them, at least indirectly.

comment by Morphism (pi-rogers) · 2024-05-01T00:33:00.182Z · LW(p) · GW(p)

Edit log:

2024-04-30 19:31 CST: Footnote formatting fix and minor grammar fix.

20:40 CST: "The problem is..." --> "Alignment is..."

22:17 CST: Title changed from "All we need is a pointer" to "The formal goal is a pointer"