A Nonconstructive Existence Proof of Aligned Superintelligence
post by Roko · 2024-09-12T03:20:09.531Z · LW · GW · 78 comments
This is a link post for https://transhumanaxiology.substack.com/p/a-nonconstructive-existence-proof
Over time I have seen many people assert that “Aligned Superintelligence” may not even be possible in principle. I think that is incorrect and I will give a proof - without explicit construction - that it is possible.
Comments sorted by top scores.
comment by RogerDearnaley (roger-d-1) · 2024-09-16T06:48:25.925Z · LW(p) · GW(p)
Here's a quick sketch of a constructive version:
1) build a superintelligence that can model both humans and the world extremely accurately over long time-horizons. It should be approximately-Bayesian, and capable of modelling its own uncertainties, concerning both humans and the world, i.e. capable of executing the scientific method
2) use it to model, across a statistically representative sample of humans, how desirable they would say a specific state of the world X is
3) also model whether the modeled humans are in a state (drunk, sick, addicted, dead, suffering from religious fanaticism, etc) that for humans is negatively correlated with accuracy on evaluative tasks, and decrease the weight of their output accordingly
4) determine whether the humans would change their mind later, after learning more, thinking for longer, experiencing more of X, learning about or experiencing subsequent consequences of state X, etc - if so update their output accordingly
5) implement some chosen (and preferably fair) averaging algorithm over the opinions of the sample of humans
6) sum over the number of humans alive in state X and integrate over time
7) estimate error bars by predicting when and how much the superintelligence and/or the humans it's modelling are operating out of distribution/in areas of Knightian uncertainty (for the humans, about how the world works, and for the superintelligence itself both about how the world works and how humans think), and pessimize over these error bars sufficiently to overcome the Look Elsewhere Effect for the size of your search space, in order to avoid Goodhart's Law [LW · GW]
8) take (or at least well-approximate) argmax of steps 2)-7) over the set of all generically realizable states to locate the optimal state X*
9) determine the most reliable plan to get from the current state to the optimal state X* (allowing for the fact that along the way you will be iterating this process, and learning more, which may affect step 7) in future iterations, thus changing X*, so actually you want to prioritize retaining optionality and reducing prediction uncertainty, which implies you want to do Value Learning [? · GW] to reduce the uncertainty in modelling the humans' opinions)
10) Profit
Now, where were those pesky underpants gnomes?
[Yes, this is basically an approximately-Bayesian upgrade of AIXI with a value learned utility function rather than a hard-coded one. For a more detailed exposition, see my link [LW · GW] above.]
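A minimal Python sketch of how steps 1)-9) might fit together. Every model call here (predict_opinion, competence_weight, reflective_update, person_years_in, knightian_uncertainty_bits) is a hypothetical stand-in rather than anything implementable today, and the specific form of the pessimization penalty is an illustrative assumption, not the scheme worked out in the linked post:

```python
# Toy sketch of steps 1)-9): value learning + pessimism over uncertainty,
# then an (approximate) argmax over candidate world-states.
import math
from dataclasses import dataclass

@dataclass
class Evaluation:
    mean_desirability: float   # weighted, reflectively-updated human rating of a state
    uncertainty_bits: float    # estimated Knightian / out-of-distribution uncertainty

def evaluate_state(world_model, humans, state):
    """Steps 2)-7): model each human's weighted, reflectively updated opinion of `state`."""
    total, weight_sum = 0.0, 0.0
    for h in humans:
        rating = world_model.predict_opinion(h, state)            # step 2
        weight = world_model.competence_weight(h, state)          # step 3: down-weight drunk/sick/etc.
        rating = world_model.reflective_update(h, state, rating)  # step 4: would they change their mind?
        total += weight * rating
        weight_sum += weight
    mean = total / max(weight_sum, 1e-9)                          # step 5: fair averaging
    mean *= world_model.person_years_in(state)                    # step 6: sum over people, integrate over time
    unc = world_model.knightian_uncertainty_bits(state)           # step 7: how far out of distribution?
    return Evaluation(mean, unc)

def pessimized_score(ev, search_space_bits, kappa=1.0):
    """Step 7): pessimize harder the more optimization pressure (bits) the search applies.
    The sqrt scaling is a placeholder assumption, not the rule from the linked post."""
    return ev.mean_desirability - kappa * math.sqrt(search_space_bits) * ev.uncertainty_bits

def choose_target_state(world_model, humans, candidate_states):
    """Step 8): approximate argmax over realizable states, under pessimism."""
    bits = math.log2(max(len(candidate_states), 2))
    scored = [(pessimized_score(evaluate_state(world_model, humans, s), bits), s)
              for s in candidate_states]
    return max(scored, key=lambda pair: pair[0])[1]   # planning toward it (step 9) is omitted
```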
Replies from: Roko↑ comment by Roko · 2024-09-16T12:36:07.148Z · LW(p) · GW(p)
argmax of steps 2)-7) over the set of all generically realizable states
Argmax search is dangerous. If you want something "constructive" I think you probably want to more carefully model the selection process.
Replies from: roger-d-1↑ comment by RogerDearnaley (roger-d-1) · 2024-09-16T18:23:46.855Z · LW(p) · GW(p)
That's the point of step 7)
Replies from: Roko↑ comment by Roko · 2024-09-17T09:35:47.880Z · LW(p) · GW(p)
I'm not particularly sold on the idea of launching a powerful argmax search and then doing a bit of handwaving to fix it.
It's like if you wanted a childminder to look after your young child, and you set off an argmax search over a function that looks like (quality) / (cost), and then afterwards tried to sort out whether your results are somehow broken/Goodharted.
If your argmax search is over 20 local childminders then that's probably fine.
But if it's an argmax search over all possible states of matter occupying an 8 cubic meter volume then... uh yeah that's really dangerous.
Replies from: roger-d-1↑ comment by RogerDearnaley (roger-d-1) · 2024-09-17T22:29:02.944Z · LW(p) · GW(p)
The pessimizing over Knightian uncertainty is a graduated way of telling the model to basically "tend to stay inside the training distribution". Adjusting its strength enough to overcome the Look-Elsewhere Effect means we estimate how many bits of optimization pressure we're applying and then do the pessimizing harder depending on that number of bits, which, yes, is vastly higher for all possible states of matter occupying an 8 cubic meter volume than for a 20-way search (the former is going to be a rather large multiple of Avogadro's number of bits, the latter is just over 4 bits). So we have to stay inside what we believe we know a great deal harder in the former case. In other words, the point you're raising is already addressed, in a quantified way, by the approach I'm outlining. Indeed, on some level the main point of my suggestion is that there is a quantified and theoretically motivated way of dealing with exactly this problem. The handwaving above is just a very brief summary, accompanied by a link to a much more detailed post containing and explaining the details with a good deal less handwaving.
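To make the bit-counting concrete, here is a rough back-of-the-envelope calculation (treating the 8 m³ as water at ordinary density, and "a few bits per molecule", purely as illustrative assumptions):

```python
import math

def search_bits(num_candidates: float) -> float:
    """Bits of selection pressure applied by picking the best of N candidates."""
    return math.log2(num_candidates)

print(round(search_bits(20), 2))   # ~4.32 bits for a 20-way childminder search

# 8 m^3 of water at ~1000 kg/m^3 is about 8e6 grams; at ~18 g/mol:
molecules = 8e6 / 18.0 * 6.022e23
print(f"{molecules:.2e} molecules")   # ~2.7e29
# At even a few bits of configuration per molecule, the "all states of matter"
# search applies on the order of 1e29-1e30 bits of optimization pressure,
# i.e. a large multiple of Avogadro's number, versus ~4 bits for the 20-way search.
```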
Trying to explain this piecemeal in a comments section isn't very efficient: I suggest you go read Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect [LW · GW] for my best attempt at a detailed exposition of this part of the suggestion. If you still have criticisms or concerns after reading that, then I'd love to discuss them there.
↑ comment by Roko · 2024-09-19T09:39:40.269Z · LW(p) · GW(p)
OK, that's a fair point; I'll take a look, but I am still skeptical about being able to do this in practice, because the universe is messy.
e.g. if you're looking for an optimal practical babysitter and you really do start a search over all possible combinations of matter that fit inside a 2x2x2 cube and start futzing with the results of that search I think it will go wrong.
But if you adopt some constructive approach with some empirically grounded heuristics I expect it will work much better. E.g. start with a human. Exclude all males (sorry bros!). Exclude based on certain other demographics which I will not mention on LW. Exclude based on nationality. Do interviews. Do drug tests. Etc.
Your set of states of a 2x2x2 cube of matter will contain all kinds of things that are bad in ways you don't understand.
comment by drocta · 2024-09-16T20:27:43.917Z · LW(p) · GW(p)
If your argument is, "if it is possible for humans to produce some (verbal or mechanical) output, then it is possible for a program/machine to produce that output", then, that's true I suppose?
I don't see why you specified "finite depth boolean circuit".
While it does seem like the number of states for a given region of space is bounded, I'm not sure how relevant this is. Not all possible functions from states to {0,1} (or to some larger discrete set) are implementable as some possible state, for cardinality reasons.
I guess maybe that's why you mentioned the thing along the lines of "assume that some amount of wiggle room is tolerated"?
One thing you say is that the set of superintelligences is a subset of the set of finite-depth boolean circuits. Later, you say that a lookup table is implementable as a finite-depth boolean circuit, and say that some such lookup table is the aligned superintelligence. But, just because it can be expressed as a finite-depth boolean circuit, it does not follow that it is in the set of possible superintelligences. How are you concluding that such a lookup table constitutes a superintelligence? It seems
Now, I don't think that "aligned superintelligence" is logically impossible, or anything like that, and so I expect that there mathematically-exists a possible aligned-superintelligence (if it isn't logically impossible, then by the model existence theorem, there exists a model in which one exists... I guess that doesn't establish that we live in such a model, but whatever).
But I don't find this argument a compelling proof(-sketch).
Replies from: Roko, Roko, Roko↑ comment by Roko · 2024-09-17T09:50:24.058Z · LW(p) · GW(p)
Not all possible functions from states to {0,1} (or to some larger discrete set) are implementable as some possible state, for cardinality reasons
All cardinalities here are finite. The set of generically realizable states is a finite set because they each have a finite and bounded information content description (a list of instructions to realize that state, which is not greater in bits than the number of neurons in all the human brains on Earth).
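A quick order-of-magnitude version of that bound (both figures are rough assumptions, just to show the arithmetic):

```python
neurons_per_brain = 8.6e10        # rough figure for one human brain
humans_on_earth = 8e9             # rough current population
max_description_bits = neurons_per_brain * humans_on_earth   # ~7e20 bits, per the argument above
# So the number of generically realizable states is at most 2**max_description_bits:
# astronomically large, but finite, which is all the argument needs.
print(f"descriptions of at most ~{max_description_bits:.1e} bits, "
      f"so at most 2**{max_description_bits:.1e} states")
```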
Replies from: drocta↑ comment by drocta · 2024-09-22T02:40:45.192Z · LW(p) · GW(p)
Yes, I knew the cardinalities in question were finite. The point applies regardless though. For any set X, there is no injection from 2^X to X. In the finite case, this is 2^n > n for all natural numbers n.
If there are N possible states, then the number of functions from possible states to {0,1} is 2^N , which is more than N, so there is some function from the set of possible states to {0,1} which is not implemented by any state.
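A toy illustration of the counting point, with three stand-in states:

```python
from itertools import product

states = ["s0", "s1", "s2"]                                    # N = 3 toy "states"
boolean_functions = list(product([0, 1], repeat=len(states)))  # each tuple assigns 0/1 to every state
print(len(states), "states, but", len(boolean_functions), "boolean functions on them")  # 3 vs 8
```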
Replies from: Roko↑ comment by Roko · 2024-09-17T09:37:31.977Z · LW(p) · GW(p)
How are you concluding that such a lookup table constitutes a superintelligence?
Isn't it enough that it achieves the best possible outcome? What other criteria do you want a "superintelligence" to have?
Replies from: drocta↑ comment by drocta · 2024-09-22T03:07:07.593Z · LW(p) · GW(p)
Not if the point of the argument is to establish that a superintelligence is compatible with achieving the best possible outcome.
Here is a parody of the issue, which is somewhat unfair and leaves out almost all of your argument, but which I hope makes clear the issue I have in mind:
"Proof that a superintelligence can lead to the best possible outcome: Suppose by some method we achieved the best possible outcome. Then, there's no properties we would want a superintelligence to have beyond that, so let's call however we achieved the best possible outcome, 'a superintelligence'. Then, it is possible to have a superintelligence produce the best possible outcome, QED."
In order for an argument to be compelling for the conclusion "It is possible for a superintelligence to lead to good outcomes", you need to use a meaning of "a superintelligence" in the argument such that the statement, when interpreted with that meaning of "a superintelligence", means what you want that sentence to mean. If I argue "it is possible for a superintelligence, by which I mean a computer with a clock speed faster than N, to lead to good outcomes", then, even if I convincingly argue that a computer with a clock speed faster than N can lead to good outcomes, that shouldn't convince people that a superintelligence, in the sense that they have in mind (presumably not defined as "a computer with a clock speed faster than N"), is compatible with good outcomes.
Now, in your argument you say that a superintelligence would presumably be some computational process. True enough! If you then showed that some predicate is true of every computational process, you would then be justified in concluding that that predicate is (presumably) true of every possible superintelligence. But instead, you seem to have argued that a predicate is true of some computational process, and then concluded that it is therefore true of some possible superintelligence. This does not follow.
Replies from: Roko↑ comment by Roko · 2024-10-07T20:19:59.738Z · LW(p) · GW(p)
The problem with this is that people use the word "superintelligence" without a precise definition. Clearly they mean some computational process. But nobody who uses the term colloquially defines it.
So, I will make the assertion that if a computational process achieves the best possible outcome for you, it is a superintelligence. I don't think anyone would disagree with that.
If you do, please state what other properties you think a "superintelligence" must have other than being a computational process that achieves the best possible outcome.
comment by mishka · 2024-09-12T06:51:25.948Z · LW(p) · GW(p)
The relevance to alignment is that the state you want is the one that is reached.
I think the main problem with the argument in the linked text is that it is too static. One is not looking for a static outcome, one is looking for a process with some properties.
And it might be that the set of properties one wants is contradictory. (I am not talking about my viewpoint, but about a logical possibility.)
For example, it might potentially be the case that there are no processes where superintelligence is present and the chances of "bad" things with "badness" exceeding some large threshold are small (for a given definition of "bad" and "badness"). That might be one possible way to express the conjecture about "impossibility of aligned superintelligence".
(I am not sure how one could usefully explore such a topic, it's all so vague, and we just don't know enough about our reality.)
Replies from: Roko↑ comment by Roko · 2024-09-12T12:06:14.685Z · LW(p) · GW(p)
it might be that the set of properties one wants is contradictory.
So how is that a problem with AI alignment? If you want something that is impossible, it should come as no surprise that an AI cannot achieve it for you.
Replies from: mishka↑ comment by mishka · 2024-09-12T13:33:47.406Z · LW(p) · GW(p)
(I am not talking about my viewpoint, but about a logical possibility.)
If it so happens that the property of the world is such that
there are no processes where superintelligence is present and the chances of "bad" things with "badness" exceeding some large threshold are small
but at the same time world lines where the chances of "bad" things with "badness" exceeding some large threshold are small do exist, then one has to avoid having superintelligence in order to have a chance at keeping probabilities of some particularly bad things low.
That is what people essentially mean when they say "ASI alignment is impossible". The situation where something "good enough" (low chances of certain particularly bad things happening) is only possible in the absence of superintelligence, but is impossible when superintelligence is present.
So, they are talking about a property of the world where certain unacceptable deterioration is necessarily linked to the introduction of superintelligence.
I am not talking about my viewpoint, but about a logical possibility. But I don't think your proof addresses that. In particular, because a directed acyclic graph is not a good model. We need to talk about a process, not a static state, so the model must be recurrent (if it's a directed acyclic graph, it must be applied in a fashion which makes the overall thing recurrent, for example in an autoregressive mode).
And we are talking about superintelligence which is usually assumed to be capable of a good deal of self-modifications and recursive self-improvement, so the model should incorporate that. The statement of "impossibility of sufficiently benign forms of superintelligence" might potentially have a form of a statement of "impossibility of superintelligence which would refrain from certain kinds of self-modification, with those kinds of self-modification having particularly unacceptable consequences".
And it's not enough to draw a graph which refrains from self-modification, because one can argue that a model which agrees to constrain itself in such a radical fashion as to never self-modify in an exploratory fashion is fundamentally not superintelligent (even humans often self-modify when given an opportunity and seeing a potential upside).
Replies from: Roko, Roko↑ comment by Roko · 2024-09-12T14:35:06.431Z · LW(p) · GW(p)
a model which agrees to constrain itself in such a radical fashion as to never self-modify in an exploratory fashion is fundamentally not superintelligent
OK, what is your definition of "superintelligent"?
Replies from: mishka↑ comment by mishka · 2024-09-12T15:16:33.474Z · LW(p) · GW(p)
Being able to beat humans in all endeavours by miles.
That includes the ability to explore novel paths.
Replies from: Roko, mishka↑ comment by Roko · 2024-09-12T15:43:30.300Z · LW(p) · GW(p)
Being able to beat humans
What do you mean by humans? How large a group of humans? Infinite?
Replies from: mishka↑ comment by mishka · 2024-09-12T15:44:12.851Z · LW(p) · GW(p)
10 billion
Replies from: Roko↑ comment by Roko · 2024-09-12T16:08:54.977Z · LW(p) · GW(p)
But then it is possible for an AI to be able to beat up to 10 billion humans in all endeavours by miles, but also not modify itself.
In fact, I can prove that such an AI exists.
So you have two different and contradictory definitions of "superintelligence" that you are using.
Replies from: mishka↑ comment by mishka · 2024-09-12T16:27:00.133Z · LW(p) · GW(p)
A realistic one, which can competently program and can competently do AI research?
Surely, since humans do pretty impressive AI research, a superintelligent AI will do better AI research.
What exactly might (even potentially) prevent it from creating drastically improved variants of itself?
Replies from: Roko↑ comment by Roko · 2024-09-12T16:37:49.591Z · LW(p) · GW(p)
A superintelligence based on the first definition you gave (Being able to beat humans in all endeavours by miles) would be able to beat humans at AI research, but it would also be able to beat humans at not doing AI research.
So, by your own definition, in order to be a superintelligence, it must be able to spend the whole lifetime of the universe not doing AI research.
Replies from: mishka↑ comment by mishka · 2024-09-12T16:41:57.767Z · LW(p) · GW(p)
You mean, a version which decides to sacrifice exploration and self-improvement, despite it being so tempting...
And that after doing quite a bit of exploration and self-improvement (otherwise it would not have gotten to the position of being powerful in the first place).
But then deciding to turn around drastically and become very conservative, and to impose a new "conservative on a new level world order"...
Yes, that is a logical possibility...
Replies from: mishka↑ comment by mishka · 2024-09-12T16:47:16.054Z · LW(p) · GW(p)
Yes, OK.
I doubt that an adequate formal proof is attainable, but a mathematical existence of a "lucky one" is not implausible...
Replies from: mishka↑ comment by mishka · 2024-09-12T16:55:56.319Z · LW(p) · GW(p)
Yes, an informal argument is that if it is way smarter and way more capable than humans, then it potentially should be better at being able to refrain from exercising those capabilities.
In this sense, the theoretical existence of a superintelligence which does not make things worse than they would be without existence of this particular superintelligence seems very plausible, yes... (And it's a good definition of alignment, "aligned == does not make things notably worse".)
Replies from: mishka↑ comment by mishka · 2024-09-12T17:02:37.035Z · LW(p) · GW(p)
so these two considerations
if it is way smarter and way more capable than humans, that it potentially should be better at being able to refrain from exercising the capabilities
and
"aligned == does not make things notably worse"
taken together indeed constitute a nice "informal theorem" that the claim of "aligned superintelligence being impossible" looks wrong. (I went back and added my upvotes to this post, even though I don't think the technique in the linked post is good.)
Replies from: Roko↑ comment by Roko · 2024-09-12T20:56:01.207Z · LW(p) · GW(p)
I don't think the technique in the linked post is good.
why not?
Replies from: mishka↑ comment by mishka · 2024-09-13T06:23:36.693Z · LW(p) · GW(p)
I think I said already.
-
We are not aiming for a state to be reached. We need to maintain some properties of processes extending indefinitely in time. That formalism does not seem to do that. It does not talk about invariant properties of processes and other such things, which one needs to care about when trying to maintain properties of processes.
-
We don't know fundamental physics. We don't know the actual nature of quantum space-time, because quantum gravity is unsolved, we don't know what is "true logic" of the physical world, and so on. There is no reason why one can rely on simple-minded formalisms, on standard Boolean logic, on discrete tables and so on, if one wants to establish something fundamental, when we don't really know the nature of reality we are trying to approximate.
There are a number of reasons a formalization could fail even if it goes as far as proving the results within a theorem prover (which is not the case here). The first and foremost of those reasons is that the formalization might fail to capture reality with a sufficient degree of faithfulness. That is almost certainly the case here.
But then a formal proof (an adequate version of which is likely to be impossible at our current state of knowledge) is not required. A simple informal argument above is more to the point. It's a very simple argument, and so it makes the idea that "aligned superintelligence might be fundamentally impossible" very unlikely to be true.
First of all, one step this informal argument is making is weakening the notion of "being aligned". We are only afraid of "catastrophic misalignment", so let's redefine alignment as something simple which avoids that. An AI which sufficiently takes itself out of action does achieve that. (I actually asked for something a bit stronger, "does not make things notably worse"; that's also not difficult, via the same mechanism of taking oneself sufficiently out of action.)
And a strongly capable AI should be capable of taking itself out of action, of refraining from doing things. The capability to choose is an important capability; a strongly capable system is a system which, in particular, can make choices.
So, yes, a very capable AI system can avoid being catastrophically misaligned, because it can choose to avoid action. This is that non-constructive proof of existence which has been sought. It's an informal proof, but that's fine.
No extra complexity is required, and no extra complexity would make this argument better or more convincing.
Replies from: Roko↑ comment by Roko · 2024-09-13T17:44:53.073Z · LW(p) · GW(p)
We need to maintain some properties of processes extending indefinitely in time. That formalism does not seem to do that.
You can run all the same arguments I used, but talk about processes rather than states.
Replies from: mishka↑ comment by mishka · 2024-09-13T19:34:04.010Z · LW(p) · GW(p)
On one hand, you still assume too much:
Since our best models of physics indicate that there is only a finite amount of computation that can ever be done in our universe
No, nothing like that is at all known. It's not a consensus. There is no consensus that the universe is computable; this is very much a minority viewpoint, and it might always make sense to augment a computer with a (presumably) non-computable element (e.g. a physical random number generator, an analog circuit, a camera, a reader of human real-time input, and so on). AI does not have to be a computable thing; it can be a hybrid. (In fact, when people model real-world computers as Turing machines instead of modeling them as Turing machines with oracles, with the external world being the oracle, it leads to all kinds of problems; e.g. Penrose's well-known "Gödel argument" makes this mistake and falls apart as soon as one remembers the presence of the oracle.)
Other than that...
Yes, you have an interesting notion of alignment. Not something which we might want and which might be possible yet unachievable by mere humans, but something much weaker than that (although not as weak as the version I put forward; my version is super-weak, and your version is intermediate in strength):
I claim then that for any generically realizable desirable outcome that is realizable by a group of human advisors, there must exist some AI which will also realize it.
Yes, this is obviously correct. An ASI can choose to emulate a group of humans and their behavior, and being way more capable than that group of humans, it should be able to emulate that group as precisely as needed.
One does not need to say anything else to establish that.
Replies from: Roko↑ comment by Roko · 2024-09-14T08:22:58.930Z · LW(p) · GW(p)
No, nothing like that is at all known. It's not a consensus
I disagree; modern physics places various bounds on compute, such as the Bekenstein bound.
https://en.wikipedia.org/wiki/Bekenstein_bound
If your objection to my proof involves infinite compute then I am happy to acknowledge that I honestly do not know what happens in that case. It is plausible that since humans are finite in complexity/information/compute, a world with infinite compute would break the symmetry between computers and humans that I am using here. Most likely it means that computers are capable of fundamentally superior outcomes, so there would be "hyperaligned" AIs. But since infinite compute is a minority position I will not pursue it.
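For reference, the Bekenstein bound limits the information content of a bounded physical system to I ≤ 2πRE/(ħc ln 2) bits. A quick numeric sketch for an illustrative 1 kg, 1 m-radius system (standard constants; the example system is arbitrary and not specific to the argument here):

```python
import math

hbar = 1.054571817e-34   # J*s
c = 2.99792458e8         # m/s

def bekenstein_bits(radius_m: float, mass_kg: float) -> float:
    """Upper bound on bits storable in a sphere of given radius containing given mass-energy."""
    energy = mass_kg * c**2
    return 2 * math.pi * radius_m * energy / (hbar * c * math.log(2))

print(f"{bekenstein_bits(1.0, 1.0):.2e} bits")   # ~2.6e43 bits for 1 kg within a 1 m radius
```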
Replies from: mishka↑ comment by mishka · 2024-09-14T16:56:39.440Z · LW(p) · GW(p)
I don't see what the entropy bound has to do with compute. The Bekenstein bound is not much in question, but its link to compute is a different story. It does seem to limit how many bits can be stored in a finite volume (so for a potentially infinite compute an unlimited spatial expansion is needed).
But it does not say anything about possibilities of non-computable processes. It's not clear if "collapse of wave function" is computable, and it is typically assumed not to be computable. So powerful non-Turing-computable oracles seem to likely be available (that's much more than "infinite compute").
But I also think all these technicalities constitute an overkill, I don't see them as at all relevant.
This seems rather obvious regardless of the underlying model:
An ASI can choose to emulate a group of humans and their behavior, and being way more capable than that group of humans, it should be able to emulate that group as precisely as needed.
This seems obviously true, no matter what.
I don't see why a more detailed formalization would help to further increase certainty. Especially when there are so many questions about that formalization.
If the situation were different, if the statement would not be obvious, even a loose formalization might help. But when the statement seems obvious, the standards a formalization needs to satisfy to further increase our certainty in the truth of the statement become really high...
Replies from: Roko↑ comment by Roko · 2024-09-15T13:36:29.099Z · LW(p) · GW(p)
"collapse of wave function" is computable, and it is typically assumed not to be computable
The wavefunction never actually collapses if you believe in MWI. Rather, a classical reality emerges in all branches thanks to decoherence.
If you think something noncomputable happens because of quantum mechanics, it probably means that your interpretation of QM is wrong and you need to read the sequences on that.
Replies from: mishka↑ comment by mishka · 2024-09-15T15:16:49.609Z · LW(p) · GW(p)
If you believe in MWI, then this whole argument is... not "wrong", but very incomplete...
Where is the consideration of branches? What does it mean for one entity to be vastly superior to another, if there are many branches?
If one believes in MWI, then the linked proof does not even start to look like a proof. It obviously considers only a single branch.
And a "subjective navigation" in the branches is not assumed to be computable, even if the "objective multiverse" is computable; that is the whole point of MWI, the "collapse" becomes "subjective navigation", but this does not make it computable. If a consideration is only of a single branch, that branch is not computable, even if it is embedded in a large computable multiverse.
Not every subset of a computable set (say, of a set of natural numbers) is computable.
An interpretation of QM can't be "wrong". It is a completely open research and philosophical question, there is no "right" interpretation, and the Sequences is (thankfully) not a Bible (if even a very respected thinker says something, this does not yet mean that one should accept that without questions).
Replies from: Roko↑ comment by Roko · 2024-09-17T13:40:45.683Z · LW(p) · GW(p)
It obviously considers only a single branch.
Thanks to decoherence, you can just ignore any type of interference and treat each branch as a single classical universe.
Replies from: mishka↑ comment by mishka · 2024-09-17T14:01:38.805Z · LW(p) · GW(p)
I don't think so. If it were classical, we would not be able to observe effects of double-slit experiments and so on.
And, also, there is no notion of "our branch" until one has traveled along it. At any given point in time, there are many branches ahead. Only looking back one can speak about one's branch. But looking forward one can't predict the branch one will end up in. One does not know the results of future "observations"/"measurements". This is not what a classical universe looks like.
(Speaking of MWI, I recall David Deutsch's "Fabric of Reality" very eloquently explaining effects from "neighboring branches". The reason I am referencing this book is that this was the work particularly strongly associated with MWI back then. So I think we should be able to rely on his understanding of MWI.)
Replies from: Roko, Roko↑ comment by Roko · 2024-09-17T19:18:53.998Z · LW(p) · GW(p)
one can't predict the branch one will end up in
yes one can - all of them!
Replies from: mishka↑ comment by mishka · 2024-09-17T21:54:05.673Z · LW(p) · GW(p)
Yes, but then what do you want to prove?
Something like, "for all branches, [...]"? That might be not that easy to prove or even to formulate. In any case, the linked proof has not even started to deal with this.
Something like, "there exist a branch such that [...]"? That might be quite tractable, but probably not enough for practical purposes.
"The probability that one ends up in a branch with such and such properties is no less than/no more than" [...]? Probably something like that, realistically speaking, but this still needs a lot of work, conceptual and mathematical...
Replies from: Roko↑ comment by Roko · 2024-09-21T16:29:13.219Z · LW(p) · GW(p)
bringing QM into this is not helping. All these types of questions are completely generic QM questions and ultimately they come down to the measure ‖|Ψ⟩‖².
Replies from: mishka↑ comment by mishka · 2024-09-21T21:19:15.876Z · LW(p) · GW(p)
It's just... having a proof is supposed to boost our confidence that the conclusion is correct...
if the proof relies on assumptions which are already quite far from the majority opinion about our actual reality (and are probably going to deviate further, as AIs will be better physicists and engineers than us and will leverage the strangeness of our physics much further than we do), then what's the point of that "proof"?
how does having this kind of "proof" increase our confidence in what seems informally correct for a single branch reality (and rather uncertain in a presumed multiverse, but we don't even know if we are in a multiverse, so bringing a multiverse in might, indeed, be one of the possible objections to the statement, but I don't know if one wants to pursue this line of discourse, because it is much more complicated than what we are doing here so far)?
(as an intellectual exercise, a proof like that is still of interest, even under the unrealistic assumption that we live in a computable reality, I would not argue with that; it's still interesting)
↑ comment by Roko · 2024-09-17T19:18:23.082Z · LW(p) · GW(p)
we would not be able to observe effects of double-slit experiments
yes, but thanks to decoherence this generally doesn't affect macroscopic variables. Branches are causally independent once they have split.
Replies from: mishka↑ comment by mishka · 2024-09-17T21:56:21.118Z · LW(p) · GW(p)
No. I can only repeat my reference to Fabric of Reality as a good presentation of MWI and to remind that we do not live in a classical world, which is easy to confirm empirically.
And there are plenty of known macroscopic quantum effects already, and that list will only grow. Lasers are quantum, superfluidity and superconductivity are quantum, and so on.
Replies from: Roko↑ comment by Roko · 2024-09-21T16:31:01.971Z · LW(p) · GW(p)
Decoherence means that different branches don't interfere with each other on macroscopic scales. That's just the way it works.
Superfluids/superconductors/lasers are still microscopic effects that only matter at the scale of atoms or at ultra-low temperature or both.
Replies from: mishka↑ comment by mishka · 2024-09-21T21:23:07.102Z · LW(p) · GW(p)
No, not microscopic.
Coherent light produced by lasers is not microscopic, we see its traces in the air. And we see the consequences (old fashioned holography and the ability to cut things with focused light, even at large distances). Room temperature is fine for that.
Superconductors used in the industry are not microscopic (and the temperatures are high enough to enable industrial use of them in rather common devices such as MRI scanners).
↑ comment by mishka · 2024-09-12T15:25:34.181Z · LW(p) · GW(p)
And I personally think that superintelligence leading to good trajectories is possible. It seems unlikely that we are in a reality where there is a theorem to the contrary.
It feels intuitively likely that it is possible to have superintelligence or the ecosystem of superintelligences which is wise enough to be able to navigate well.
But I doubt that one is likely to be able to formally prove that.
Replies from: mishka↑ comment by mishka · 2024-09-12T15:49:41.615Z · LW(p) · GW(p)
But I doubt that one is likely to be able to formally prove that.
E.g. it is possible that we are in a reality where very cautious and reasonable, but sufficiently advanced experiments in quantum gravity lead to a disaster.
Advanced systems are likely to reach those capabilities, and they might make very reasonable estimates that it's OK to proceed, but due to bad luck of being in a particularly unfortunate reality, the "local neighborhood" might get destroyed as a result... One can't prove that it's not the case...
Whereas, if the level of overall intelligence remains sufficiently low, we might not be able to ever achieve the technical capabilities to get into the danger zone...
It is logically possible that the reality is like that.
Replies from: Roko↑ comment by Roko · 2024-09-12T16:10:51.436Z · LW(p) · GW(p)
It is logically possible that the reality is like that.
Yes, it is. But even if that is the case, by the argument given in this post, there must exist an AI system that avoids the danger zone.
Replies from: mishka↑ comment by mishka · 2024-09-12T16:30:58.148Z · LW(p) · GW(p)
Yes, possibly.
Not by the argument given in the post (considering quantum gravity, one immediately sees how inadequate and unrealistic the model in the post is).
But yes, it is possible that they will be so wise that they will be cautious enough even in a very unfortunate situation.
Yes, I was trying to explicitly refute your claim, but my refutation has holes.
(I don't think you have a valid proof, but this is not yet a counterexample.)
↑ comment by Roko · 2024-09-12T14:36:25.664Z · LW(p) · GW(p)
there are no processes where superintelligence is present and the chances of "bad" things with "badness" exceeding some large threshold are small
Do you think a team of sufficiently wise humans is capable of producing a world where the chances of "bad" things with "badness" exceeding some large threshold are small? Yes or no?
Replies from: mishka↑ comment by mishka · 2024-09-12T15:14:36.498Z · LW(p) · GW(p)
(I am not talking about my viewpoint, but about a logical possibility.)
In particular, humans might be able to refrain from screwing the world too badly, if they avoid certain paths.
(No, personally I don't think so. If people crack down hard enough, they probably screw up the world pretty badly due to the crackdown, and if they don't crack down hard enough, then people will explore various paths leading to bad trajectories, via superintelligence or via other more mundane means. I personally don't see a safe path, and I don't know how to estimate probabilities. But it is not a logical impossibility. E.g. if someone makes all humans dumb by putting a magic irreversible stupidifier in the air and water, perhaps those things can be avoided, hence it is logically possible. Do I want "safety" at this price? No, I think it's better to take risks...)
Replies from: Roko↑ comment by Roko · 2024-09-12T15:42:56.211Z · LW(p) · GW(p)
humans might be able to refrain from screwing the world too badly
But then, if a team of humans is capable of producing a world where the chances of "bad" things with "badness" exceeding some large threshold are small, by exactly the argument given in this post there must be a Lookup Table which simply contains the same boolean function.
So, your claim is provably false. It is not possible for something (anything) to be generically achievable by humans but not by AI, and you're just hitting a special case of that.
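A minimal sketch of the kind of lookup-table construction being appealed to here: if the human advisors implement some finite function from observation histories to outputs, a table mapping each history to the same output realizes the same function. `human_policy` is a hypothetical stand-in for whatever the human advisors would output, not anything from the linked post:

```python
def build_lookup_table(human_policy, all_finite_histories):
    """Tabulate the humans' (finite) input-output behaviour."""
    return {history: human_policy(history) for history in all_finite_histories}

def lookup_table_ai(table):
    """An 'AI' that just replays the tabulated behaviour."""
    return lambda history: table[history]

# Toy usage: two-bit observation histories, humans output the AND of the bits.
histories = [(a, b) for a in (0, 1) for b in (0, 1)]
table = build_lookup_table(lambda h: h[0] & h[1], histories)
ai = lookup_table_ai(table)
assert all(ai(h) == (h[0] & h[1]) for h in histories)
```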
Replies from: mishka↑ comment by mishka · 2024-09-12T15:45:48.168Z · LW(p) · GW(p)
No, they are not "producing". They are just being impotent enough. Things are happening on their own...
And I don't believe a Lookup Table is a good model.
Replies from: Roko↑ comment by Roko · 2024-09-12T16:06:39.680Z · LW(p) · GW(p)
They are just being impotent enough
An AI can also be impotent. Surely this is obvious to you? Have you not thought this through properly?
Replies from: mishka↑ comment by mishka · 2024-09-12T16:24:07.964Z · LW(p) · GW(p)
It can. Then it is not "superintelligence".
Superintelligence is capable of almost unlimited self-improvement.
(Even our miserable recursive self-improvement AI experiments show rather impressive results before saturating. Well, they will not keep saturating forever. Currently, this self-improvement typically happens via rather awkward and semi-competent generation of novel Python code. Soon it will be done by better means (which we probably should not discuss here).)
Replies from: Roko↑ comment by Roko · 2024-09-12T20:41:23.224Z · LW(p) · GW(p)
By your own definition of "superintelligence", it must be better at "being impotent" than any group of humans less than 10 billion. So it must be super-good at being impotent and doing very little, if that is required.
Replies from: mishka↑ comment by mishka · 2024-09-13T05:54:00.272Z · LW(p) · GW(p)
Being impotent is not a property of "being good". One is not aiming for that.
It's just a limitation. One usually does not self-impose it (with rare exceptions), although one might want to impose it on adversaries.
"Being impotent" is always worse. One can't be "better at it".
One can be better at refraining from exercising the capability (we have a different branch in this discussion for that).
Replies from: Roko↑ comment by Roko · 2024-09-13T17:44:11.772Z · LW(p) · GW(p)
One can be better at refraining from exercising the capability
If that is what is needed then it must (by definition) be better at it
Replies from: mishka↑ comment by mishka · 2024-09-13T19:38:50.848Z · LW(p) · GW(p)
Not if it is disabling.
If it is disabling, then one has a self-contradictory situation (if ASI fundamentally disables itself, then it stops being more capable, and stops being an ASI, and can't keep exercising its superiority; it's the same as if it self-destructs).
Replies from: Roko↑ comment by Roko · 2024-09-13T19:52:36.895Z · LW(p) · GW(p)
If a superintelligence is worse than a human at permanently disabling itself - given that as the only required task - then there is a task that it is subhuman at and therefore not a superintelligence.
Replies from: Roko↑ comment by Roko · 2024-09-13T19:56:23.280Z · LW(p) · GW(p)
I suppose you could make some modifications to your definition to take account of this. But in any case, I think it's not a great definition, as it makes an implicit assumption about the structure of problems (that basically problems have a single "scalar" difficulty).
Replies from: mishka
comment by Anon User (anon-user) · 2024-09-12T05:02:25.725Z · LW(p) · GW(p)
Your proof actually fails to fully account for the fact that any ASI must actually exist in the world. It would affect the world other than just through its outputs - e.g. if its computation produces heat, that heat would also affect the world. Your proof does not show that the sum of all effects of the ASI on the world (both intentional + side-effects of it performing its computation) could be aligned. Further, real computation takes time - it's not enough for the aligned ASI to produce the right output, it also needs to produce it at the right time. You did not prove it to be possible.
Replies from: Roko, Roko↑ comment by Roko · 2024-09-12T12:09:44.717Z · LW(p) · GW(p)
it's not enough for the aligned ASI to produce the right output, it also needs to produce it at the right time
Yes, but again this is a mathematical object so it has effectively infinitely fast compute. But I can also prove that FA:BGROW - FA for "functional approximation" - will require less thinking time than human brains.
↑ comment by Roko · 2024-09-12T12:08:23.753Z · LW(p) · GW(p)
fact that any ASI must actually exist in the world
It's a mathematical existence proof that the ASI exists as a mathematical object, so this part is not necessary. However, I can also argue quite convincingly that an ASI similar to LT:BGROW (let's call it FA:BGROW - FA for "functional approximation") must easily fit in the world and also emit less waste heat than a team of human advisors.
Replies from: anon-user↑ comment by Anon User (anon-user) · 2024-09-27T17:00:36.174Z · LW(p) · GW(p)
Perhaps you are missing the point of what I am saying here somewhat? The issue is not the scale of the side-effect of a computation, it's the fact that the side-effect exists, so any accurate mathematical abstraction of an actual real-world ASI must be prepared to deal with solving a self-referential equation.
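A toy rendering of that self-referential structure, with every function a hypothetical stand-in, just to show the shape of the fixed-point problem: the plan the AI outputs depends on a world state that itself includes the side effects of computing that plan.

```python
def side_effects(plan):
    return 0.1 * len(plan)            # e.g. waste heat grows with computation done (a stand-in)

def best_plan_given(world_state):
    return "x" * int(world_state)     # stand-in planner

def solve_self_referential(base_world, iterations=50):
    """Naive fixed-point iteration for plan = best_plan_given(world(plan))."""
    plan = ""
    for _ in range(iterations):
        plan = best_plan_given(base_world + side_effects(plan))
    return plan

print(len(solve_self_referential(10.0)))   # settles on a plan consistent with its own side effects
```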
Replies from: Roko
comment by Wei Dai (Wei_Dai) · 2024-09-28T20:55:36.630Z · LW(p) · GW(p)
Over time I have seen many people assert that “Aligned Superintelligence” may not even be possible in principle. I think that is incorrect and I will give a proof - without explicit construction - that it is possible.
The meta problem here is that you gave a "proof" (in quotes because I haven't verified it myself as correct) using your own definitions of "aligned" and "superintelligence", but if people asserting that it's not possible in principle have different definitions in mind, then you haven't actually shown them to be incorrect.
Replies from: Roko
comment by xpym · 2024-09-16T10:20:06.830Z · LW(p) · GW(p)
We’ll say that a state is in fact reachable if a group of humans could in principle take actions with actuators - hands, vocal chords, etc - that could realize that state.
The main issue here is that groups of humans may in principle be capable of great many things, but there's a vast chasm between "in principle" and "in practice". A superintelligence worthy of the name would likely be able to come up with plans that we wouldn't in practice be able to even check exhaustively, which is the sort of issue that we want alignment for.
Replies from: Roko↑ comment by Roko · 2024-09-16T12:33:20.813Z · LW(p) · GW(p)
This is not a problem for my argument. I am merely showing that any state reachable by humans, must also be reachable by AIs. It is fine if AIs can reach more states.
Replies from: xpym↑ comment by xpym · 2024-09-16T12:57:55.700Z · LW(p) · GW(p)
Hmm, right. You only need assume that there are coherent reachable desirable outcomes. I'm doubtful that such an assumption holds, but most people probably aren't.
Replies from: Roko↑ comment by Roko · 2024-09-17T09:50:57.208Z · LW(p) · GW(p)
I'm doubtful that such an assumption holds
Why?
Replies from: xpym↑ comment by xpym · 2024-09-19T08:51:47.915Z · LW(p) · GW(p)
Because humans have incoherent preferences, and it's unclear whether a universal resolution procedure is achievable. I like how Richard Ngo put it, "there’s no canonical way to scale me up" [LW · GW].
Replies from: Roko↑ comment by Roko · 2024-09-19T17:09:05.224Z · LW(p) · GW(p)
humans have incoherent preferences
This isn't really a problem with alignment so there's no need to address it here. Alignment means the transmission of a preference ordering to an action sequence. Lacking a coherent preference ordering for states of the universe (or histories, for that matter) is not an alignment problem.
Replies from: xpym↑ comment by xpym · 2024-09-20T09:19:46.540Z · LW(p) · GW(p)
This isn’t really a problem with alignment
I'd rather put it that resolving that problem is a prerequisite for the notion of "alignment problem" to be meaningful in the first place. It's not technically a contradiction to have an "aligned" superintelligence that does nothing, but clearly nobody would in practice be satisfied with that.
Replies from: Roko