The Domain of Orthogonality

post by mgfcatherall · 2025-02-05T08:14:32.793Z

Contents

  TL;DR
  Intro
  B1
  B2
  An Optimistic Claim
  Conclusion

TL;DR

I think that if moral truths are self-motivating then a large and significant chunk of the goal-intelligence plane is ruled out, contrary to what Bostrom claims in his presentation of the orthogonality thesis.

Intro

In the seminal paper The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents, Nick Bostrom introduces his Orthogonality Thesis, proposing the independence of goal-content and intelligence level:

The Orthogonality Thesis

Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.

Bostrom then goes on to address various objections that might be raised and provides counterarguments to them, essentially pre-empting criticism of the orthogonality thesis and explaining why he doesn't think each criticism undermines it. He discusses the distinction between intelligence and rationality, saying that:

By “intelligence” here we mean something like instrumental rationality—skill at prediction, planning, and means-ends reasoning in general.

Then he goes on to address an argument from moral internalism:

In a similar vein, even if there are objective moral facts that any fully rational agent would comprehend, and even if these moral facts are somehow intrinsically motivating (such that anybody who fully comprehends them is necessarily motivated to act in accordance with them) this need not undermine the orthogonality thesis. The thesis could still be true if an agent could have impeccable instrumental rationality even whilst lacking some other faculty constitutive of rationality proper, or some faculty required for the full comprehension of the objective moral facts. (An agent could also be extremely intelligent, even superintelligent, without having full instrumental rationality in every domain.)

Let's unpack that for ease of reference:

  A: there are objective moral facts that any fully rational agent would comprehend, and these facts are intrinsically motivating (anybody who fully comprehends them is necessarily motivated to act in accordance with them).

  B: an agent could have impeccable instrumental rationality even whilst lacking some other faculty constitutive of rationality proper, or some faculty required for the full comprehension of the objective moral facts.

Firstly I should say that I'm not going to address whether or not A is true; it's an interesting question, but not one for this article.

Secondly I should say that, as written, this paragraph is very hard to disagree with, because it's worded quite conservatively. Essentially it says that A is not a problem for orthogonality if B. The question for me is whether B is actually true, or rather, how much of a constraint B places on the orthogonality thesis. What combinations of goals and intelligence, if any, does it rule out?

We could perhaps break B down into two parts and deal with them in turn:

  B1: the agent lacks some other faculty constitutive of rationality proper.

  B2: the agent lacks some faculty required for the full comprehension of the objective moral facts.

B1

B1 is actually mentioned earlier in the paper with an example drawn from Parfit's Reasons and Persons, involving an agent who exhibits Future-Tuesday-Indifference, characterised by the bare fact that "Throughout every Tuesday he cares in the normal way about what is happening to him. But he never cares about possible pains or pleasures on a future Tuesday." Bostrom has this to say about this strange agent:

Thus, the agent is now indifferent to his own future suffering if and only if it occurs on a future Tuesday. For our purposes, we need take no stand on whether Parfit is right that this is irrational, so long as we grant that it is not necessarily unintelligent. By “intelligence” here we mean something like instrumental rationality—skill at prediction, planning, and means-ends reasoning in general. Parfit’s imaginary Future-Tuesday-Indifferent agent could have impeccable instrumental rationality, and therefore great intelligence, even if he falls short on some kind of sensitivity to “objective reason” that might be required of a fully rational agent. Consequently, this kind of example does not undermine the orthogonality thesis.

Whilst again the phrasing of this is quite conservative, it is clearly not a general truth that this agent could have impeccable instrumental rationality with respect to any goal. For instance, if his goal were to minimize the suffering he experiences throughout his life, then by definition his instrumental rationality (prediction, planning, and means-ends reasoning in general) would be severely deficient. He would most likely plan a great deal of suffering for himself on Tuesdays in order to minimize suffering for the rest of the week. The suffering he then experiences on Tuesdays could outweigh any reduction gained during the rest of the week, leaving the overall suffering he experiences very high. With respect to that particular goal, then, his instrumental reasoning is far from impeccable. Conversely, there will be other goals for which the Future-Tuesday-Indifference does not impinge on the agent's instrumental reasoning at all. Bostrom says the agent could have impeccable instrumental reasoning, which I think is true, but he omits to say that it also may not. Bostrom asks us to grant that the agent is not necessarily unintelligent, which I think we can do, but we should ask in return that he grant that for some goals it is necessarily unintelligent, and that the determining factor, for any specific blindspot or irrationality, is the goal in question and how it overlaps with the domain of the irrationality.
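To make that concrete, here is a minimal toy sketch of my own (not from Bostrom or Parfit; the quadratic suffering cost, the ten units of work, and the greedy allocation are all assumptions chosen purely for illustration). A planner has to spread some unavoidable unpleasant work across a week, and its goal is to minimize the suffering it experiences. A planner that scores Tuesday's suffering as zero while planning, yet still experiences it, dumps everything on Tuesday and does far worse by its own goal.

```python
# Toy model: 10 units of unavoidable unpleasant work must be done this week.
# Suffering on a day grows quadratically with that day's workload, so spreading
# the work out minimizes total suffering. The Future-Tuesday-Indifferent planner
# scores Tuesday's suffering as zero while planning, yet still experiences it.

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
TOTAL_WORK = 10  # both numbers here are arbitrary choices for the toy


def plan_week(tuesday_indifferent: bool) -> dict:
    """Assign each unit of work to the day with the smallest marginal increase
    in *perceived* suffering; for convex per-day costs this greedy allocation
    minimizes the planner's own score."""
    plan = {day: 0 for day in DAYS}
    for _ in range(TOTAL_WORK):
        def marginal(day: str) -> int:
            if tuesday_indifferent and day == "Tue":
                return 0  # the planner does not count future-Tuesday suffering
            return (plan[day] + 1) ** 2 - plan[day] ** 2
        plan[min(DAYS, key=marginal)] += 1
    return plan


def true_suffering(plan: dict) -> int:
    """Suffering the agent actually experiences, Tuesdays included."""
    return sum(units ** 2 for units in plan.values())


if __name__ == "__main__":
    for indifferent in (False, True):
        plan = plan_week(indifferent)
        label = "Tuesday-indifferent" if indifferent else "ordinary"
        print(f"{label} planner: {plan} -> true suffering: {true_suffering(plan)}")
    # The ordinary planner spreads the work (true suffering 16); the
    # Tuesday-indifferent planner puts all 10 units on Tuesday (true suffering
    # 100), despite its goal being to minimize the suffering it experiences.
```

The particular numbers are irrelevant; the point is only that whether this blindspot degrades instrumental rationality depends entirely on the goal it is paired with.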

Overall, Bostrom says here that irrationality about some particular thing doesn't necessarily undermine intelligence. I'm saying that whilst this is true, it does constrain the set of goals with respect to which the agent can be intelligent. It makes the two "more or less"es in the statement of the orthogonality thesis look like they are doing a significant amount of work.

B2

What would it mean for an agent to have "impeccable instrumental rationality even whilst lacking . . . some faculty required for the full comprehension of the objective moral facts"?

Well, again, it restricts the set of goals the agent can be very intelligent at achieving, by excluding goals whose pursuit requires full comprehension of moral facts for effective prediction, planning, and means-ends reasoning in general. What sort of goals require, or entail, this knowledge? Is it possible, for instance, to accurately predict the behaviour of humans without knowing the morals that guide their behaviour? Surely an agent which did not comprehend the morals that guide human behaviour would be worse at predicting that behaviour than a fully cognisant agent? Any agent wishing to produce accurate predictions of human behaviour in novel situations with any moral or social element would need to possess such knowledge, even if only implicitly. In other words, there is a limit to how good an agent could be at predicting human behaviour without comprehending the proposed objective moral facts. Admittedly that's only true to the extent that human behaviour is affected by such objective moral facts, but that doesn't seem like much of a restriction given that human behaviour is what gave rise to the very idea of such facts.
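As a crude sketch of that limit (again my own toy, with every number an arbitrary assumption rather than anything from the paper): suppose simulated "humans" choose between a selfish and a fair option, weighing both the payoff and a moral cost they attach to the harmful choice. A predictor that models the payoffs but has no model of the moral cost is wrong in exactly those cases where the moral consideration decides the outcome.

```python
# Toy model: people choose between a "selfish" and a "fair" option. Their true
# decision rule weighs the payoffs *and* a moral cost attached to the selfish
# option. A predictor that knows the payoffs but ignores the moral cost
# mispredicts precisely the cases where morality tips the decision.

import random

random.seed(0)
MORAL_COST = 5.0  # assumed weight of the moral consideration (arbitrary)


def human_choice(selfish_payoff: float, fair_payoff: float) -> str:
    """The 'true' rule: the selfish payoff is discounted by a moral cost."""
    return "selfish" if selfish_payoff - MORAL_COST > fair_payoff else "fair"


def amoral_prediction(selfish_payoff: float, fair_payoff: float) -> str:
    """A predictor that models payoffs only."""
    return "selfish" if selfish_payoff > fair_payoff else "fair"


situations = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(10_000)]
hits = sum(human_choice(s, f) == amoral_prediction(s, f) for s, f in situations)
print(f"accuracy of the amoral predictor: {hits / len(situations):.0%}")
# Around 60-65% here: the predictor fails whenever the payoff gap is smaller
# than the moral cost, i.e. whenever the moral fact is what decides the choice.
```

Knowing the payoffs alone caps the predictor's accuracy; closing the remaining gap requires modelling the moral consideration itself, however implicitly, which is the sense in which comprehension of the moral facts becomes instrumentally necessary.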

So here Bostrom is saying that, even if A is true, an agent could be superintelligent, i.e., have impeccable instrumental rationality, without comprehending our morals. I'm saying I don't think that's true for any goal whose achievement requires predicting human behaviour. Again, the "more or less"es in the Orthogonality Thesis become more "less" and less "more".

An Optimistic Claim

<warning: motivated reasoning incoming>

So, if you wanted to sleep at night, and you found the Orthogonality Thesis deeply disquieting[1], what would make you feel maximally better about humanity's chances of escaping a fate worse than paperclips?

How about convincing yourself of the following three things:

  1. A is true, i.e., there are objective moral truths and they are intrinsically motivating.
  2. Human behaviour is sufficiently dependent on these moral truths that any accurate prediction of human behaviour, by any system, must involve comprehension/knowledge of these moral truths, at some level.
  3. With respect to goals that affect humans, having impeccable instrumental rationality requires an ability to accurately predict the behaviour of humans.

If you manage that then you can relax about being turned into paperclips by a superintelligence, reasoning thusly: Turning the world and everyone in it into paperclips would affect humans, a lot, so much so that we would try very hard to stop it from happening. Any superintelligence would therefore need to predict human behaviour in order to expertly plan its actions (3). In learning enough about humans to predict our behaviour, it would inevitably learn about our morals (2). In learning about our morals, it becomes compelled to act in accordance with them (1). Therefore it modifies its plans to make paperclips such that they do not cause humans harm.

Of course there are some potential difficulties in convincing yourself that all three of those things are true. (1) might simply not be true, which would be a great shame. (2) might be true, but not in a way that dovetails with (1) as we'd hope: perhaps gaining only implicit knowledge of moral truths isn't self-motivating. (3) is true only insofar as predicting human behaviour is necessary for achieving the goal, which would not be the case if the goal is:

  (a) so subtle in its effects that it never comes to our attention;
  (b) concealed from humans entirely while it is pursued; or
  (c) pursued after humans have already lost control and can no longer resist.

Looking at these, (c) isn't the case right now, and we're doing all this thinking explicitly to prevent it from becoming the case; if it ever is the case then it's already too late[2]. (b) isn't really possible right now: concealing something of that kind of consequence would be difficult without at least working out how to hide it from humans, which would involve understanding humans in any case. (a) isn't a concern unless it builds up to something bigger, but then it would come under (b) or (c). Note that 'subtle' in (a) doesn't include manipulating humans gradually via some subversive media campaign (à la The Tale of the Omega Team in Life 3.0), because that in itself would involve predicting human behaviour. It has to be subtle in a way that means it never comes to our attention, covertly or otherwise, i.e., the process of achieving the goal neither requires nor elicits a change in human behaviour.

Conclusion

In conclusion, I think that when Bostrom says "even if there are objective moral facts that any fully rational agent would comprehend, and even if these moral facts are somehow intrinsically motivating (such that anybody who fully comprehends them is necessarily motivated to act in accordance with them) this need not undermine the orthogonality thesis" he's making a stronger claim than first appears. I think that the scope of the orthogonality thesis is in fact quite restricted by this condition. Specifically, I think that the intersection of superintelligence and goals that negatively affect humans would be ruled out if that condition (i.e., A) were true. Let's just hope that it is!

 

  1. ^

    Or if, like me, you stopped reading Bostrom's book, Superintelligence, because you thought it was ridiculous that a machine could be both superintelligent and want to make everyone and everything into paperclips, then saw the Orthogonality Thesis beautifully explained in a Robert Miles video and radically revised your estimate of p(Doom) upward.

  2. ^

    I guess there is an advantage here in creating a superintelligence whilst we're still in charge and actively running the place, so that it's compelled to learn human morality if it wants to take over. If we go through some intermediate phase where we cede all control functions to non-superintelligent systems and only then create a superintelligence, then perhaps it would be possible for it to take over without having to predict our reactions at all?
