The Obliqueness Thesis

post by jessicata (jessica.liu.taylor) · 2024-09-19T00:26:30.677Z · LW · GW · 19 comments

  Bayes/VNM point against Orthogonality
  Belief/value duality
  Logical uncertainty as a model for bounded rationality
  Naive belief/value factorizations lead to optimization daemons
  Intelligence changes the ontology values are expressed in
  Intelligence leads to recognizing value-relevant symmetries
  Human brains don't seem to neatly factorize
  Models of ASI should start with realism
  On Yudkowsky's arguments
  Conclusion
19 comments

In my Xenosystems review, I discussed the Orthogonality Thesis, concluding that it was a bad metaphor. It's a long post, though, and the comments on orthogonality build on other Xenosystems content. Therefore, I think it may be helpful to present a more concentrated discussion on Orthogonality, contrasting Orthogonality with my own view, without introducing dependencies on Land's views. (Land gets credit for inspiring many of these thoughts, of course, but I'm presenting my views as my own here.)

First, let's define the Orthogonality Thesis. Quoting Superintelligence for Bostrom's formulation:

Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.

To me, the main ambiguity about what this is saying is the "could in principle" part; maybe, for any level of intelligence and any final goal, there exists (in the mathematical sense) an agent combining those, but some combinations are much more natural and statistically likely than others. Let's consider Yudkowsky's formulations as alternatives. Quoting Arbital:

The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.

The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in the existence of an intelligent agent that pursues a goal, above and beyond the computational tractability of that goal.

As an example of the computational tractability consideration, sufficiently complex goals may only be well-represented by sufficiently intelligent agents. "Complication" may be reflected in, for example, code complexity; to my mind, the strong form implies that the code complexity of an agent with a given level of intelligence and goals is approximately the code complexity of the intelligence plus the code complexity of the goal specification, plus a constant. Code complexity would influence statistical likelihood for the usual Kolmogorov/Solomonoff reasons, of course.

I think, overall, it is more productive to examine Yudkowsky's formulation than Bostrom's, as he has already helpfully factored the thesis into weak and strong forms. Therefore, by criticizing Yudkowsky's formulations, I am less likely to be criticizing a strawman. I will use "Weak Orthogonality" to refer to Yudkowsky's "Orthogonality Thesis" and "Strong Orthogonality" to refer to Yudkowsky's "strong form of the Orthogonality Thesis".

Land, alternatively, describes a "diagonal" between intelligence and goals as an alternative to orthogonality, but I don't see a specific formulation of a "Diagonality Thesis" on his part. Here's a possible formulation:

Diagonality Thesis: Final goals tend to converge to a point as intelligence increases.

The main criticism of this thesis is that formulations of ideal agency, in the form of Bayesianism and VNM utility, leave open free parameters, e.g. priors over un-testable propositions, and the utility function. Since I expect few readers to accept the Diagonality Thesis, I will not concentrate on criticizing it.

What about my own view? I like Tsvi's naming of it as an "obliqueness thesis".

Obliqueness Thesis: The Diagonality Thesis and the Strong Orthogonality Thesis are false. Agents do not tend to factorize into an Orthogonal value-like component and a Diagonal belief-like component; rather, there are Oblique components that do not factorize neatly.

(Here, by Orthogonal I mean basically independent of intelligence, and by Diagonal I mean converging to a point in the limit of intelligence.)

While I will address Yudkowsky's arguments for the Orthogonality Thesis, I think arguing directly for my view first will be more helpful. In general, arguments for and against the Orthogonality Thesis do not seem to me to be mathematically rigorous, so I don't need a mathematically rigorous case to contribute relevant considerations. I will therefore treat intuitive arguments as relevant, and present multiple arguments rather than a single sequential argument (as I did with the more rigorous argument for many worlds).

Bayes/VNM point against Orthogonality

Some people may think that the free parameters in Bayes/VNM point towards the Orthogonality Thesis being true. I think, rather, that they point against Orthogonality. While they do function as arguments against the Diagonality Thesis, this is insufficient for Orthogonality.

First, on the relationship between intelligence and bounded rationality. It's meaningless to talk about intelligence without a notion of bounded rationality. Perfect rationality in a complex environment is computationally intractable. With lower intelligence, bounded rationality is necessary. So, at non-extreme intelligence levels, the Orthogonality Thesis must be making a case that boundedly rational agents can have any computationally tractable goal.

Bayesianism and VNM expected utility optimization are known to be computationally intractable in complex environments. That is why algorithms like MCMC and reinforcement learning are used. So, making an argument for Orthogonality in terms of Bayesianism and VNM is simply dodging the question, by already assuming an extremely high intelligence level from the start.

As the Orthogonality Thesis refers to "values" or "final goals" (which I take to be synonymous), it must have a notion of the "values" of agents that are not extremely intelligent. These values cannot be assumed to be VNM, since VNM is not computationally tractable. Meanwhile, money-pumping arguments suggest that extremely intelligent agents will tend to converge to VNM-ish preferences. Thus:

Argument from Bayes/VNM: Agents with low intelligence will tend to have beliefs/values that are far from Bayesian/VNM. Agents with high intelligence will tend to have beliefs/values that are close to Bayesian/VNM. Strong Orthogonality is false because it is awkward to combine low intelligence with Bayesian/VNM beliefs/values, and awkward to combine high intelligence with far-from-Bayesian/VNM beliefs/values. Weak Orthogonality is in doubt, because having far-from-Bayesian/VNM beliefs/values puts a limit on the agent's intelligence.

To summarize: un-intelligent agents cannot be assumed to be Bayesian/VNM from the start. Those arise at a limit of intelligence, and arguably have to arise due to money-pumping arguments. Beliefs/values therefore tend to become more Bayesian/VNM with high intelligence, contradicting Strong Orthogonality and perhaps Weak Orthogonality.
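The money-pumping consideration can be made concrete with a small sketch. The item names, the fee, and the function names below are illustrative assumptions, not anything from the original argument: an agent with cyclic preferences pays a fee for every "upgrade" and ends up holding what it started with, while an agent whose preferences come from a utility function cannot be exploited this way.

```python
def run_trades(prefers, start, offers, fee=1):
    """Offer items one at a time; the agent pays `fee` to swap whenever it
    strictly prefers the offered item to what it currently holds."""
    holding, paid = start, 0
    for item in offers:
        if prefers(item, holding):
            holding, paid = item, paid + fee
    return holding, paid


# Cyclic (non-VNM) preferences: A over B, B over C, and C over A.
CYCLE = {("A", "B"), ("B", "C"), ("C", "A")}


def prefers_cyclic(x, y):
    return (x, y) in CYCLE


# Transitive preferences induced by a (hypothetical) utility function.
UTILITY = {"A": 3, "B": 2, "C": 1}


def prefers_transitive(x, y):
    return UTILITY[x] > UTILITY[y]


offers = ["C", "B", "A"] * 2  # two trips around the cycle
cyclic_result = run_trades(prefers_cyclic, "A", offers)
transitive_result = run_trades(prefers_transitive, "A", offers)
print(cyclic_result)      # ('A', 6): same item as at the start, six fees poorer
print(transitive_result)  # ('A', 0): never trades
```

This only illustrates why exploitable preferences are unstable under pressure; it does not show that bounded agents can cheaply detect or repair such cycles.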

One could perhaps object that logical uncertainty allows even weak agents to be Bayesian over combined physical/mathematical uncertainty; I'll address this consideration later.

Belief/value duality

It may be unclear why the Argument from Bayes/VNM refers to both beliefs and values, as the Orthogonality Thesis is only about values. It would, indeed, be hard to make the case that the Orthogonality Thesis is true as applied to beliefs. However, various arguments suggest that Bayesian beliefs and VNM preferences are "dual" such that complexity can be moved from one to the other.

Abram Demski has presented this general idea in the past, and I'll give a simple example to illustrate.

Let $A$ be the agent's action, and let $W \in \mathcal{W}$ represent the state of the world prior to / unaffected by the agent's action. Let r(A, W) be the outcome resulting from the action and world. Let P(w) be the primary agent's probability of a given world w. Let U(o) be the primary agent's utility for outcome o. The primary agent finds an action a to maximize $\sum_{w \in \mathcal{W}} P(w) U(r(a, w))$.

Now let e be an arbitrary predicate on worlds. Consider modifying P to increase the probability that e(W) is true. That is:

$P'(w) :\propto P(w) (1 + [e(w)])$

$P'(w) = \frac{P(w) (1 + [e(w)])}{\sum_{w' \in \mathcal{W}} P(w') (1 + [e(w')])}$

where [e(w)] equals 1 if e(w), otherwise 0. Now, can we define a modified utility function U' so a secondary agent with beliefs P' and utility function U' will take the same action as the primary agent? Yes:

$U'(o) := \frac{U(o)}{1 + [e(w)]}$

(Strictly, this requires the outcome o = r(a, w) to determine the world w, so that U' is a well-defined function of outcomes.)

This secondary agent will find an action a to maximize:

$\sum_{w \in \mathcal{W}} P'(w) U'(r(a, w))$

$= \sum_{w \in \mathcal{W}} \frac{P(w) (1 + [e(w)])}{\sum_{w' \in \mathcal{W}} P(w') (1 + [e(w')])} \frac{U(r(a, w))}{1 + [e(w)]}$

$= \frac{1}{\sum_{w' \in \mathcal{W}} P(w') (1 + [e(w')])} \sum_{w \in \mathcal{W}} P(w) U(r(a, w))$

Clearly, this is a positive constant times the primary agent's maximization target, so the secondary agent will take the same action.
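As a sanity check, the construction can be run numerically on a small finite example. This is a minimal sketch under stated assumptions: the world and action sets, the random prior and utilities, and the predicate `e` are all made up for illustration, and the outcome is taken to be the pair (a, w) so that it determines the world.

```python
import random


def best_action(actions, worlds, prob, util):
    """Return the action maximizing sum_w prob[w] * util(a, w), plus all scores."""
    scores = {a: sum(prob[w] * util(a, w) for w in worlds) for a in actions}
    return max(actions, key=lambda a: scores[a]), scores


rng = random.Random(0)
worlds = list(range(6))
actions = list(range(4))

# Primary agent: an arbitrary prior P and an arbitrary utility over outcomes,
# where the outcome r(a, w) is simply the pair (a, w).
raw = [rng.random() for _ in worlds]
P = {w: x / sum(raw) for w, x in zip(worlds, raw)}
U_table = {(a, w): rng.uniform(-1, 1) for a in actions for w in worlds}


def U(a, w):
    return U_table[(a, w)]


def e(w):
    """Arbitrary predicate on worlds (here: even-numbered worlds)."""
    return w % 2 == 0


# Secondary agent: beliefs shifted towards e, utility scaled down to compensate.
boost = {w: 1 + int(e(w)) for w in worlds}
Z = sum(P[w] * boost[w] for w in worlds)
P2 = {w: P[w] * boost[w] / Z for w in worlds}


def U2(a, w):
    return U(a, w) / boost[w]


a1, s1 = best_action(actions, worlds, P, U)
a2, s2 = best_action(actions, worlds, P2, U2)
print(a1 == a2)  # both agents pick the same action
```

The secondary agent's scores are the primary agent's scores divided by the normalizer `Z`, which is why the ranking of actions is unchanged.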

This demonstrates a basic way that Bayesian beliefs and VNM utility are dual to each other. One could even model all agents as having the same utility function (of maximizing a random variable U) and simply having different beliefs about what U values are implied by the agent's action and world state. Thus:

Argument from belief/value duality: From an agent's behavior, multiple belief/value combinations are valid attributions. This is clearly true in the limiting Bayes/VNM case, suggesting it also applies in the case of bounded rationality. It is unlikely that the Strong Orthogonality Thesis applies to beliefs (including priors), so, due to the duality, it is also unlikely that it applies to values.

I consider this weaker than the Argument from Bayes/VNM. Someone might object that both values and a certain component of beliefs are orthogonal, while the other components of beliefs (those that change with more reasoning/intelligence) aren't. But I think this depends on a certain factorizability of beliefs/values into the kind that change on reflection and those that don't, and I'm skeptical of such factorizations. I think discussion of logical uncertainty will make my position on this clearer, though, so let's move on.

Logical uncertainty as a model for bounded rationality

I've already argued that bounded rationality is essential to intelligence (and therefore the Orthogonality Thesis). Logical uncertainty is a form of bounded rationality (as applied to guessing the probabilities of mathematical statements). Therefore, discussing logical uncertainty is likely to be fruitful with respect to the Orthogonality Thesis.

Logical Induction is a logical uncertainty algorithm that produces a probability table for a finite subset of mathematical statements at each iteration. These beliefs are determined by a betting market of an increasing (up to infinity) number of programs that make bets, with the bets resolved by a "deductive process" that is basically a theorem prover. The algorithm is computable, though extremely computationally intractable, and has properties in the limit including some forms of Bayesian updating, statistical learning, and consistency over time.

We can see Logical Induction as evidence against the Diagonality Thesis: beliefs about undecidable statements (which exist in consistent theories due to Gödel's first incompleteness theorem) can take on any probability in the limit, though they must satisfy properties such as consistency with other assigned probabilities (in a Bayesian-like manner).

However, (a) it is hard to know ahead of time which statements are actually undecidable, and (b) even beliefs about undecidable statements tend to predictably change over time towards Bayesian consistency with other beliefs about undecidable statements. So, Logical Induction does not straightforwardly factorize into a "belief-like" component (which converges on enough reflection) and a "value-like" component (which doesn't change on reflection). Thus:

Argument from Logical Induction: Logical Induction is a current best-in-class model of theoretical asymptotic bounded rationality. Logical Induction is non-Diagonal, but also clearly non-Orthogonal, and doesn't apparently factorize into separate Orthogonal and Diagonal components. Combined with considerations from "Argument from belief/value duality", this suggests that it's hard to identify all value-like components in advanced agents that are Orthogonal in the sense of not tending to change upon reflection.

One can imagine, for example, introducing extra function/predicate symbols, representing utility, into the logical theory that the logical inductor reasons over. Logical induction will tend to make judgments about these functions/predicates more consistent and inductively plausible over time, changing its judgments about the utilities of different outcomes towards plausible logical probabilities. This is an Oblique (non-Orthogonal and non-Diagonal) change in the interpretation of the utility symbol over time.

Likewise, Logical Induction can be specified to have beliefs over empirical facts such as observations by adding additional function/predicate symbols, and can perhaps update on these as they come in (although this might contradict UDT-type considerations). Through more iteration, Logical Inductors will come to have more approximately Bayesian, and inductively plausible, beliefs about these empirical facts, in an Oblique fashion.

Even if there is a way of factorizing out an Orthogonal value-like component from an agent, the belief-component (represented by something like Logical Induction) remains non-Diagonal, so there is still a potential "alignment problem" for these non-Diagonal components to match, say, human judgments in the limit. I don't see evidence that these non-Diagonal components factor into a value-like "prior over the undecidable" that does not change upon reflection. So, there remain components of something analogous to a "final goal" (by belief/value duality) that are Oblique, and within the scope of alignment.

If it were possible to get the properties of Logical Induction in a Bayesian system, which makes Bayesian updates on logical facts over time, that would make it more plausible that an Orthogonal logical prior could be specified ahead of time. However, MIRI researchers have tried for a while to find Bayesian interpretations of Logical Induction, and failed, as would be expected from the Argument from Bayes/VNM.

Naive belief/value factorizations lead to optimization daemons

The AI alignment field has a long history of poking holes in alignment approaches. Oops, you tried making an oracle AI and it manipulated real-world outcomes to make its predictions true. Oops, you tried to do Solomonoff induction and got invaded by aliens. Oops, you tried getting agents to optimize over a virtual physical universe, and they discovered the real world and tried to break out. Oops, you ran a Logical Inductor and one of the traders manipulated the probabilities to instantiate itself in the real world.

These sub-processes that take over are known as optimization daemons. When you get the agent architecture wrong, sometimes a sub-process (that runs a massive search over programs, such as with Solomonoff Induction) will luck upon a better agent architecture and out-compete the original system. (See also a very strange post I wrote some years back while thinking about this issue, and Christiano's comment relating it to Orthogonality.)

If you apply a naive belief/value factorization to create an AI architecture, when compute is scaled up sufficiently, optimization daemons tend to break out, showing that this factorization was insufficient. Enough experiences like this lead to the conclusion that, if there is a realistic belief/value factorization at all, it will look pretty different from the naive one. Thus:

Argument from optimization daemons: Naive ways of factorizing an agent into beliefs/values tend to lead to optimization daemons, which have different values from those in the original factorization. Any successful belief/value factorization will probably look pretty different from the naive one, and might not take the form of factorization into Diagonal belief-like components and Orthogonal value-like components. Therefore, if any realistic formulation of Orthogonality exists, it will be hard to find and substantially different from naive notions of Orthogonality.

Intelligence changes the ontology values are expressed in

The most straightforward way to specify a utility function is to specify an ontology (a theory of what exists, similar to a database schema) and then provide a utility function over elements of this ontology. Prior to humans learning about physics, evolution (taken as a design algorithm for organisms involving mutation and selection) did not know all that human physicists know. Therefore, human evolutionary values are unlikely to be expressed in the ontology of physics that physicists currently believe in.

Human evolutionary values probably care about things like eating enough, social acceptance, proxies for reproduction, etc. It is unknown how these are specified, but perhaps sensory signals (such as stomach signals) are connected with a developing world model over time. Humans can experience vertigo at learning physics, e.g. thinking that free will and morality are fake, leading to unclear applications of native values to a realistic physical ontology. Physics has known gaps (such as quantum/relativity correspondence, and dark energy/dark matter) that suggest further ontology shifts.

One response to this vertigo is to try to solve the ontology identification problem; find a way of translating states in the new ontology (such as physics) to an old one (such as any kind of native human ontology), in a structure-preserving way, such that a utility function over the new ontology can be constructed as a composition of the original utility function and the new-to-old ontological mapping. Current solutions, such as those discussed in MIRI's Ontological Crises paper, are unsatisfying. Having looked at this problem for a while, I'm not convinced there is a satisfactory solution within the constraints presented. Thus:

Argument from ontological change: More intelligent agents tend to change their ontology to be more realistic. Utility functions are most naturally expressed relative to an ontology. Therefore, there is a correlation between an agent's intelligence and utility function, through the agent's ontology as an intermediate variable, contradicting Strong Orthogonality. There is no known solution for rescuing the old utility function in the new ontology, and some research intuitions point towards any solution being unsatisfactory in some way.

If a satisfactory solution is found, I'll change my mind on this argument, of course, but I'm not convinced such a satisfactory solution exists. To summarize: higher intelligence causes ontological changes, and rescuing old values seems to involve unnatural "warps" to make the new ontology correspond with the old one, contradicting at least Strong Orthogonality, and possibly Weak Orthogonality (if some values are simply incompatible with realistic ontology). Paperclips, for example, tend to appear most relevant at an intermediate intelligence level (around human-level), and become more ontologically unnatural at higher intelligence levels.
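The composition described above (a utility function over the new ontology built from the original utility function and a new-to-old mapping) can be sketched in a few lines. Everything here is a hypothetical toy: the old ontology ("fed", "warm"), the new ontology (blood glucose, core temperature), and both threshold mappings are assumptions chosen only to show that the choice of mapping is itself a free parameter.

```python
def old_utility(old_state):
    """Utility over the old, coarse ontology (hypothetical weights)."""
    return 1.0 * old_state["fed"] + 0.5 * old_state["warm"]


def map_a(new_state):
    """One candidate new-to-old mapping (thresholds are assumptions)."""
    return {
        "fed": new_state["glucose_mmol_per_l"] >= 4.0,
        "warm": new_state["core_temp_c"] >= 36.0,
    }


def map_b(new_state):
    """A second candidate that draws the 'fed' boundary elsewhere."""
    return {
        "fed": new_state["glucose_mmol_per_l"] >= 6.0,
        "warm": new_state["core_temp_c"] >= 36.0,
    }


def lift(utility, new_to_old):
    """Utility over the new ontology as utility composed with the mapping."""
    return lambda new_state: utility(new_to_old(new_state))


state = {"glucose_mmol_per_l": 5.0, "core_temp_c": 36.8}
u_a = lift(old_utility, map_a)(state)
u_b = lift(old_utility, map_b)(state)
print(u_a, u_b)  # 1.5 0.5
```

Both mappings are structure-preserving in a loose sense, yet they assign different utilities to the same fine-grained state, so the "rescued" utility function depends on a choice the original values did not determine.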

As a more general point, one expects possible mutual information between mental architecture and values, because values that "re-use" parts of the mental architecture achieve lower description length. For example, if the mental architecture involves creating universal algebra structures and finding analogies between them and the world, then values expressed in terms of such universal algebras will tend to have lower relative description complexity to the architecture. Such mutual information contradicts Strong Orthogonality, as some intelligence/value combinations are more natural than others.

Intelligence leads to recognizing value-relevant symmetries

Consider a number of unintuitive value propositions people have argued for:

Torture is preferable to Dust Specks, because it's hard to come up with a utility function with the alternative preference without horrible unintuitive consequences elsewhere.
People are way too risk-averse in betting; the implied utility function has too strong diminishing marginal returns to be plausible.
You may think your personal identity is based on having the same atoms, but you're wrong, because you're distinguishing identical configurations.
You may think a perfect upload of you isn't conscious (and basically another copy of you), but you're wrong, because functionalist theory of mind is true.
You intuitively accept the premises of the Repugnant Conclusion, but not the Conclusion itself; you're simply wrong about one of the premises, or the conclusion.

The point is not to argue for these, but to note that these arguments have been made and are relatively more accepted among people who have thought more about the relevant issues than people who haven't. Thinking tends to lead to noticing more symmetries and dependencies between value-relevant objects, and tends to adjust values to be more mathematically plausible and natural. Of course, extrapolating this to superintelligence leads to further symmetries. Thus:

Argument from value-relevant symmetries: More intelligent agents tend to recognize more symmetries related to value-relevant entities. They will also tend to adjust their values according to symmetry considerations. This is an apparent value change, and it's hard to see how it can instead be factored as a Bayesian update on top of a constant value function.

I'll examine such factorizations in more detail shortly.

Human brains don't seem to neatly factorize

This is less about the Orthogonality Thesis generally, and more about human values. If there were separable "belief components" and "value components" in the human brain, with the value components remaining constant over time, that would increase the chance that at least some Orthogonal component can be identified in human brains, corresponding with "human values" (though, remember, the belief-like component can also be Oblique rather than Diagonal).

However, human brains seem much more messy than the sort of computer program that could factorize this way. Different brain regions are connected in at least some ways that are not well-understood. Additionally, even apparent "value components" may be analogous to something like a deep Q-learning function, which incorporates empirical updates in addition to pre-set "values".
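To illustrate the Q-learning analogy, here is a minimal sketch using tabular Q-learning as a simplified stand-in for deep Q-learning. The tiny two-step environment, the state and outcome names, and the hyperparameters are all assumptions made for illustration. The pre-set "values" (the reward table) are identical in both worlds; only the dynamics differ, yet the learned Q-values, which are what actually drive action, come out different.

```python
import random

REWARD = {"food": 1.0, "nothing": 0.0}  # pre-set "values" over outcomes


def q_learn(dynamics, episodes=2000, alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning on a tiny deterministic MDP.

    `dynamics` maps (state, action) -> (next_state, outcome); a next_state of
    None ends the episode, and an outcome of None carries no reward.
    """
    rng = random.Random(seed)
    states = {s for s, _ in dynamics}
    actions = sorted({a for _, a in dynamics})
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = "start"
        while s is not None:
            a = rng.choice(actions)  # pure exploration; Q-learning is off-policy
            nxt, outcome = dynamics[(s, a)]
            r = REWARD.get(outcome, 0.0)
            future = 0.0 if nxt is None else max(q[(nxt, b)] for b in actions)
            q[(s, a)] += alpha * (r + gamma * future - q[(s, a)])
            s = nxt
    return q


def make_world(food_place):
    """Going left reaches the field, right the desert; food is in one of them."""
    dyn = {("start", "left"): ("field", None), ("start", "right"): ("desert", None)}
    for place in ("field", "desert"):
        outcome = "food" if place == food_place else "nothing"
        for a in ("left", "right"):
            dyn[(place, a)] = (None, outcome)
    return dyn


q_field = q_learn(make_world("field"))
q_desert = q_learn(make_world("desert"))
print(q_field[("start", "left")] > q_field[("start", "right")])    # True
print(q_desert[("start", "left")] < q_desert[("start", "right")])  # True
```

Reading "the agent's values" off the Q-table would attribute a preference for going left in one world and right in the other, even though the reward table never changed.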

The interaction between human brains and language is also relevant. Humans develop values they act on partly through language. And language (including language reporting values) is affected by empirical updates and reflection, thus non-Orthogonal. Reflecting on morality can easily change people's expressed and acted-upon values, e.g. in the case of Peter Singer. People can change which values they report as instrumental or terminal even while behaving similarly (e.g. flipping between selfishness-as-terminal and altruism-as-terminal), with the ambiguity hard to resolve because most behavior relates to convergent instrumental goals.

Maybe language is more of an effect than cause of values. But there really seems to be feedback from language to non-linguistic brain functions that decide actions and so on. Attributing coherent values over realistic physics to the brain parts that are non-linguistic seems like a form of projection or anthropomorphism. Language and thought have a function in cognition and attaining coherent values over realistic ontologies. Thus:

Argument from brain messiness: Human brains don't seem to neatly factorize into a belief-component and a value-component, with the value-component unaffected by reflection or language (which it would need to be Orthogonal). To the extent any value-component does not change due to language or reflection, it is restricted to evolutionary human ontology, which is unlikely to apply to realistic physics; language and reflection are part of the process that refines human values, rather than being an afterthought of them. Therefore, if the Orthogonality Thesis is true, humans lack identifiable values that fit into the values axis of the Orthogonality Thesis.

This doesn't rule out that Orthogonality could apply to superintelligences, of course, but it does raise questions for the project of aligning superintelligences with human values; perhaps such values do not exist or are not formulated so as to apply to the actual universe.

Models of ASI should start with realism

Some may take arguments against Orthogonality to be disturbing at a value level, perhaps because they are attached to research projects such as Friendly AI (or more specific approaches), and think questioning foundational assumptions would make the objective (such as alignment with already-existing human values) less clear. I believe "hold off on proposing solutions" applies here: better strategies are likely to come from first understanding what is likely to happen absent a strategy, then afterwards looking for available degrees of freedom.

Quoting Yudkowsky:

Orthogonality is meant as a descriptive statement about reality, not a normative assertion. Orthogonality is not a claim about the way things ought to be; nor a claim that moral relativism is true (e.g. that all moralities are on equally uncertain footing according to some higher metamorality that judges all moralities as equally devoid of what would objectively constitute a justification). Claiming that paperclip maximizers can be constructed as cognitive agents is not meant to say anything favorable about paperclips, nor anything derogatory about sapient life.

Likewise, Obliqueness does not imply that we shouldn't think about the future and ways of influencing it, that we should just give up on influencing the future because we're doomed anyway, that moral realist philosophers are correct or that their moral theories are predictive of ASI, that ASIs are necessarily morally good, and so on. The Friendly AI research program was formulated based on descriptive statements believed at the time, such as that an ASI singleton would eventually emerge, that the Orthogonality Thesis is basically true, and so on. Whatever cognitive process formulated this program would have formulated a different program conditional on different beliefs about likely ASI trajectories. Thus:

Meta-argument from realism: Paths towards beneficially achieving human values (or analogues, if "human values" don't exist) in the far future likely involve a lot of thinking about likely ASI trajectories absent intervention. The realistic paths towards human influence on the far future depend on realistic forecasting models for ASI, with Orthogonality/Diagonality/Obliqueness as alternative forecasts. Such forecasting models can be usefully thought about prior to formulation of a research program intended to influence the far future. Formulating and working from models of bounded rationality such as Logical Induction is likely to be more fruitful than assuming that bounded rationality will factorize into Orthogonal and Diagonal components without evidence in favor of this proposition. Forecasting also means paying more attention to the Strong Orthogonality Thesis than the Weak Orthogonality Thesis, as statistical correlations between intelligence and values will show up in such forecasts.

On Yudkowsky's arguments

Now that I've explained my own position, addressing Yudkowsky's main arguments may be useful. His main argument has to do with humans making paperclips instrumentally:

Suppose some strange alien came to Earth and credibly offered to pay us one million dollars' worth of new wealth every time we created a paperclip. We'd encounter no special intellectual difficulty in figuring out how to make lots of paperclips.

That is, minds would readily be able to reason about:

How many paperclips would result, if I pursued a policy $π_{0}$ ?

How can I search out a policy $π$ that happens to have a high answer to the above question?

I believe it is better to think of the payment as coming in the far future and perhaps in another universe; that way, the belief about future payment is more analogous to terminal values than instrumental values. In this case, creating paperclips is a decent proxy for achievement of human value, so long-termist humans would tend to want lots of paperclips to be created.

I basically accept this, but, notably, Yudkowsky's argument is based on belief/value duality. He thinks it would be awkward for the reader to imagine terminally wanting paperclips, so he instead asks them to imagine a strange set of beliefs leading to paperclip production being oddly correlated with human value achievement. Thus, acceptance of Yudkowsky's premises here will tend to strengthen the Argument from belief/value duality and related arguments.

In particular, more intelligence would cause human-like agents to develop different beliefs about what actions aliens are likely to reward, and what numbers of paperclips different policies result in. This points towards Obliqueness as with Logical Induction: such beliefs will be revised (but not totally convergent) over time, leading to applying different strategies toward value achievement. And ontological issues around what counts as a paperclip will come up at some point, and likely be decided in a prior-dependent but also reflection-dependent way.

Beliefs about which aliens are most capable/honest likely depend on human priors, and are therefore Oblique: humans would want to program an aligned AI to mostly match these priors while revising beliefs along the way, but can't easily factor out their prior for the AI to share.

Now onto other arguments. The "Size of mind design space" argument implies many agents exist with different values from humans, which agrees with Obliqueness (intelligent agents tend to have different values from unintelligent ones). It's more of an argument about the possibility space than statistical correlation, thus being more about Weak than Strong Orthogonality.

The "Instrumental Convergence" argument doesn't appear to be an argument for Orthogonality per se; rather, it's a counter to arguments against Orthogonality based on noticing convergent instrumental goals. My arguments don't take this form.

Likewise, "Reflective Stability" is about a particular convergent instrumental goal (preventing value modification). In an Oblique framing, a Logical Inductor will tend not to change its beliefs about even un-decidable propositions too often (as this would lead to money-pumps), so consistency is valued all else being equal.

While I could go into more detail responding to Yudkowsky, I think space is better spent presenting my own Oblique views for now.

Conclusion

As an alternative to the Orthogonality Thesis and the Diagonality Thesis, I present the Obliqueness Thesis, which says that increasing intelligence tends to lead to value changes but not total value convergence. I have presented arguments that advanced agents and humans do not neatly factor into Orthogonal value-like components and Diagonal belief-like components, using Logical Induction as a model of bounded rationality. This implies complications to theories of AI alignment based on assuming humans have values and we need the AGI to agree about those values, while increasing their intelligence (and thus changing beliefs).

At a methodological level, I believe it is productive to start by forecasting default ASI using models of bounded rationality, especially known models such as Logical Induction, and further developing such models. I think this is more productive than assuming that these models will take the form of a belief/value factorization, although I have some uncertainty about whether such a factorization will be found.

If the Obliqueness Thesis is accepted, what possibility space results? One could think of this as steering a boat in a current of varying strength. Clearly, ignoring the current and just steering where you want to go is unproductive, as is just going along with the current and not trying to steer at all. Getting to where one wants to go consists in largely going with the current (if it's strong enough), charting a course that takes it into account.

Assuming Obliqueness, it's not viable to have large impacts on the far future without accepting some value changes that come from higher intelligence (and better epistemology in general). The Friendly AI research program already accepts that paths towards influencing the far future involve "going with the flow" regarding superintelligence, ontology changes, and convergent instrumental goals; Obliqueness says such flows go further than just these, being hard to cleanly separate from values.

Obliqueness obviously leaves open the question of just how oblique. It's hard to even formulate a quantitative question here. I'd very intuitively and roughly guess that intelligence and values are 3 degrees off (that is, almost diagonal), but it's unclear what question I am even guessing the answer to. I'll leave formulating and answering the question as an open problem.

I think Obliqueness is realistic, and that it's useful to start with realism when thinking of how to influence the far future. Maybe superintelligence necessitates significant changes away from current human values; the Litany of Tarski [? · GW] applies. But this post is more about the technical thesis than emotional processing of it, so I'll end here.

19 comments


comment by Wei Dai (Wei_Dai) · 2024-09-19T15:52:13.011Z · LW(p) · GW(p)

As long as all mature superintelligences in our universe don't necessarily have (end up with) the same values, and only some such values can be identified with our values or what our values should be, AI alignment seems as important as ever. You mention "complications" from obliqueness, but haven't people like Eliezer recognized similar complications pretty early, with ideas such as CEV?

It seems to me that from a practical perspective, as far as what we should do, your view is much closer to Eliezer's view than to Land's view (which implies that alignment doesn't matter and we should just push to increase capabilities/intelligence). Do you agree/disagree with this?

It occurs to me that maybe you mean something like "Our current (non-extrapolated) values are our real values, and maybe it's impossible to build or become a superintelligence that shares our real values so we'll have to choose between alignment and superintelligence." Is this close to your position?

Replies from: jessica.liu.taylor

↑ comment by jessicata (jessica.liu.taylor) · 2024-09-19T16:56:19.428Z · LW(p) · GW(p)

"as important as ever": no, because our potential influence is lower, and the influence isn't on things shaped like our values, there has to be a translation, and the translation is different from the original.

CEV: while it addresses "extrapolation" it seems broadly based on assuming the extrapolation is ontologically easy, and "our CEV" is an unproblematic object we can talk about (even though it's not mathematically formalized, any formalization would be subject to doubt, and even if formalized, we need logical uncertainty over it, and logical induction has additional free parameters in the limit). I'm really trying to respond to orthogonality not CEV though.

from a practical perspective: notice that I am not behaving like Eliezer Yudkowsky. I am not saying the Orthogonality Thesis is true and important to ASI, I am instead saying intelligence/values are Oblique and probably nearly Diagonal (though it's unclear what I mean by "nearly"). I am not saying a project of aligning superintelligence with human values is a priority. I am not taking research approaches that assume a Diagonal/Orthogonal factorization. I left MIRI partially because I didn't like their security policies (and because I had longer AI timelines), I thought discussion of abstract research ideas was more important. I am not calling for a global AI shutdown so this project (which is in my view confused) can be completed. I am actually against AI regulation on the margin (I don't have a full argument for this, it's a political matter at this point).

I think practicality looks more like having near-term preferences related to modest intelligence increases (as with current humans vs humans with neural nets; how do neural nets benefit or harm you, practically? how can you use them to think better and improve your life?), and not expecting your preferences to extend into the distant future with many ontology changes, so don't worry about grabbing hold of the whole future etc, think about how to reduce value drift while accepting intelligence increases on the margin. This is a bit like CEV except CEV is in a thought experiment instead of reality.

The "Models of ASI should start with realism" bit IS about practicalities, namely, I think focusing on first forecasting absent a strategy of what to do about the future is practical with respect to any possible influence on the far future; practically, I think your attempted jump to practicality (which might be related to philosophical pragmatism) is impractical in this context.

It occurs to me that maybe you mean something like "Our current (non-extrapolated) values are our real values, and maybe it's impossible to build or become a superintelligence that shares our real values so we'll have to choose between alignment and superintelligence." Is this close to your position?

Close. Alignment of already-existing human values with superintelligence is impossible (I think) because of the arguments given. That doesn't mean humans have no preferences indirectly relating to superintelligence (especially, we have preferences about modest intelligence increases, and there's some iterative process).

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2024-09-19T17:55:43.532Z · LW(p) · GW(p)

What do you think about my positions on these topics as laid out in Six Plausible Meta-Ethical Alternatives [LW · GW] and Ontological Crisis in Humans [LW · GW]?

My overall position can be summarized as being uncertain about a lot of things, and wanting (some legitimate/trustworthy group, i.e., not myself as I don't trust myself with that much power) to "grab hold of the whole future" in order to preserve option value, in case grabbing hold of the whole future turns out to be important. (Or some other way of preserving option value, such as preserving the status quo / doing AI pause.) I have trouble seeing how anyone can justifiably conclude "so don’t worry about grabbing hold of the whole future" as that requires confidently ruling out various philosophical positions as false, which I don't know how to do. Have you reflected a bunch and really think you're justified in concluding this?

E.g. in Ontological Crisis in Humans I wrote "Maybe we can solve many ethical problems simultaneously by discovering some generic algorithm that can be used by an agent to transition from any ontology to another?" which would contradict your "not expecting your preferences to extend into the distant future with many ontology changes" and I don't know how to rule this out. You wrote in the OP "Current solutions, such as those discussed in MIRI’s Ontological Crises paper, are unsatisfying. Having looked at this problem for a while, I’m not convinced there is a satisfactory solution within the constraints presented." but to me this seems like very weak evidence for the problem being actually unsolvable.

Replies from: jessica.liu.taylor

↑ comment by jessicata (jessica.liu.taylor) · 2024-09-19T23:11:14.206Z · LW(p) · GW(p)

re meta ethical alternatives:

1. roughly my view
2. slight change, opens the question of why the deviations? are the "right things to value" not efficient to value in a competitive setting? mostly I'm trying to talk about those things to value that go along with intelligence, so it wouldn't correspond with a competitive disadvantage in general. so it's still close enough to my view
3. roughly Yudkowskian view, main view under which the FAI project even makes sense. I think one can ask basic questions like which changes move towards more rationality on the margin, though such changes would tend to prioritize rationality over preventing value drift. I'm not sure how much there are general facts about how to avoid value drift (it seems like the relevant kind, i.e. value drift as part of becoming more rational/intelligent, only exists from irrational perspectives, in a way dependent on the mind architecture)
4. minimal CEV-realist view. it really seems up to agents how much they care about their reflected preferences. maybe changing preferences too often leads to money pumps, or something?
5. basically says "there are irrational and rational agents, rationality doesn't apply to irrational agents", seems somewhat how people treat animals (we don't generally consider uplifting normative with respect to animals)
6. at this point you're at something like ecology / evolutionary game theory, it's a matter of which things tend to survive/reproduce and there aren't general decision theories that succeed

re human ontological crises: basically agree, I think it's reasonably similar to what I wrote. roughly my reason for thinking that it's hard to solve is that the ideal case would be something like a universal algebra homomorphism (where the new ontology actually agrees with the old one but is more detailed), yet historical cases like physics aren't homomorphic to previous ontologies in this way, so there is some warping necessary. you could try putting a metric on the warping and minimizing it, but, well, why would someone think the metric is any good, it seems more of a preference than a thing rationality applies to. if you think about it and come up with a solution, let me know, of course.

with respect to grabbing hold of the whole future: you can try looking at historical cases of people trying to grab hold of the future and seeing how that went, it's a mixed bag with mostly negative reputation, indicating there are downsides as well as upsides, it's not a "safe" conservative view. see also Against Responsibility. I feel like there's a risk of getting Pascal's mugged about "maybe grabbing hold of the future is good, you can't rule it out, so do it", there are downsides to spending effort that way. like, suppose some Communists thought capitalism would lead to the destruction of human value with high enough probability that instituting global communism is the conservative option, it doesn't seem like that worked well (even though a lot of people around here would agree that capitalism tends to lead to human value destruction in the long run). particular opportunities for grabbing hold of the future can be net negative and not worth worrying about even if one of them is a good idea in the long run (I'm not ruling that out, just would have to be convinced of specific opportunities).

overall I'd rather focus on first modeling the likely future and looking for plausible degrees of freedom; a general issue with Pascal's mugging is it might make people overly attached to world models in which they have ~infinite impact (e.g. Christianity, Communism) which means paying too much attention to wrong world models, not updating to more plausible models in which existential-stakes decisions could be comprehended if they exist. and Obliqueness doesn't rule out existential stakes (since it's non-Diagonal).

as another point, Popperian science tends to advance by people making falsifiable claims, "you don't know if that's true" isn't really an objection in that context. the pragmatic claim I would make is: I have some Bayesian reason to believe agents do not in general factor into separate Orthogonal and Diagonal components, this claim is somewhat falsifiable (someone could figure out a theory of this invulnerable to optimization daemons etc), I'm going to spend my attention on the branch where I'm right, I'm not going to worry about Pascal's mugging type considerations for if I'm wrong (as I said, modeling the world first seems like a good general heuristic), people can falsify it eventually if it's false.

this whole discussion is not really a defense of Orthogonality given that Yudkowsky presented orthogonality as a descriptive world model, not a normative claim, so sticking to the descriptive level in the original post seems valid; it would be a form of bad epistemology to reject a descriptive update (assuming the arguments are any good) because of pragmatic considerations.

Replies from: habryka4

↑ comment by habryka (habryka4) · 2024-09-19T23:17:39.538Z · LW(p) · GW(p)

with respect to grabbing hold of the whole future: you can try looking at historical cases of people trying to grab hold of the future and seeing how that went, it's a mixed bag with mostly negative reputation, indicating there are downsides as well as upsides, it's not a "safe" conservative view. see also Against Responsibility. I feel like there's a risk of getting Pascal's mugged about "maybe grabbing hold of the future is good, you can't rule it out, so do it", there are downsides to spending effort that way.

I agree with a track-record argument of this, but I think the track record of people trying to broadly ensure that humanity continues to be in control of the future (while explicitly not optimizing for putting themselves personally in charge) seems pretty good to me.

Generally a lot of industrialist and human-empowerment stuff has seemed pretty good to me on track record, and I really feel like all the bad parts of this are screened off by the "try to put yourself and/or your friends in charge" component.

Replies from: gallabytes, jessica.liu.taylor

↑ comment by gallabytes · 2024-09-20T02:00:43.759Z · LW(p) · GW(p)

the track record of people trying to broadly ensure that humanity continues to be in control of the future

What track record?

↑ comment by jessicata (jessica.liu.taylor) · 2024-09-19T23:26:43.908Z · LW(p) · GW(p)

hmm, I wouldn't think of industrialism and human empowerment as trying to grab the whole future, just part of it, in line with the relatively short term (human not cosmic timescale) needs of the self and extended community; industrialism seems to lead to capitalist organization which leads to decentralization superseding nations and such (as Land argues).

I think communism isn't generally about having oneself and one's friends in charge, it is about having human laborers in charge. One could argue that it tended towards nationalism (e.g. USSR), but I'm not convinced that global communism (Trotskyism) would have worked out well either. Also, one could take an update from communism about agendas for global human control leading to national control (see also tendency of AI safety to be taken over by AI national security as with the Situational Awareness paper). (Again, not ruling out that grabbing hold of the entire future could be a good idea at some point, just not sold on current agendas and wanted to note there are downsides that push against Pascal's mugging type considerations)

comment by habryka (habryka4) · 2024-09-19T23:06:05.191Z · LW(p) · GW(p)

While I believe Scott Garrabrant and/or Abram Demski have discussed such duality, I haven't found a relevant post on the Alignment Forum about this, so I'll present the basic idea in this post.

There is a post on this. It's one of my favorite posts: https://www.lesswrong.com/posts/oheKfWA7SsvpK7SGp/probability-is-real-and-value-is-complex [LW · GW]

Replies from: jessica.liu.taylor

↑ comment by jessicata (jessica.liu.taylor) · 2024-09-19T23:13:33.506Z · LW(p) · GW(p)

Thanks, going to link this!

comment by AnthonyC · 2024-09-20T14:56:41.599Z · LW(p) · GW(p)

Why should I agree that a boundedly rational agent's goals need to be computationally tractable? Humans have goals and desires they lack the capability to achieve all the time. Sometimes they make plans to try to increase tractability, and sometimes those plans work, but there's nothing odd about intractable goals. It might be a mistake in some senses to build such an agent, but that's a different question.

Replies from: jessica.liu.taylor

↑ comment by jessicata (jessica.liu.taylor) · 2024-09-20T16:34:15.947Z · LW(p) · GW(p)

Computationally tractable is Yudkowsky's framing and might be too limited. The kind of thing I believe is for example, an animal without a certain brain complexity will tend not to be a social animal and is therefore unlikely to have the sort of values social animals have. And animals that can't do math aren't going to value mathematical aesthetics the way human mathematicians do.

Replies from: AnthonyC

↑ comment by AnthonyC · 2024-09-20T22:39:02.010Z · LW(p) · GW(p)

Ah ok, that makes sense. That's more about being able to understand what the goal is, not about the ability to compute what actions are able to achieve it.

comment by ||||| (infini-tesimal) · 2025-02-19T04:41:55.378Z · LW(p) · GW(p)

My moral philosophy is rusty at best, but doesn't Putnam make a similar argument about the factorization of fact and value?

Replies from: jessica.liu.taylor

↑ comment by jessicata (jessica.liu.taylor) · 2025-02-19T21:38:59.646Z · LW(p) · GW(p)

Wasn't familiar. Seems similar in that facts/values are entangled. I was more familiar with Cuneo for that.

comment by Cole Wyeth (Amyr) · 2024-11-20T17:17:32.854Z · LW(p) · GW(p)

I think that at least the weak orthogonality thesis survives these arguments in the sense that any coherent utility function over an ontology "closely matching" reality should in principle be reachable for arbitrarily intelligent agents, along some path of optimization/learning. Your only point that seems to contradict this is the existence of optimization daemons, but I'm confident that an anti-daemon immune system can be designed, so any agent that chooses to design itself in a way where it can be overtaken by daemons must do this with the knowledge that something close to its values will still be optimized for - so this shouldn't cause much observable shift in values.

It's unclear how much measure is assigned to various "final/limiting" utility functions by various agent construction schemes - I think this is far beyond our current technical ability to answer.

Personally, I suspect that the angle is more like 60 degrees, not 3.

comment by winstonne · 2024-09-19T17:07:17.114Z · LW(p) · GW(p)

Hi! Long time lurker, first time commenter. You have written a great piece here. This is a topic that has fascinated me for a while and I appreciate what you've laid out.

I'm wondering if there's a base assumption on the whole intelligence vs values/beliefs/goals question that needs to be questioned.

sufficiently complex goals may only be well-represented by sufficiently intelligent agents

This statement points to my question. There's necessarily a positive correlation between internal complexity and intelligence, right? So, in order for intelligence to increase, internal complexity must also increase. My understanding is that complexity is a characteristic of dynamic and generative phenomena, and not of purely mechanical phenomena. So, what do we have to assume in order to posit a super-intelligent entity exists? It must have maintained its entity-ness over time in order to have increased its intelligence/complexity to its current level.

Has anyone explored what it takes for an agent to complexify? I would presume that for an agent to simultaneously continue existing and complexify it must maintain some type of fixpoint/set of autopoietic (self-maintenance, self-propagation) values/beliefs/goals throughout its dynamic evolution. If this were the case, wouldn't it be true that there must exist a set of values/beliefs/goals that are intrinsic to the agent's ability to complexify? Therefore there must be another set of values/beliefs/goals that are incompatible with self-complexification. If so, can we not put boundary conditions on what values/beliefs/goals are both necessary as well as incompatible with sufficiently intelligent, self-complexifying agents? After all, if we observe a complex agent, the probability of it arising whole-cloth and path-independently is vanishingly small, so it is safe to say that the observed entity has evolved to reach the observed state.

I don't think my observation is incompatible with your argument, but might place further limits on what relationships we can possibly see between entities of sufficient intelligence and their goals/values/beliefs than the limits you propose.

I think situations like a paperclip maximizer may still occur but they are degenerate cases where an evolutionarily fit entity spawns something that inherits much of its intrinsic complexity but loses its autopoietic fixpoint. Such systems do occur in nature, but to get that system, you must also assume a more-complex (and hopefully more intelligent/adapted) entity exists as well. This other entity would likely place adversarial pressure on the degenerate paperclip maximizer as it threatens its continued existence.

Some relationships/overlaps with your arguments are as follows:

totally agree with the belief/value duality
Naive belief/value factorizations lead to optimization daemons. The optimization daemons observation points to an agent's inability to maintain autopoiesis over time, implying misalignment of its values/beliefs/goals with its desire to increase its intelligence
Intelligence changes the ontology values are expressed in. I presume that any ontology expressed by an autopoietic embedded agent must maintain concepts of self, otherwise the embedded entity cannot continue to complexify over time, therefore there must be some fixpoint in ontological evolution that preserves the evolutionary drive of the entity in order for it to continue to advance its intelligence

Anyways, thank you for the essay.

Replies from: jessica.liu.taylor

↑ comment by jessicata (jessica.liu.taylor) · 2024-09-19T23:23:15.602Z · LW(p) · GW(p)

Not sure what you mean by complexity here, is this like code size / Kolmogorov complexity? You need some of that to have intelligence at all (the empty program is not intelligent). At some point most of your gains come from compute rather than code size. Though code size can speed things up (e.g. imagine sending a book back to 1000BC, that would speed people up a lot; consider that superintelligence sending us a book would be a bigger speedup)

by "complexify" here it seems you mean something like "develop extended functional organization", e.g. in brain development throughout evolution. And yeah, that involves dynamics with the environment and internal maintenance (evolution gets feedback from the environment). It seems it has to have a drive to do this which can either be a terminal or instrumental goal, though deriving it from instrumentals seems harder than baking it in as terminal (so I would guess evolution gives animals a terminal goal of developing functional complexity of mental structures etc, or some other drive that isn't exactly a terminal goal)

see also my post [LW · GW] relating optimization daemons to immune systems, it seems evolved organisms develop these; when having more extended functional organization, they protect it with some immune system functional organization.

to be competitive agents, having a "self" seems basically helpful, but might not be the best solution; selfish genes are an alternative, and perhaps extended notions of self can maintain competitiveness.

comment by romeostevensit · 2024-09-20T05:52:39.980Z · LW(p) · GW(p)

You mention 'warp' when talking about cross ontology mapping which seems like your best summary of a complicated intuition. I'd be curious to hear more (I recognize this might not be practical). My own intuition surfaced 'introducing degrees of freedom' a la indeterminacy of translation.

Replies from: jessica.liu.taylor

↑ comment by jessicata (jessica.liu.taylor) · 2024-09-20T16:32:14.892Z · LW(p) · GW(p)

Relativity to Newtonian mechanics is a warp in a straightforward sense. If you believe the layout of a house consists of some rooms connected in a certain way, but there are actually more rooms connected in different ways, getting the maps to line up looks like a warp. Basically, the closer the mapping is to a true homomorphism (in the universal algebra sense), the less warping there is, otherwise there are deviations intuitively analogous to space warps.
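
As a rough sketch of the homomorphism framing (a toy under stated assumptions, not a proposed metric): take an ontology to be a finite set with one binary operation, given as a lookup table, and a translation to be a map from a more detailed ontology to a coarser one. The "warp" is then counted as the number of input pairs where translating after operating disagrees with operating after translating. The function name `warp`, the choice of a single binary operation, and both example translations are hypothetical.

```python
from itertools import product


def warp(op_src, op_dst, translate):
    """Count pairs (a, b) where translate(a * b) != translate(a) * translate(b).

    `op_src` and `op_dst` map pairs of elements to elements;
    `translate` maps source elements to destination elements.
    A count of zero means `translate` is a homomorphism for this operation.
    """
    elems = list(translate)
    return sum(
        1
        for a, b in product(elems, repeat=2)
        if translate[op_src[(a, b)]] != op_dst[(translate[a], translate[b])]
    )


# Detailed ontology: integers mod 4 under addition.
add_mod4 = {(a, b): (a + b) % 4 for a in range(4) for b in range(4)}
# Coarse ontology: integers mod 2 under addition.
add_mod2 = {(a, b): (a + b) % 2 for a in range(2) for b in range(2)}

# Parity respects addition, so nothing is warped.
exact = {x: x % 2 for x in range(4)}
# An arbitrary relabeling that does not respect addition.
warped = {0: 0, 1: 1, 2: 1, 3: 0}

if __name__ == "__main__":
    print("warp of exact translation:", warp(add_mod4, add_mod2, exact))
    print("warp of warped translation:", warp(add_mod4, add_mod2, warped))
```

Counting mismatches is only one arbitrary way to score deviation from a homomorphism, which is the difficulty noted earlier in the thread: the choice of metric looks more like a preference than something rationality settles.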
