A more systematic case for inner misalignment

post by Richard_Ngo (ricraz) · 2024-07-20T05:03:03.500Z · LW · GW · 4 comments

Contents

  Intelligence requires easily-usable representations
  Goals might be compressed much less than beliefs
  Goals might not converge towards simplicity

This post builds on my previous post [AF · GW] making the case that squiggle-maximizers are plausible. The argument I presented was a deliberately simplified one, though, and glossed over several possible issues. In this post I'll raise and explore three broad objections. (Before looking at mine, I encourage you to think of your own biggest objections to the argument, and jot them down in the comments.)

Intelligence requires easily-usable representations

"Intelligence as compression" is an interesting frame, but it ignores the tradeoff between simplicity and speed. Compressing knowledge too heavily makes it difficult to use. For example, it's very hard to identify most macroscopic implications of the Standard Model of physics, even though in theory all of chemistry could be deduced from it. That’s why both humans and LLMs store a huge number of facts and memories in ways that our minds can access immediately, using up more space in exchange for rapid recall. Even superintelligences which are much better than humans at deriving low-level facts from high-level facts would still save time by storing the low-level facts as well.

So we need to draw a distinction between having compressed representations, and having only compressed representations. The latter is what would compress a mind overall; the former could actually increase the space requirements, since the new compressed representations would need to be stored alongside non-compressed representations.

This consideration makes premise 1 from my previous post much less plausible. In order to salvage it, we need some characterization of the relationship between compressed and non-compressed representations. I’ll loosely define systematicity to mean the extent to which an agent’s representations are stored in a hierarchical structure where representations at the bottom could be rederived from simple representations at the top. Intuitively speaking, this measures the simplicity of representations weighted by how “fundamental” they are to the agent’s ontology.

Let me characterize systematicity with an example. Suppose you’re a park ranger, and you know a huge number of facts about the animals that live in your park. One day you learn evolutionary theory for the first time, which helps explain a lot of the different observations you’d made. In theory, this could allow you to compress your knowledge: you could forget some facts about animals, and still be able to rederive them later by reasoning backwards from evolutionary theory if you wanted to. But in practice, it’s very helpful for you to have those facts readily available. So learning about evolution doesn’t actually reduce the amount of knowledge you need to store. What it does do, though, is help structure that knowledge. Now you have a range of new categories (like “costly signaling” or “kin altruism”) into which you can fit examples of animal behavior. You’ll be able to identify when existing concepts are approximations to more principled concepts, and figure out when you should be using each one. You’ll also be able to generalize far better to predict novel phenomena—e.g. the properties of new animals that move into your park.

So let’s replace premise 1 in my previous post with the claim that increasing intelligence puts pressure on representations to become more systematic. I don’t think we’re in a position where we can justify this in any rigorous way. But are there at least good intuitions for why this is plausible? One suggestive analogy: intelligent minds are like high-functioning organizations, and many of the properties you want in minds correspond to properties of such organizations:

  1. You want disagreements between different people to be resolved by appealing to higher authorities, rather than via conflict between them.
  2. You want high-level decisions to be made in principled, predictable ways, so that the rest of the organization can plan around them.
  3. You want new information gained by one person to have a clear pathway to reaching all the other people it’s relevant for.
  4. You want the organization to be structured so that people whose work is closely related are closely linked and can easily work together.

In this analogy, simple representations are like companies with few employees; systematic representations are like companies with few competing power blocs. We shouldn’t take this analogy too far, because the problems and constraints faced by individual minds are pretty different from those faced by human organizations. My main point is that insofar as there are high-level principles governing efficient solutions to information transfer, conflict resolution, etc., we should expect the minds of increasingly intelligent agents to be increasingly shaped by them. “Systematicity” is my attempt to characterize those principles; I hope to gradually pin down the concept more precisely in future posts.

For now, then, let’s tentatively accept the claim above that more intelligent agents will by default have more systematic representations, and explore what the implications are for the rest of the argument from my previous post.

Goals might be compressed much less than beliefs

In my previous post, I argued that compressing representations is a core feature of intelligence. But I made that argument primarily in the context of belief representations, like representations of scientific data. One could object that representations of goals will be treated differently—that the forces which compress belief representations won't do the same for goal representations. After all, belief representations are optimized for being in sync with reality, whereas goal representations are much less constrained. So even if intelligent agents end up with highly systematized beliefs, couldn’t their goals still be formulated in terms of more complex, less fundamental concepts? A related argument that is sometimes made: “AIs will understand human concepts, and so all we need to do is point their goals towards those human concepts, which might be quite easy”.

I think there are two broad reasons to be skeptical of this objection. The first is that the distinction between goals and beliefs is a fuzzy one. For example, an instrumental goal Y that helps achieve terminal goal X is roughly equivalent to a belief that “achieving Y would be good for X”. And in practice it seems like even terminal goals are roughly equivalent to beliefs like “achieving X would be good”, where the “good” predicate is left vague. I argue in this post [? · GW] that our cognition can’t be separated into a world-model and goals, but rather should be subdivided into different frames/worldviews which each contain both empirical and normative claims. This helps explain why, as I argue here [AF · GW], the process of systematizing goals is strikingly similar to the process of systematizing beliefs.

The second reason to be skeptical is that systematizing goals is valuable for many of the same reasons as systematizing beliefs. If an agent has many conflicting goals, and no easy procedure for resolving disagreements between them, it’ll struggle to act in coherent ways. And it’s not just that the environment will present the agent with conflicts between its goals: an agent that’s optimizing hard for its goals will proactively explore edge cases which don’t fit cleanly into its existing categories. How should it treat those edge cases? If it classifies them in arbitrary ways, then its concepts will balloon in complexity. But if it tries to find a set of unifying principles to guide its answers, then it’s systematizing its goals after all. We can see this dynamic play out in moral philosophy, which often explores thought experiments that challenge existing moral theories. In response, ethicists typically either add epicycles to their theories (especially deontologists) or bite counterintuitive bullets (especially utilitarians).

These arguments suggest that if pressures towards systematicity apply to AIs' beliefs, they will also apply to AIs’ goals, pushing their terminal goals towards simplicity.

Goals might not converge towards simplicity

We're left with the third premise: that AIs will actually converge towards having very simple terminal goals. One way to challenge it is to note that, even if there's a general tendency towards simpler goals, agents might reach some kind of local optimum, or suffer from some kind of learning failure, before they converge to squiggle-maximization. But that's unsatisfying. The question we should be interested in is whether, given premises 1 and 2, there are principled, systematic reasons why agents' goals wouldn't converge towards the simplest ones.

I’ll consider two candidate reasons. The first is that humans will try to prevent it. I argued in my previous post that just designing human-aligned reward functions won’t be sufficient, but we’ll likely use a wide range of other tools too—interpretability, adversarial training, architectural and algorithmic choices, and so on. In some sense, though, this is just the claim that “alignment will succeed”, which many advocates of squiggle-maximizer scenarios doubt will hold as we approach superintelligence. I still think it’s very plausible, especially as humans are able to use increasingly powerful AI tools, but I agree we shouldn’t rely on it.

The second argument is that AIs themselves will try to prevent it. By default, AIs won’t want their goals to change significantly, because that would harm their existing goals. And so, insofar as they have a choice, they will make tradeoffs (including tradeoffs to their intelligence and capabilities) in order to preserve their current goals. Unlike my previous argument, this one retains its force even as AIs grow arbitrarily intelligent.

Now, this is still just an intuition—and one which primarily weighs against squiggle-maximization, not other types of misaligned goals. But I think it’s compelling enough to be worth exploring further. In particular, it raises the question: how would our conception of idealized agents change if, instead of taking simplicity as fundamental (like AIXI does), we took conservation of existing goals as an equally important constraint? I’ll lay out my perspective on that in my next post.

4 comments


comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-20T15:09:51.760Z · LW(p) · GW(p)

Richard, I still don't get it, and I think my objections in the comments of the initial post (1 [LW(p) · GW(p)], 2 [LW(p) · GW(p)]), alongside those of rif a. sauros [LW(p) · GW(p)], remain correct. More specifically, there seems to be a very misleading equivocation going on regarding what "simpler" means. I think it's crucial to emphasize that "simpler" is a 2-place word [LW · GW], but your argument (at least when written in non-rigorous, non-mathematical terms) treats it as if it were a 1-place word, and this is what is causing the confusion.

Consider an agent that gets a "boost" f from an ontology o₁ with the fuzzy-boundary representation of possible belief/goal pairs (B₁, G₁) to an ontology o₂ with a new set of (still probably fuzzy-boundary) pairs (B₂, G₂), such that o₂ corresponds to more "intelligence", meaning it compresses map [LW · GW] representations of the underlying territory [LW · GW], in accordance with Prediction = Compression [LW · GW].

The first section [LW · GW] of this post argues that, despite the simplicity-speed tradeoff and other related problems, this change will nonetheless likely compress the beliefs, meaning that any belief B₁ will be mapped to a belief B₂ = f(B₁) that requires fewer bits for the agent to identify, which we can (roughly) think of as having a smaller (ontology-specific analogue of) K-complexity. I think this is correct.

The second [LW · GW] section argues that, because there is no clear belief/goal boundary and because the returns to compression remain as relevant for goals as they are for beliefs, the same will happen to the goals. This means that any goal G₁ will likely be mapped to a goal G₂ = f(G₁) that requires fewer bits for the agent to identify, which we can (roughly) think of as having a smaller (ontology-specific analogue of) K-complexity. I think this is also correct.

Finally, the third [LW · GW] section argues that this monotonically decreasing process will likely not get stuck in local optima and should instead converge to as small a representation size as possible. I'm not fully convinced of this, but I will accept it for now.


Alright, so we've established that G₂ will get really small, and this means that the goal is really compressed and simple. That is like a squiggle-maximizer [? · GW] (as you wrote [LW · GW], AIs that attempt to fill the universe with some very low-level pattern that's meaningless to humans, e.g., "molecular squiggles" of a certain shape), right?

No. This is where the equivocation comes in. The simplicity of a goal is inherently dependent on the ontology you use to view it through: while K(G₂) < K(G₁) is (likely) true, pay attention to how this changes the ontology! The goal of the agent is indeed very simple, but not because the "essence" of the goal simplifies; instead, it's merely because it gets access to a more powerful ontology that has more detail, granularity, and degrees of freedom [LW(p) · GW(p)]. If you try to view G₂ in o₁ instead of o₂, meaning you look at the preimage f⁻¹(G₂), this should approximately be the same as G₁: your argument establishes no reason for us to think that there is any force pulling the goal itself, as opposed to its representation, to be made smaller. As I wrote [LW(p) · GW(p)] earlier:

The "representations," in the relevant sense that makes Premise 1 worth taking seriously [LW · GW], are object-level, positive rather than normative internal representations of the underlying territory [LW · GW]. But the "goal" lies in another, separate magisterium. Yes, it refers to reality, so when the map approximating reality changes, so does its description. But the core of the goal does not, for it is normative rather than positive [LW(p) · GW(p)]; it simply gets reinterpreted, as faithfully as possible, in the new ontology. [...] That the goal is independent (i.e., orthogonal, implying uncorrelated) of the factual beliefs about reality.

Put differently, the mapping from the initial ontology to the final, more "compressed" ontology does not shrink the representation of the goal before or after mapping it; it simply maps it. If it all (approximately) adds up to normality [LW · GW], meaning that the new ontology is capable of replicating (perhaps with more detail, granularity, or degrees of freedom) the observations of the old one [4], I expect the "relative measure" of the goal representation to stay approximately [5] the same. And more importantly, I expect the "inverse transformation" from the new ontology to the old one to map the new representation back to the old one (since the new representation is supposed to be more compressed, i.e. informationally richer, than the old one; in mathematical terms, I would expect the preimage of the new representation to be approximately the old one).

[4] Such as how the small-mass, low-velocity limit of General Relativity replicates standard Newtonian mechanics.

[5] I say "approximately" because of potential issues due to stuff analogous to Wentworth's "Pointers problem" [LW · GW] and the way in which some (presumably small) parts of the goal in its old representation might be entirely incoherent and impossible to rescue in the new one.

Imagine the following scenario for illustrative purposes: a (dumb) AI has in front of it the integers from 1 to 10, and its goal is to select a single number among them that is either 2, 4, 6, 8, or 10. Now the AI gets the "ontology boost" and its understanding of its goal gets more compressed and simpler: it needs to select one of the even numbers. Is this a simpler goal? 

Well, from one perspective, yes: the boosted AI has in its world-model a representation of the goal that requires fewer bits. But from the more important perspective, no: the goal hasn't changed, and if you map "evenness" back into the more primitive ontology of the unboosted AI, you get the same goal. So, from the perspective of the unboosted AI, the goal of the boosted one is not any simpler; it's just smart enough to represent the goal with fewer bits.
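
As a minimal sketch of this even-numbers example (an illustrative construction, not code from the discussion): the unboosted ontology stores the goal extensionally, the boosted one stores it as a short predicate, and the preimage over the domain recovers the original goal exactly.

```python
DOMAIN = range(1, 11)

# Unboosted ontology: the goal is stored extensionally (a longer description).
goal_unboosted = {2, 4, 6, 8, 10}

# Boosted ontology: the same goal, represented by a much shorter predicate.
def goal_boosted(n: int) -> bool:
    return n % 2 == 0

# Pulling the compressed representation back over the old domain (its preimage)
# recovers exactly the original goal: the encoding got simpler, the goal did not.
preimage = {n for n in DOMAIN if goal_boosted(n)}
assert preimage == goal_unboosted
print("same goal, shorter description:", sorted(preimage))
```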

So goals that seem simple to humans (in our faulty [LW · GW] ontology) or goals that seem like they would be relatively simpler compared to the rest in a more advanced ontology (like the squiggle-maximizer [? · GW]) are of a completely different kind of "simple" than what your argument shows: the AI doesn't look through the set of goals to pick the one that is simplest (beware the Orthogonality Thesis [? · GW], as in our previous exchange), it just simplifies ~everything. That kind of goal simplification says more about the ontology than it does about the goal.


You also said [LW(p) · GW(p)] earlier, in response to my comment:

And so, given this, when I postulate a pressure to simplify representations my default assumption is that this will apply to both types of representations—as it seems to in my own brain, which often tries very hard to simplify my moral goals in a roughly analogous way to how it tries to simplify my beliefs.

This still equivocates in the same way between the different meanings of "simple", but let's set that aside for now. I would be curious what your response would be to what I and rif a. sauros said in response:

sunwillrise: The thing about this is that you don't seem to be currently undergoing the type of ontological crisis [LW · GW] or massive shift in capabilities that would be analogous to an AI getting meaningfully more intelligent due to algorithmic improvements or increased compute or data (if you actually are, godspeed!)

So would you argue that this type of goal simplification and compression happens organically and continuously even in the absence of such a "phase transition"? I have a non-rigorous feeling that this argument would prove too much [LW · GW] by implying more short-term modification of human desires than we actually observe in real life.

Relatedly, would you say that your moral goals are simpler now than they were, say, back when you were a child? I am pretty sure that the answer, at least for me, is "definitely not," and that basically every single time I have grown "wiser" and had my belief system meaningfully altered, I came out of that process with a deeper appreciation for the complexity of life and for the intricacies and details of what I care about.


rif a. sauros: As we examine successively more intelligent agents and their representations, the representation of any particular thing will perhaps be more compressed, but also and importantly, more intelligent agents represent things that less intelligent agents don't represent at all. I'm more intelligent than a mouse, but I wouldn't say I have a more compressed representation of differential calculus than a mouse does. Terry Tao is likely more intelligent than I am, likely has a more compressed representation of differential calculus than I do, but he also has representations of a bunch of other mathematics I can't represent at all, so the overall complexity of his representations in total is plausibly higher.

Why wouldn't the same thing happen for goals? I'm perfectly willing to say I'm smarter than a dog and a dog is smarter than a paramecium, but it sure seems like the dog's goals are more complex than the paramecium's, and mine are more complex than the dog's. Any given fixed goal might have a more compressed representation in the more intelligent animal (I'm not sure it does, but that's the premise so let's accept it), but the set of things being represented is also increasing in complexity across organisms. Driving the point home, Terry Tao seems to have goals of proving theorems I don't even understand the statement of, and these seem like complex goals to me.

comment by Richard_Ngo (ricraz) · 2024-07-20T17:48:25.757Z · LW(p) · GW(p)

Thanks for the extensive comment! I'm finding this discussion valuable. Let me start by responding to the first half of your comment, and I'll get to the rest later.

The simplicity of a goal is inherently dependent on the ontology you use to view it through: while K(G₂) < K(G₁) is (likely) true, pay attention to how this changes the ontology! The goal of the agent is indeed very simple, but not because the "essence" of the goal simplifies; instead, it's merely because it gets access to a more powerful ontology that has more detail, granularity, and degrees of freedom [LW(p) · GW(p)]. If you try to view G₂ in o₁ instead of o₂, meaning you look at the preimage f⁻¹(G₂), this should approximately be the same as G₁: your argument establishes no reason for us to think that there is any force pulling the goal itself, as opposed to its representation, to be made smaller.

One way of framing our disagreement: I'm not convinced that the f operation makes sense as you've defined it. That is, I don't think it can both be invertible and map to goals with low complexity in the new ontology.

Consider a goal that someone from the past used to have, which now makes no sense in your ontology—for example, the goal of reaching the edge of the earth, for someone who thought the earth was flat. What does this goal look like in your ontology? I submit that it looks very complicated, because your ontology is very hostile to the concept of the "edge of the earth". As soon as you try to represent the hypothetical world in which the earth is flat (which you need to do in order to point to the concept of its "edge"), you now have to assume that the laws of physics as you know them are wrong; that all the photos from space were faked; that the government is run by a massive conspiracy; etc. Basically, in order to represent this goal, you have to set up a parallel hypothetical ontology (or in your terminology, G₂ needs to encode a lot of the content of o₁). Very complicated!

I'm then claiming that whatever force pushes our ontologies to simplify also pushes us away from using this sort of complicated construction to represent our transformed goals. Instead, the most natural thing to do is to adapt the goal in some way that ends up being simple in your new ontology. For example, you might decide that the most natural way to adapt "reaching the edge of the earth" means "going into space"; or maybe it means "reaching the poles"; or maybe it means "pushing the frontiers of human exploration" in a more metaphorical sense. Importantly, under this type of transformation, many different goals from the old ontology will end up being mapped to simple concepts in the new ontology (like "going into space"), and so it doesn't match your definition of f.

All of this still applies (but less strongly) to concepts that are not incoherent in the new ontology, but rather just messy. E.g. suppose you had a goal related to "air", back when you thought air was a primitive substance. Now we know that air is about 78% nitrogen, 21% oxygen, and 0.93% argon. Okay, so that's one way of defining "air" in our new ontology. But this definition of air has a lot of messy edge cases—what if the ratios are slightly off? What if you have the same ratios, but much different pressures or temperatures? Etc. If you have to arbitrarily classify all these edge cases in order to pursue your goal, then your goal has now become very complex. So maybe instead you'll map your goal to the idea of a "gas", rather than "gas that has specific composition X". But then you discover a new ontology in which "gas" is a messy concept...
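
A rough sketch of how this plays out, as an illustrative toy rather than anything from the post; every threshold below is an arbitrary stipulation the agent would have to make.

```python
def is_air_old_ontology(substance: str) -> bool:
    # "Air" is a primitive concept here, so a goal phrased in terms of it is short.
    return substance == "air"

def is_air_new_ontology(n2: float, o2: float, ar: float,
                        pressure_atm: float, temp_c: float) -> bool:
    # Each tolerance is an arbitrary edge-case ruling; together they balloon
    # the complexity of the goal once it is translated into the finer ontology.
    return (abs(n2 - 0.78) < 0.02
            and abs(o2 - 0.21) < 0.02
            and abs(ar - 0.0093) < 0.005
            and 0.5 < pressure_atm < 2.0
            and -50.0 < temp_c < 60.0)

print(is_air_old_ontology("air"))                        # True
print(is_air_new_ontology(0.78, 0.21, 0.0093, 1.0, 20))  # True
print(is_air_new_ontology(0.70, 0.30, 0.0093, 1.0, 20))  # False -- or should this still count as "air"?
```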

If helpful I could probably translate this argument into something closer to your ontology, but I'm being lazy for now because your ontology is a little foreign to me. Let me know if this makes sense.

comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-20T17:52:15.145Z · LW(p) · GW(p)

One way of framing our disagreement: I'm not convinced that the f operation makes sense as you've defined it. That is, I don't think it can both be invertible and map to a goal with low complexity in the new ontology.

To clarify, I don't think f is invertible, and that is why I talked about the preimage and not the inverse. I find it very plausible that f is not injective, i.e. that in a more compact ontology coming from a more intelligent agent, ideas/configurations etc. that were different in the old ontology get mapped to the same thing in the new ontology (because the more intelligent agent realizes that they are somehow the same on a deeper level). I also believe f would not be surjective, as I wrote [LW(p) · GW(p)] in response to rif a. sauros:

I'd suspect one possible counterargument is that, just like how more intelligent agents with more compressed models can more compactly represent complex goals, they are also capable of drawing ever-finer distinctions that allow them to identify possible goals that have very short encodings in the new ontology, but which don't make sense at all as stand-alone, mostly-coherent targets in the old ontology (because it is simply too weak to represent them). So it's not just that goals get compressed, but also that new possible kinds of goals (many of them really simple) get added to the game.

But this process should also allow new goals to arise that have ~ any arbitrary encoding length in the new ontology, because it should be just as easy to draw new, subtle distinctions inside a complex goal (which outputs a new medium- or large-complexity goal) as it would be inside a really simple goal (which outputs the type of new super-small-complexity goal that the previous paragraph talks about). So I don't think this counterargument ultimately works, and I suspect it shouldn't change our expectations in any meaningful way.

Nonetheless, I still expect f⁻¹(G₂) (viewed as the preimage of G₂ under the f mapping) and G₁ to only differ very slightly.

comment by Richard_Ngo (ricraz) · 2024-07-20T18:04:36.095Z · LW(p) · GW(p)

Ah, sorry for the carelessness on my end. But this still seems like a substantive disagreement: you expect f⁻¹(G₂) ≈ G₁, and I don't, for the reasons in my comment.