Ultimate ends may be easily hidable behind convergent subgoals

post by TsviBT · 2023-04-02T14:51:23.245Z · LW · GW · 4 comments

Contents

  What can you tell about an agent's ultimate intent by its behavior?
    Terms
    Inferring supergoals through subgoals
    Messiness in relations between goals
  Factors obscuring supergoals
    Fungibility
      Terms
      Examples
      Effects on goal structure
    Canonicity
      Effects on goal structure
    Instrumental convergence
      Fungibility spirals
      Existence
      Effects on goal structure
    The inspection paradox for subgoals
  Hidden ultimate ends
    Are convergent instrumental goals near-universal for possibilizing?
    Non-adaptive goal hiding
    Adaptive goal hiding

[Metadata: crossposted from https://tsvibt.blogspot.com/2022/12/ultimate-ends-may-be-easily-hidable.html. First completed December 18, 2022.]

Thought and action in pursuit of convergent instrumental subgoals do not automatically reveal why those subgoals are being pursued--towards what supergoals--because many other agents with different supergoals would also pursue those subgoals, maybe with overlapping thought and action. In particular, an agent's ultimate ends don't have to be revealed by its pursuit of convergent subgoals. It might therefore be easy to covertly pursue some ultimate goal by mostly pursuing generally useful subgoals of other supergoals. By the inspection paradox for the convergence of subgoals, it might be easy to think and act almost comprehensively like a non-threatening agent would think and act, while going most of the way towards achieving some other more ambitious goal.

Note: the summary above is the basic idea. The rest of the essay analyzes the idea in a lot of detail. The final main section might be the most interesting.

What can you tell about an agent's ultimate intent by its behavior?

An agent's ultimate intent is what the agent would do if it had unlimited ability to influence the world. What can we tell about an agent's ultimate intent by watching the external actions it takes, whether low-level (e.g. muscle movements) or higher-level (e.g. going to the store), and by watching its thinking (e.g. which numbers it's multiplying, which questions it's asking, which web searches it's running, which concepts are active)?

Terms

Inferring supergoals through subgoals

Suppose that G is an intermediate goal for some agent. By observing more fully the constellation of action and thought the agent does in pursuit of the goal G, we become more confident that the agent is pursuing G. That is, observing the agent pursuing subgoals that constitute much of a sufficient strategy for G is evidence that the agent has the goal G. If we observe that the farmer turns the earth, broadcasts the seed, removes the weeds, sprays for pests, and fences out wild animals, we become more and more sure that ze is trying to grow crops.

But, we don't necessarily know what ze intends to do with the crops, e.g. eat them or sell them. Information doesn't necessarily flow from subgoals up through the fact that the farmer is trying to grow crops, to indicate supergoals; the possible supergoals may be screened off from the observed subgoals. In that case, we're uncertain which of the supergoals in the motivator set of [grow crops] is the one held by the farmer.
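
To make the screening-off concrete, here's a toy Bayesian sketch (the numbers and the network structure are purely illustrative): the supergoal influences the observations only through [grow crops], so observing subgoal pursuit raises P(grow crops) while leaving the posterior over supergoals where it started.

```python
# Toy Bayes net: supergoal -> [grow crops] -> observed subgoal pursuit.
p_supergoal = {"eat": 0.5, "sell": 0.5}          # prior over supergoals
p_grow_given_super = {"eat": 0.9, "sell": 0.9}   # both supergoals motivate growing crops
p_obs_given_grow = {True: 0.8, False: 0.05}      # P(observe weeding, fencing, ... | grow?)

# Joint over (supergoal, grow), conditioned on the observations having occurred.
posterior = {}
for s, ps in p_supergoal.items():
    for grow in (True, False):
        pg = p_grow_given_super[s] if grow else 1 - p_grow_given_super[s]
        posterior[(s, grow)] = ps * pg * p_obs_given_grow[grow]
z = sum(posterior.values())
posterior = {k: v / z for k, v in posterior.items()}

p_grow = sum(v for (s, grow), v in posterior.items() if grow)
p_eat = sum(v for (s, grow), v in posterior.items() if s == "eat")
print(f"P(grow crops | observations) = {p_grow:.2f}")  # ~0.99, up from the 0.90 prior
print(f"P(eat | observations)        = {p_eat:.2f}")   # 0.50, unchanged: screened off
```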

Suppose an agent is behaving in a way that looks to us like it's pursuing some instantiation g of a goal-state G. (We always abstract somewhat from g to G, e.g. we don't say "the farmer has a goal of growing rhubarb with exactly the lengths [14.32 inches, 15.11 inches, 15.03 inches, ...] in Maine in 2022 using a green tractor while singing sea shanties", even if that's what g is.) Some ways to infer the supergoals of the agent by observing its pursuit of G:

These points leave out how to infer goals from observed behavior, except through observed pursuit of subgoals. This essay takes for granted that some goals can be inferred from [behavior other than subgoal pursuit], e.g. by observing the result of the behavior, by predicting the result of the behavior through simulation, by analogy with other similar behavior, or by gemini modeling the goal.

Messiness in relations between goals

Factors obscuring supergoals

Fungibility

Note: the two following subsections, Terms and Examples, aren't really needed to read what comes after; their purpose is to clarify the idea of fungibility. Maybe read the third subsection, "Effects on goal structure", and then backtrack if you want more on fungibility.

Terms

If G is use-fungible, then it is effectively-use-fungible: a given strategy using g probably already works with g′ instead, by use-fungibility.

If G is state-fungible, then it is effectively-use-fungible: given a working strategy S using g, there's a strategy that, given g′, first easily produces g from g′, which is doable by state-fungibility, and then follows S.
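
A minimal sketch of that second implication, with hypothetical type names, just restating the argument as code: a cheap converter from g′ to g, composed with a strategy written against g, gives a strategy that works starting from g′.

```python
from typing import Callable

# Hypothetical types: an "instantiation" is whatever concrete state realizes the
# goal-state G; an "outcome" is whatever the supergoal strategy produces.
Instantiation = object
Outcome = object

def effectively_use_fungible(
    strategy: Callable[[Instantiation], Outcome],       # working strategy S that uses g
    convert: Callable[[Instantiation], Instantiation],  # cheap production of g from g′
) -> Callable[[Instantiation], Outcome]:
    """State-fungibility gives effective-use-fungibility: convert g′ to g, then follow S."""
    def strategy_from_g_prime(g_prime: Instantiation) -> Outcome:
        g = convert(g_prime)   # easy by state-fungibility
        return strategy(g)     # then follow the original strategy
    return strategy_from_g_prime
```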

Examples

Effects on goal structure

Canonicity

In "The unreasonable effectiveness of mathematics in the natural sciences", Wigner discusses the empirical phenomenon that theories in physics are expressed using ideas that were formed by playing around with ideas and selected such that they are apt for demonstrating a sense of formal beauty and ingenious skill at manipulating ideas.

Some of this empirical phenomenon could be explained by abstract mathematical concepts being canonical. That is, there's in some sense only one form, or very few forms, that this concept can take. Then, when a concept is discovered once by mathematicians, it is discovered in roughly the same form as will be useful later in another context. Compare Thingness.

Canonicity is not the same thing as simplicity. Quoting Wigner:

It is not true, however, as is so often stated, that this had to happen because mathematics uses the simplest possible concepts and these were bound to occur in any formalism. As we saw before, the concepts of mathematics are not chosen for their conceptual simplicity - even sequences of pairs of numbers are far from being the simplest concepts - but for their amenability to clever manipulations and to striking, brilliant arguments. Let us not forget that the Hilbert space of quantum mechanics is the complex Hilbert space, with a Hermitean scalar product. Surely to the unpreoccupied mind, complex numbers are far from natural or simple and they cannot be suggested by physical observations. Furthermore, the use of complex numbers is in this case not a calculational trick of applied mathematics but comes close to being a necessity in the formulation of the laws of quantum mechanics.

However, it might be that canonicity is something like, maximal simplicity within the constraint of the task at hand. Quoting Einstein:

It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.

The mathematician seeks contexts where the simplest concepts adequate to the tasks of the context are themselves interesting. Repeatedly seeking in that way builds up a library of canonical concepts, which are applied in new contexts.

Extreme canonicity is far from necessary. It does not always hold, even in abstract mathematics (and might not ever hold). For example, quoting:

Our colleague Peter Gates has called category theory “a primordial ooze”, because so much of it can be defined in terms of other parts of it. There is nowhere to rightly call the beginning, because that beginning can be defined in terms of something else.

The same concept might be redescribed in a number of different ways; there's no single canonical definition.

Canonicity could be viewed as a sort of extreme of fungibility, as if there were only one piece of gold in the cosmos, so that having a piece of gold is trivially fungible with all ways of having a piece of gold. All ways of comprehending a concept are close to fungible, since any definition can be used to reconstruct other ways of understanding or using the concept. (This is far from trivial in individual humans, but I think it holds fairly well among larger communities.)

Compare also Christopher Alexander's notion of a pattern and a pattern language.

Effects on goal structure

Instrumental convergence

A convergent instrumental goal is a goal that would be pursued by many different agents in service of many different supergoals or ultimate ends. See Arbital "Instrumental convergence" and a list of examples of plausible convergent instrumental strategies.

Or, as a redefinition: a convergent instrumental goal is a goal with a large motivator set. ("Large" means, vaguely, "high measure", e.g. many goals held by agents that are common throughout the multiverse according to some measure.) Taking this definition literally, a goal-state G might be "instrumentally convergent" by just being a really large set. For example, if G is [do anything involving the number 3, or involving electricity, or involving the logical AND operation, or involving a digital code, or involving physical matter], then a huge range of goals have some instantiation of G as a subgoal. This is silly. Really what we mean is to include some notion of naturality, so that different instantiations of G are legitimately comparable. Fungibility is some aspect of natural, well-formed goals: any instantiation of G should be about as useful for supergoals as any other instantiation of G.
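
As a toy illustration of the "large motivator set" redefinition (the supergoals and strategies below are made up, and naturality / fungibility is ignored), count for each subgoal how many distinct supergoals it serves:

```python
from collections import defaultdict

# Hypothetical toy planning table: each supergoal maps to the subgoals that
# appear in some sufficient strategy for it.
strategies = {
    "grow crops":     ["acquire land", "acquire energy", "till soil"],
    "build factory":  ["acquire land", "acquire energy", "hire workers"],
    "run datacenter": ["acquire land", "acquire energy", "buy chips"],
    "write a novel":  ["acquire energy", "learn to write"],
}

# Motivator set of a subgoal: the supergoals it serves.
motivators = defaultdict(set)
for supergoal, subgoals in strategies.items():
    for sub in subgoals:
        motivators[sub].add(supergoal)

# "Convergent instrumental" subgoals are the ones with large motivator sets.
for sub, supers in sorted(motivators.items(), key=lambda kv: -len(kv[1])):
    print(f"{sub:15s} motivator set size = {len(supers)}")
```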

Canonicity might be well-described as a combination of two properties: well-formedness (fungibility, Thingness), and instrumental convergence.

Fungibility spirals

A goal-state G is enfungible if agents can feasibly make G more fungible.

Suppose G is somewhat convergent and somewhat enfungible, even if it's not otherwise very fungible / natural. Then agents with goals in the motivator set of G will put instrumental value on making G more fungible. That is, such agents would find it useful (if not maximally useful) to enfunge G, to make it more possible to use different instantiations of G towards more supergoals, because for example it might be that one instantiation g is best suited to an agent's supergoals, but another instantiation g′ is easiest to obtain.

If G is enfunged, that increases the pool of agents who might want to further enfunge G. For example, suppose agents A1, A2, A3 have goals g1, g2, g3, and suppose that the gi aren't currently fungible, and that gi is easier to obtain than gj when i < j. First A3 works to make g2 fungible into g3. Once that happens, now both A2 and A3 want to make g1 fungible into g2. And so on.

In this way, there could be a fungibility spiral, where agents (which might be "the same" agent at different times or branches of possibility) have instrumental motivation to make G very fungible. An example is the history of humans working with energy. By now we have the technology to efficiently store, transport, apply, and convert between many varieties of energy (each of which capabilities is an implementation of fungibility). As another example, consider the work done to make computational tasks runnable on different computing hardware.
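
A small sketch of the spiral under toy assumptions (agents Ai wanting goals gi, with g1 the easiest to obtain directly, and converters built from the hard end toward the easy end; all names hypothetical): each enfunging step enlarges the pool of agents with a stake in the next one.

```python
# Toy fungibility spiral. Agent Ai wants gi; g1 is cheapest to obtain directly.
n = 5
reachable_from = {i: {i} for i in range(1, n + 1)}  # goals obtainable starting from gi

# Converters get built from the hard end toward the easy end: first g4 -> g5,
# then g3 -> g4, and so on. Before each converter is built, count the agents
# who would profit from it: everyone whose goal is already reachable from the
# converter's target.
for src in range(n - 1, 0, -1):
    beneficiaries = [f"g{j}" for j in sorted(reachable_from[src + 1])]
    print(f"enfunging g{src} -> g{src + 1}: agents wanting {beneficiaries} all gain")
    reachable_from[src] = {src} | reachable_from[src + 1]
```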

Existence

Evidence for the existence of convergent instrumental goals:

Effects on goal structure

The inspection paradox for subgoals

The Friendship paradox: "most people have fewer friends than their friends have, on average". Relatedly, the inspection paradox:

Imagine that the buses in Poissonville arrive independently at random (a Poisson process) with, on average, one bus every six minutes. Imagine that passengers turn up at bus-stops at a uniform rate, and are scooped up by the bus without delay, so the interval between two buses remains constant. Buses that follow gaps bigger than six minutes become overcrowded. The passengers' representative complains that two-thirds of all passengers found themselves on overcrowded buses. The bus operator claims, "no, no--only one third of our buses are overcrowded." Can both these claims be true?
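
Both claims can indeed be true. A quick Monte Carlo sketch (mine, not MacKay's) shows them holding at once: gaps in a Poisson process are exponentially distributed, buses are counted per gap, and passengers are counted in proportion to gap length.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_gap = 6.0                                     # average minutes between buses
gaps = rng.exponential(mean_gap, size=1_000_000)   # Poisson-process inter-bus gaps

overcrowded = gaps > mean_gap                      # a bus following a gap > 6 min is overcrowded

# Bus operator's view: fraction of buses that are overcrowded.
bus_fraction = overcrowded.mean()

# Passengers' view: passengers arrive uniformly in time, so each bus carries
# a number of passengers proportional to the gap it follows.
passenger_fraction = gaps[overcrowded].sum() / gaps.sum()

print(f"fraction of buses overcrowded:        {bus_fraction:.2f}")        # ~0.37, about a third
print(f"fraction of passengers on such buses: {passenger_fraction:.2f}")  # ~0.74, about two thirds
```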

Analogously:

An agent's subgoals tend to be especially instrumentally convergent.

Very instrumentally convergent goals are held by more agents than non-convergent goals, so, just like a passenger who's usually subjected to an overcrowded bus, an agent holding some subgoal is usually holding an overcrowded subgoal. In particular:

Different agents tend to have many overlapping convergently instrumental goals.

Hidden ultimate ends

[I got tired of writing this essay, so this section is less thoroughly thought out / expressed / distilled. It's probably the most important section though. Feel free to ask questions at my gmail address: tsvibtcontact]

Suppose we're observing a mind thinking and acting. What can we tell about its ultimate goals?

Are convergent instrumental goals near-universal for possibilizing?

Possibilizing is making things possible, in contrast to actually doing things. This might not be a principled distinction, since whatever is called possibilizing is surely also actualizing something, but still there's some intuitive spectrum. E.g. speed cubers first look at the Rubik's cube, then put it down, and then pick it up and turn its faces. Before they pick it up the second time, in their head they're possibilizing solving the cube by figuring out the sequence of moves. When they pick it up, they actualize solving the cube. An engineer designing a machine possibilizes the work the machine will do; building the machine is midway between possibilizing and actualizing the work; applying the machine actualizes the work. See possibilizing vs. actualizing.

Question:

To what extent are most goals mostly possibilizable using only highly convergent subgoals?

In other words: given some random supergoal G, how close can you get to achieving G--how little additional work can you leave undone--if you're only allowed to behave in ways that many agents would behave? I don't know. I suspect you can get pretty close. Some intuitions behind that suspicion:

Non-adaptive goal hiding

Suppose the mind is not adapted in any way to stop us from understanding the mind's ultimate goals. Can we determine what the mind's ultimate goals are? Some obstacles:

Adaptive goal hiding

Suppose an agent adapts in some way so that it isn't visibly pursuing goals that an overseer wouldn't want it to pursue. The adaptation could be passive, e.g. we apply a training loss to thoughts or actions that score highly on some detector of having certain goals, or it could be active, e.g. the agent itself is searching out ways to arrange itself so that we don't interfere with its goals.
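
As a minimal sketch of the passive version (the network shapes, the detector, and the loss weight below are all hypothetical stand-ins, not a real training setup): the task loss gets a penalty term from a fixed goal detector run on the agent's internal activations, which selects against thoughts that look like pursuit of the flagged goal, rather than against the goal itself.

```python
import torch
import torch.nn as nn

# Hypothetical policy network and a fixed, pretrained-in-spirit goal detector.
policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
goal_detector = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
for p in goal_detector.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
lam = 0.5                      # weight on the goal-detection penalty (arbitrary)

obs = torch.randn(16, 32)      # stand-in observations
target = torch.randn(16, 8)    # stand-in task targets

optimizer.zero_grad()
activations = policy[:-1](obs)             # "thoughts": hidden activations
actions = policy[-1](activations)
task_loss = nn.functional.mse_loss(actions, target)
detector_score = goal_detector(activations).mean()  # how "goal-y" the thoughts look

# Gradient pressure to perform the task while not *looking* like it has the goal.
loss = task_loss + lam * detector_score
loss.backward()
optimizer.step()
```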

4 comments

Comments sorted by top scores.

comment by Roman Leventov · 2023-04-04T01:58:17.230Z · LW(p) · GW(p)

In popular perception, this is how China is acting in international relations: it mostly pursues instrumentally convergent goals like increasing its influence, accumulating resources, etc., while just holding its strategic goals "in the background", without explicitly "rushing" to achieve them right now.

This reminds me of Sun Tzu's saying, “If you wait by the river long enough, the bodies of your enemies will float by.”

Does such a strategy count as misalignment? If the beliefs held by the agent are compatible[1] with the overseer's beliefs, I don't think so. The agent's understanding of the world could be deeper, and therefore its cognitive horizon and the "light cone" of its agency and concern (see Levin, 2022; Witkowski et al., 2023) farther or deeper than those of the overseer.

Then, the superintelligent agent could evolve its beliefs to the point where they are no longer compatible with the "original" beliefs of the overseer, or even with the overseer's existence. In the latter case, or if the agent fails to convince the overseers of (a simplified version of) its new beliefs, that would constitute misalignment, of course. But this goes out of scope of what is considered in the post and my comment above.

  1. ^

    By "compatible", I mean either coinciding or reducible with minimal problems, like general relativity could be reduced to Newtonian mechanics in certain regimes with negligible numerical divergence. 

Replies from: TsviBT
comment by TsviBT · 2023-04-09T13:41:18.290Z · LW(p) · GW(p)

Thanks, that's a great example!

Does such a strategy count as misalignment?

Yeah, I don't think it necessarily counts as misalignment. In fact, corrigibility probably looks behaviorally a lot like this: gathering ability to affect the world, without making irreversible decisions, and waiting for the overseer to direct how to cash that out into ultimate effects. But the hidability means that "ultimate intents" or "deep intents" are conceptually murky, and therefore it's not obvious how to read them off of an agent--if you can't discern them through behavior, what can you discern them through?

Replies from: Roman Leventov
comment by Roman Leventov · 2023-04-09T20:06:17.194Z · LW(p) · GW(p)

Only if we know the entire learning trajectory of the AI (including the training data) and have high-resolution interpretability mapping along the way. If we don't have this, or if the AI learns online and is not inspected with mech. interp tools during this process, we have no way of knowing of any "deep beliefs" the AI may have, if it doesn't reveal them in its behavior or "thoughts" (explicit representations during inference).

comment by gpt4_summaries · 2023-04-03T08:24:54.214Z · LW(p) · GW(p)

Tentative GPT4's summary. This is part of an experiment. 
Up/Downvote "Overall" if the summary is useful/harmful.
Up/Downvote "Agreement" if the summary is correct/wrong.
If so, please let me know why you think this is harmful. 
(OpenAI doesn't use customers' data anymore for training, and this API account previously opted out of data retention)

TLDR: This article explores the challenges of inferring agent supergoals due to convergent instrumental subgoals and fungibility. It examines goal properties such as canonicity and instrumental convergence and discusses adaptive goal hiding tactics within AI agents.

Arguments:
- Convergent instrumental subgoals often obscure an agent's ultimate ends, making it difficult to infer supergoals.
- Agents may covertly pursue ultimate goals by focusing on generally useful subgoals.
- Goal properties like fungibility, canonicity, and instrumental convergence impact AI alignment.
- The inspection paradox and adaptive goal hiding (e.g., possibilizing vs. actualizing) further complicate the inference of agent supergoals.

Takeaways:
- Inferring agent supergoals is challenging due to convergent subgoals, fungibility, and goal hiding mechanisms.
- A better understanding of goal properties and their interactions with AI alignment is valuable for AI safety research.

Strengths:
- The article provides a detailed analysis of goal-state structures, their intricacies, and their implications on AI alignment.
- It offers concrete examples and illustrations, enhancing understanding of the concepts discussed.

Weaknesses:
- The article's content is dense and may require prior knowledge of AI alignment and related concepts for full comprehension.
- It does not provide explicit suggestions on how these insights on goal-state structures and fungibility could be practically applied for AI safety.

Interactions:
- The content of this article may interact with other AI safety concepts such as value alignment, robustness, transparency, and interpretability in AI systems.
- Insights on goal properties could inform other AI safety research domains.

Factual mistakes:
- The summary does not appear to contain any factual mistakes or hallucinations.

Missing arguments:
- The potential impacts of AI agents pursuing goals not in alignment with human values were not extensively covered.
- The article could have explored in more detail how AI agents might adapt their goals to hide them from oversight without changing their core objectives.