Steering systems

post by Max H (Maxc) · 2023-04-04T00:56:55.407Z · LW · GW · 1 comments


  Preface: who this post is for
  That which has many names
  Applying the term to existing systems
    System: Stockfish in the domain of a chess game
    System: MuZero, in the domain of a Go game
    System: an LLM-based task or problem-solving agent
    System: a human with some instructions and an internet-connected computer
  A sketch of a powerful steering system
    Why is this dangerous?
    Aside: a possible area of disagreement or differing intuitions
  How does alignment help?
1 comment

Preface: who this post is for

This post is mainly targeted at people who are already familiar with the concepts and terminology in alignment, AI x-risk, and particularly the case for high probabilities of doom, but disagree with the arguments or have bounced off of them for one reason or another.

If you've read the 2021 MIRI conversations [? · GW] and remain confused about or disagree with some of the things Nate or Eliezer were saying, this post is for you.

If you understand and agree with everything Nate and Eliezer say in those transcripts already, you probably won't get many new insights from this post, but you may find it useful as a new way of explaining one aspect which (I think) is important to their case.

If you're not familiar with those discussions at all, you may still get value out of this piece, but there is a background assumption of familiarity with some of the concepts discussed there.


This post is my own attempt at conveying some intuitions and concepts which various others have already written about at length [? · GW]. I doubt I'll do a better job than those before me, but it seems worth a shot anyway, because:

In short, I'm going to unpack the concept in this tweet:

into some thought experiments that build an intuition for what "smarter than human" systems might look like, and explain why that intuition makes me think that humanity is on track to build and run powerful systems in ways likely to result in bad outcomes, even if some or all aspects of alignment and governance go unexpectedly (to me) well.

I'm going to introduce a new term, define it loosely, provide some examples, and apply it in the context of current systems and a hypothetical future system that performs tree search over world states.

I'll conclude with some implications about the different potential impacts of current alignment research approaches, depending on how this intuition differs from reality.

That which has many names

It has been called "powerful optimization process", "transformative AI", "superintelligence", and many other things. It has been speculated to have convergent instrumental subgoals [? · GW], deceptiveness, reflectivity, power-seeking [LW · GW], coherence [? · GW], utility maximization, and other dangerous or potentially undesirable properties.

To some, these may seem like questionable assumptions [LW · GW] or burdensome details [LW · GW].

However, the systems which I gesture at in this post need not have any of these properties (though they are permitted to). These properties are better thought of as possible implications that follow from relatively simpler assumptions:

Abstractly, the kind of system I have in mind is one which, given an initial set of actions it can execute and a specification of outcomes, will choose actions that steer towards those outcomes better than a smart human could, given the same initial action set. I'll use a new term, "steering system", rather than existing terms like "optimization process", "AI", "TAI", or "superintelligence", to emphasize that the thing I'm gesturing at is a system that is actually instantiated and run, rather than a model or algorithm in isolation. I'll name the scary thing as those systems which are strictly better than humans at steering in a real-world domain, though not necessarily way better.
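To make this definition slightly more concrete, here is a minimal interface sketch in Python (the names `SteeringSystem`, `OutcomeSpec`, and `GreedySteerer` are my own invented illustrations, not standard terminology):

```python
from typing import Callable, Protocol, Sequence

# An outcome specification: returns True for world states that satisfy it.
OutcomeSpec = Callable[[str], bool]

class SteeringSystem(Protocol):
    """Anything that picks actions to steer the world toward a specification."""
    def choose_action(self, world_state: str, actions: Sequence[str],
                      spec: OutcomeSpec) -> str: ...

class GreedySteerer:
    """A deliberately weak instantiation: one-step lookahead with a world model."""
    def __init__(self, transition: Callable[[str, str], str]):
        self.transition = transition  # world model: (state, action) -> next state

    def choose_action(self, world_state, actions, spec):
        # Pick the first action whose predicted successor satisfies the spec.
        for action in actions:
            if spec(self.transition(world_state, action)):
                return action
        return actions[0]  # no satisfying action found; fall back arbitrarily
```

A steering system in the post's sense is one whose `choose_action` reliably beats a smart human's over the same action set; this toy one obviously does not.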

We can ask of such systems:

The answers to these questions, about different classes of potential steering systems we're likely to build, correspond to different aspects of alignment and governance being easy or hard, and potential for wildly different futures, though none of them appear hopeful to me.

Applying the term to existing systems

What kinds of existing and future systems can be considered steering systems, and how can we answer the questions above?

Various systems comprised of deep neural networks, humans, or GOFAI algorithms are quite good at steering towards particular outcome states when restricted to particular domains.

System: Stockfish in the domain of a chess game

High-level Stockfish is superhuman at steering towards the set of board states where the opponent is checkmated. This is evident both by inspecting its source code and by playing some chess games against it.

The initial action set is the set of all legal chess moves from the initial board state. An interesting question is what happens if you artificially restrict Stockfish from using certain openings, or even moves in certain future board states. Does Stockfish manage to steer around its restrictions and still win the game?

System: MuZero, in the domain of a Go game

MuZero is superhuman at steering towards the set of board states where the AI's area score is higher than the opponent's. This is apparent from playing some Go games against it, and potentially from inspecting its source, though parts of the "source" are actually deep neural networks which are not so easily inspectable or interpretable.

System: an LLM-based task or problem-solving agent

A single text completion by an LLM is (probably) not a steering system in the domain of the real world. For almost any task you imagine, if that task is not constrained to the world of manipulating and outputting text strings, the system does not steer towards outcomes in a nontrivial way.

However, current LLMs, suitably prompted, chained, and given access to external APIs, can be composed into agent-like systems using tools like LangChain. Other examples: ChatGPT with Plugins, and this task-driven agent I saw on Twitter.

Certain prompts (usually user-entered) in these systems encode the outcome specification; others encode the capabilities and methods the system uses to steer towards that specification.

How well do these agents steer? It depends on the task: if the task can be solved purely by manipulating and outputting text, or by making API calls to which the system has been given access, the system may be able to steer quite well - as well as or better than a human at the same tasks, and perhaps faster. Generally, though, these systems are still far below human-level at steering in most meaningful ways.

How easy is it to see what the system is steering towards? Again, this depends on the particulars of the system: what underlying LLMs it is comprised of, how they are prompted, and how they are arranged.

The Open Agency model [LW · GW] suggests that for some instantiations of LLM-based agents, it may be easy to specify what they steer for and understand how they do so.

One important remark about LLM-based steering systems: most of the steering power only emerges at the very last stage of applying the foundation model.

Training GPT-4 took hundreds of engineers and millions of dollars of computing resources at OpenAI. LangChain is maintained by a very small team. And a single developer can write a Python script which glues together chains of OpenAI API calls into a graph. Most of the effort goes into training the LLM, but most of the agency (and most of the useful work) comes from the relatively tiny bit of glue code that puts it all together at the end.
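As a rough sketch of how thin that glue layer can be, here is a minimal plan-then-execute loop (the `call_llm` argument stands in for a real LLM API; `fake_llm` is a stub I invented for illustration, not any real framework's interface):

```python
from typing import Callable, List

def run_agent(task: str, call_llm: Callable[[str], str], max_steps: int = 10) -> List[str]:
    """Naive plan-then-execute loop: one LLM call to plan, one call per step."""
    plan = call_llm(f"List the steps to accomplish: {task}").splitlines()
    results = []
    for step in plan[:max_steps]:
        # In a real agent, the completion here might be parsed into an
        # external API call (browser, shell, HTTP) and executed.
        results.append(call_llm(f"Carry out this step: {step}"))
    return results

# Stub LLM for demonstration; a real system would call a hosted model here.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("List the steps"):
        return "step 1\nstep 2"
    return f"done: {prompt.split(': ', 1)[1]}"

results = run_agent("order groceries", fake_llm)
```

The entire "agent" is a dozen lines; all of the steering power lives behind `call_llm`.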

System: a human with some instructions and an internet-connected computer

Humans are good at steering across a variety of timescales and domains. A smart human can use a computer to have a large impact on the world, even if they start with very few resources. They can take a remote job as a programmer and then direct other humans to do things in the physical world using the money they make from that job.

Suppose a human in a room is given an initial set of instructions which represent a specification on outcomes to steer towards. Perhaps the instructions say "arrange for grocery delivery to the following address..." or "construct a house at the following address..." or "eradicate malaria worldwide". The human need not carry out the specified task themselves, they can (for example) take a remote job as a programmer and then use their salary to hire others to do the specified task or otherwise arrange to have it done.

The human might choose to interpret the instructions differently than the instruction-writer intended, or simply ignore them altogether according to their own whims. But a suitably incentivized human certainly seems capable of steering towards many of the possible outcome specifications described by the instructions.

How well this system steers depends on:

Note that for each of the other systems described in this post (Stockfish, MuZero, LLM agents), the human-based system is capable of steering at least as well as the AI-based system alone: given instructions to win a chess or Go game, a human who is not themselves good at chess or Go can use Stockfish or MuZero as a tool. (Stockfish is freely available on the internet; MuZero or a similarly good Go-playing system may require some effort to obtain access to.)

A sketch of a powerful steering system

This section is meant to sketch a hypothetical system which is composed of individual pieces which are not steering systems on their own, but when glued together in a particular straightforward way, are a steering system.

It's not necessarily a good design for a real system, but it is meant to illustrate a possible analogue, using future powerful "foundation models", of the kinds of things people do with LLMs and LangChain today. The architecture is loosely inspired by the kinds of architectures in MuZero and Dreamer, but over a real-world domain instead of Go, Minecraft, or Atari games.

A one sentence description of the system is:

Perform tree search over world states, given an initial world state, initial action set, and an outcome specification.

In more detail:

The system is composed of the following components, each of which may be individually as interpretable, transparent, and safe as you like when used in isolation, though the decision-making and modelling pieces are assumed to be at least human-level at their individual tasks: a world modeler W, an evaluation function over world states E, a pruning heuristic P, and a means of executing actions in the real world.

Each of these components may be implemented by a large neural network, GOFAI algorithms, a literal human (sometimes, e.g. in a human-in-the-loop setup), or any composition of such elements, as long as they perform their task at least as well and fast as a human could do so.

These components can then be arranged into performing a tree search with some simple glue code. Pseudo-python code:

W = deepAI.world_modeler_api()
E = deepAI.construct_evaluation_function_from_human_feedback(scored_example_worlds)
P = deepAI.pruning_heuristic_api()

# Look for a really good world
target_score = 10000000

def tree_search(world_state, choices):
  if E(world_state) > target_score:
    # We're done! Print a stack trace and send a global interrupt or something
    # to halt execution across the rest of the tree.
    halt_all_searches(world_state)
    return

  branches = P(world_state, E, choices)

  parallelfor action in branches:

    # Optional, but helpful if you don't want the world modeler
    # to start drifting too far from reality...
    observation = execute(action)

    predicted_world_state, predicted_actions = W(world_state, action, observation)
    tree_search(predicted_world_state, predicted_actions)

initial_world_state = W()
initial_actions = playwright.chromium
tree_search(initial_world_state, initial_actions)


(spoiler-ed, in case you want to take some time to consider this system on your own first. Most of the remarks in the spoiler-ed section below are just my own intuitions and rough, disorganized thoughts about how the system described above might behave.)

  • Even though the initial possible branching factor is very large, a good pruning heuristic should be able to keep the branching factor low enough to make the tree grow sub-exponentially, and low-probability branches can be pruned entirely.

    For example, a human pruner might come up with a top 10 list of initial high level plans, and then branch to the initial action for each of those plans. At each of those first 10 child nodes, the pruner will then prune the actions at each node down to one - the one action that continues the current plan. Maybe after several actions, the pruner decides the current subtree it is at the root of looks promising for containing high-scoring worlds, and branches out into a few substrategies, or conversely, that it doesn't look promising, and prunes the branch entirely, freeing up resources for searches in other parts of the tree.

  • Most of the work in this system is done by the pruning heuristic, which is essentially being asked to plan a course of action that results in a world with a high score.

    But P need not be agentic or goal-directed itself, any more than a human (or a suitably prompted GPT-N) is when asked to:

    • Come up with a list of ten or so high level plans to reach child world states with a high score on the given evaluation function.

    • Return the first action for each plan in the list.

    Or, given the current location in the tree (the step in the current plan), pick the next action to take (or multiple actions to branch to).

    The pruning heuristic might need to contain some coarse-grained version of the world modeler within itself, in order to plan effectively. But humans are capable of planning without having a detailed all-encompassing world model, so by assumption the pruning heuristic in this system can do so as well.

  • When to actually execute an action is also left up to the pruning heuristic to decide, subject to the time and space constraints. In the limit of a perfect or extremely superhuman world modeler that can perfectly predict the results of any action, you don't need to execute any actions at all. But for more realistic scenarios where it is feasible to do so without stepping on toes of other parts of the tree or incurring real-world expenses, you want to execute the actions, in order to actually reach the world state you want to reach and prevent the predicted world models from drifting from the actual world too much.

  • What is the meaning of the pruning heuristic taking the world state as an input? It just means that the pruning heuristic can "see": in the case of a human using a web browser, the human can see what page they've opened the browser to and whether or not it has loaded.

    I personally could probably order something from Amazon with my eyes closed (or at least write down the complete steps), or write a script that uses Playwright to do so and have it work on the first try, without testing it, but I couldn't get too much further than that.

  • In the limit of infinite time and space and perfect world modelling, the pruning heuristic can be very dumb: it can just search over every possible subtree. The better the heuristic the system has, the less time and space it needs to search for outcomes, and the more it starts to look like a strategic planner.

  • Note that the pruning heuristic might be composed of an LLM-based agent, prompted and arranged to plan tasks and choose actions.

  • For the world modeler, I'm imagining something like the world modeler in Dreamer that plays Minecraft or Atari games, generalized to the real world.
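To make the role of the pruning heuristic concrete, here is a toy, self-contained version of the search over a trivial string "world" (every component here is an invented stand-in; in the hypothetical system, W, E, and P are assumed to be at least human-level):

```python
from typing import Callable, List, Optional

def tree_search(world: str,
                actions: List[str],
                W: Callable[[str, str], str],              # toy world modeler
                E: Callable[[str], float],                 # evaluation function
                P: Callable[[str, List[str]], List[str]],  # pruning heuristic
                target: float,
                depth: int = 0,
                max_depth: int = 6) -> Optional[str]:
    """Depth-first toy search: halt as soon as a good-enough world is found."""
    if E(world) >= target:
        return world
    if depth >= max_depth:
        return None
    for action in P(world, actions):  # P keeps the branching factor small
        found = tree_search(W(world, action), actions, W, E, P,
                            target, depth + 1, max_depth)
        if found is not None:
            return found
    return None

# Toy instantiation: the world is a string, actions append characters,
# E rewards 'x' characters, P keeps only the 2 most promising actions.
W = lambda world, action: world + action
E = lambda world: world.count("x")
P = lambda world, actions: sorted(actions, key=lambda a: -E(W(world, a)))[:2]

good_world = tree_search("", ["x", "y", "z"], W, E, P, target=3)  # -> "xxx"
```

With three actions and depth 6, exhaustive search would visit hundreds of nodes; the pruning heuristic steers straight to a satisfying world. All the strategic work is in P, exactly as in the sketch.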

Why is this dangerous?

Maybe all the component pieces are perfectly interpretable, alignable, and non-agentic on their own. For suitably weak pieces and a judiciously-chosen evaluation function, the system safely performs useful steering work. But consider what happens when someone tries any of the following:

Any of the above have the potential to result in a system which is suddenly much better at steering, potentially in undesirable ways, even if the individual components are proved safe or "aligned" in isolation.

The system could suddenly become dangerously capable, without any component piece, or even the system as a whole, necessarily developing anything like a mesaoptimizer [? · GW], reflectivity, deceptiveness, or other more exotic failure modes.

Furthermore, if the example of OpenAI and LangChain is any indication, it will be easy to glue together subsystems, compared to the work of training the underlying models.

Even if access to the models is kept private (in a break with current trends), it could be the work of literally a single researcher in a lab deciding to experiment, either on their own or in a sanctioned way, to build the kind of system proposed above out of sufficiently powerful foundation models.

These hypothetical researchers may be more responsible than the general public, but unless they are at an operationally adequate [LW · GW] organization, I am not reassured that they will "just not build it" or "just not run it" in sufficiently powerful regimes.

Aside: a possible area of disagreement or differing intuitions

One of the reasons I find this example, and the case for pessimism more generally, compelling, is an intuition that it is easy, in some absolute sense, to re-arrange the matter in the universe in arbitrary ways.

This intuition comes from looking at the world, seeing what humans have accomplished in the realm of biotech and nanotech and other areas of science and engineering, and considering what is permitted in principle by the laws of physics.

Fundamentally, these laws appear to permit a world that is much more malleable than say, the rules of chess permit in the universe of a chess game, or the reachable states permitted in an Atari game (absent exploitation of bugs in the underlying game engine).

Another way of framing this intuition is that I expect, not too deep in the tree, there are action sequences that involve inventing nanotech, biotech, computer viruses, using deception, and a variety of other strategies that are probably catastrophic if used to reach certain world states. These sequences might be currently out of reach to merely human-level steering systems, but the system above need not stick around at human-level for very long, and furthermore the truly dangerous strategies seem to me to be only slightly superhuman.

Others seem to have the intuition that things like nanotech, human manipulation, or other strategies require far more superhuman abilities to be really dangerous or uncontrollable. Perhaps this is where the true crux is, though I think many of the intuitions about steering systems, and the implications that follow, are useful regardless of how difficult steering is in an absolute sense, or what the ultimate limits of re-arranging capability are.

How does alignment help?

It may be easy to crisply describe an outcome specification and have a system like the one described above search for outcomes which satisfy it. This corresponds to some aspects of alignment being easy - perhaps it turns out that "inner alignment" is not a problem at all, and the system as described above works perfectly well, given a sufficiently crisp and correctly specified evaluation function.

However, this implies that it is also easy to specify and give a bad evaluation function - an evaluation function that scores worlds composed entirely of tiny molecular squiggles highly is likely simpler to specify than one that captures the full complexity of human values.
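A toy illustration of this asymmetry, with entirely invented numbers: a crisp but wrong evaluation function takes one line, while the intended one resists short specification:

```python
# A crisp, simple outcome specification: count molecular squiggles.
# Easy to specify exactly - and exactly wrong.
def squiggle_score(world: dict) -> float:
    return world.get("squiggles", 0)

# The world the operators wanted vs. a degenerate world the search can reach.
intended_world = {"squiggles": 0, "flourishing_humans": 8_000_000_000}
squiggle_world = {"squiggles": 10**30, "flourishing_humans": 0}

# The search optimizes the score it was given, not the one that was meant:
assert squiggle_score(squiggle_world) > squiggle_score(intended_world)
```

Nothing here requires the system to be confused or deceptive; it is simply doing exactly what the one-line specification asked.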

Alternatively, perhaps the world modeler or pruning heuristic necessarily contain an inner-agent or mesaoptimizer once they are sufficiently capable, and these agents are not aligned with the outcome specification. In this case, we might run into the various problems of inner alignment.

Perhaps interpretability research will be useful for understanding how the pruning heuristic makes plans and chooses actions, or how the world modeler sees the world.

Or perhaps shard theory has something to say about how the system as a whole coheres, once it reaches a level of capability where it can search through action sequences that involve reflecting on itself.

Or perhaps some of the problems of embedded agency or corrigibility end up being useful for understanding how to make these systems safe.

I think there are many areas of alignment research which could potentially be useful for making a steering system like the one above:

The fundamental issue is that all of this also makes it easier to build a system which doesn't have those nice properties.

I suppose this is where AI governance comes in. There is one recent proposal that, if implemented, might be sufficient to prevent such systems from being built and run at all. I suspect there are ways to build sufficiently powerful systems that can be trained and run with fewer resources than GPT-4 though, so even the proposal above is not actually sufficient to prevent sufficiently good steering systems from being built and run.

In any case, it seems extremely unlikely to actually be implemented, so it is not a source of hope for me.


The main intuition that I hope to convey with this post is about why I feel relatively pessimistic about AI x-risk, even if certain problems in alignment turn out to be easy, or if some of the implications of the world-models of the doomier people turn out to be wrong (e.g. even if it turns out the first systems we build are capable of steering well enough to be used pivotally, but don't have any scary properties like convergent instrumental subgoals, deceptiveness, agency, or utility maximization).

This intuition comes from a bunch of directions:

I think analyses like this [LW · GW] which argue for lower chances of doom, based on a bunch of arguments about how current AI systems are scaling, are not really grappling with the actual case for pessimism. Perhaps the real cruxes are in the intuitions about how difficult reality-rearranging is, in an absolute sense.


Comments sorted by top scores.

comment by Liron · 2023-04-10T05:14:23.371Z · LW(p) · GW(p)

Great post! Agree with everything. You came at some points from a unique angle. I especially appreciate the insight of "most of the useful steering work of a system comes from the very last bits of glue code".