Dynamics Crucial to AI Risk Seem to Make for Complicated Models

post by VojtaKovarik, Ida Mattsson (idakmattsson) · 2024-02-21T17:54:46.089Z · LW · GW · 0 comments

Contents

  Laundry List of Dynamics Closely Tied to AI Risk
    I. Difficulty of specifying our preferences[2]
    II. Human extinction as a convergent byproduct of terminal goals[3]
    III. Human extinction as a convergently instrumental subgoal[4]
    IV. Most attempts to constrain an AI's actions fail for superintelligent AIs[5]
  Popular AI Benchmarks Seem to Lack Dynamics Relevant to AI-risk
  Are Informative Models Necessarily Complicated?
None
No comments

This post overlaps with our recent paper Extinction Risk from AI: Invisible to Science?.

tl;dr: In AI safety, we are worried about certain problems with using powerful AI. (For example, the difficulty of value specification, instrumental convergence, and the possibility that a misaligned AI will come up with takeover strategies that didn't even occur to us.) To study these problems or convince others that they are real, we might wish to describe them using mathematical models. However, this requires using models that are sufficiently rich that these problems could manifest in the first place.
In this post, I suggest thinking about how such "rich-enough" models could look like. Also, I raise the possibility that models which are rich enough to capture problems relevant to AI alignment might be too complex to be amenable to a rigorous analysis.

Epistemic status: Putting several related observations into one place. But I don't have strong opinions on what to make of them.


In the previous post [LW · GW], I talked about "straightforwardly evaluating" arguments by modelling the dynamics described in those arguments. In this post, I go through some dynamics that seem central to AI risk. However, none of these dynamics is meant to be novel or surprising. Instead, I wish to focus on the properties of the mathematical models that could capture these dynamics. How do such models look like? How complicated are they? And --- to the extent that answering some questions about AI risk requires modeling the interplay between multiple dynamics --- is there some minimal complexity of models which can be useful for answering those questions?

Laundry List of Dynamics Closely Tied to AI Risk

In this section, I list a number of dynamics that seem closely tied to AI risk, roughly[1] grouped based on which part of the "AI risk argument" they relate to. Below each part of this list, I give some commentary on which models might be useful for studying the given dynamics. I recommend reading selected parts that seem interesting to you, rather than going through the whole text.

For the purpose of skimming, here is a list of the dynamics, without any explanations:

I. Difficulty of specifying our preferences[2]:

  1. Human preferences are ontologically distant from the laws of physics.
  2. Human preferences are ontologically distant from the language we use to design the AI.
  3. Laws of physics are unknown.
  4. Human preferences are unknown.

II. Human extinction as a convergent byproduct of terminal goals[3]:

  1. The world is malleable.
  2. The world is made of resources.
  3. Humans evolved to require a narrow range of environment conditions

III. Human extinction as a convergently instrumental subgoal[4]:

  1. The environment has been optimised for our preferences.
  2. Humans are power-seeking.
  3. Power is, to some extent, zero-sum.

IV. Most attempts to constrain an AI's actions fail for superintelligent AIs[5]:

  1. Specifying restrictions is difficult for the same reasons that value specification is difficult.
  2. The AI can act by proxy.
  3. The AI can exploit novel strategies and technologies.
  4. The AI, and everything constraining it, is fully embedded in the environment.

I. Difficulty of specifying our preferences[2]

A key part of worries about AI risk is that formally writing down what we want --- or even somehow indirectly gesturing at it --- seems exceedingly difficult. Some issues that are related to this are:

  1. Concepts that are relevant for specifying our preferences (e.g., "humans'" and "alive'') on the one hand, and concepts that are primitive in the environment (e.g., laws of physics) on the other, are separated by many levels of abstraction.
  2. Consider the ontology of our agents (e.g., the format of their input/output channels, or the concepts they use for "thinking"). Then this ontology is (a) also far removed from fundamental physics and (b) different from our ontology.
  3. We don't even know what is the correct language in which to describe low-level reality (ie, fundamental physics).
  4. Even ignoring all of the above problems, we do not actually know what our preferences are.

If we want to use mathematical models to learn about these issues, we need to use models in which these issues can arise in the first place. I am not sure how a good model would look like. But I can at least give a few illustrative comments:

II. Human extinction as a convergent byproduct of terminal goals[3]

The above-mentioned difficulty of specifying our preferences is, by itself, not dangerous. For example, it could, hypothetically, be possible that we never figure out how to get AI to do what we want, but all of the resulting failures are benign. (Perhaps "sleeping all day long" is a convergently instrumental goal of any strong optimiser?)

However, the "difficulty of specifying our preferences" problem does become dangerous when combined with other issues. One such issue is that, conceivably, most terminal goals[6] are such that doing well according to them results in the extinction of humanity as a byproduct. Some claims closely tied to this issue are:

  1. Essentially anything in the environment can be dismantled for parts or energy.
  2. Access to materials and energy seems, all else being equal, instrumantal for a wide range of goals.
  3. Humans (and life on Earth in general) have been heavily optimised the current conditions. As a result, significant disruptions to the environment are likely to be fatal for us.

Some notes on models that can be informative regarding these dynamics:

III. Human extinction as a convergently instrumental subgoal[4]

Another issue that can prove fatal, when combined with inability to accurately specify our preferences, is that even a terminal goal doesn't cause human extinction as a byproduct, many goals might still make it instrumental to get rid of humanity. Some related claims are:

  1. The current state of the world has already been heavily optimised according to human preferences. This means that, all else being equal, any disruptions of the environment are likely to be undesirable from our point of view.
  2. (i) The preferences of many humans, and probably humanity as a whole, include strong opinions on how the universe should look like in the future.
    (ii) As a result, humanity is likely to be in conflict with any agent that wants to transform the environment (even non-fatally), or that is power-seeking[8].
  3. For a wide range of goals, the best way to prevent human interference is to effectively exterminate humanity.

Some comments on capturing these dynamics in mathematical models:

IV. Most attempts to constrain an AI's actions fail for superintelligent AIs[5]

Finally, some people concede that AI might form the "desire" to act in ways that are harmful to humanity, but they suggest that we might be able to prevent the AI from acting on these desires. This could be done either by physically preventing the AI from taking certain actions (eg, not plugging it to the internet) or by modifying the AI's design (eg, giving the AI a negative reward for connecting to the internet). However, there are worries that a sufficiently capable AI would be able to circumvent any such restrictions. Some proposed ways an AI could do this are:

  1. The AI could exploit loopholes the specification in any internal restrictions placed on it.[9]
  2. The AI could act through others --- tools that can be used, weaker agents that can be instructed, or powerful agents and optimisation processes that can be nudged.
  3. The AI could also use strategies that take us by surprise, such as inventing new technologies, exploiting laws of physics we are not aware of yet, or simply taking actions its designers didn't think of.
  4. The AI can exploit the fact that it, and everything affecting it, is embedded in the world. That is, it can attack the physical implementations of any restrictions -- both external and internal.

As before, the key question we want to ask is which mathematical models would allow us to learn about these issues. Here are some thoughts on this topic:

As an exercise, I find it helpful to go through your favourite models (AI environments, benchmarks, etc), and think about which of the above-mentioned dynamics can possibly arise in them.

I recommend being somewhat explicit about which interpretation of the model at hand you are considering: For example, it is valid to think about game-theoretical chess as a model of Alice and Bob playing a game of chess in the real world. It is also valid to view chess as a (rather inaccurate) metaphor for war between two medieval kingdoms. But we should avoid the motte-and-bailey fallacy where we shift between the interpretations. (EG, the game-theoretical model of chess can be accurate when used to model real-world chess. Or it can allow for people dying, when used as a metaphor for war. But claiming that it is accurate and allows for people dying would be fallacious.)

More generally, we want a model that is both (i) somewhat accurate and (ii) rich enough to model some version of the AI-risk argument. And, across all models that I am aware of, models which satisfy (ii) make no attempt at (i) --- that is, they are good enough to conclude that "if the world worked like this, AI risk would be an issue". But they are not informative as to whether the world actually is like that or not.
However, my search for good models has not been very exhaustive. So if you can think of any relevant models, please share them!

Are Informative Models Necessarily Complicated?

Finally, when trying to come up with mathematical models to investigate certain alignment proposal, I kept encountering the following issue: I wanted to have a model that is as simple as possible, such that I can rigorously analyse what the algorithm will do. And I came up with models where the proposal would work. But then it turned out that in those models, one could just as well use RLHF (which is easier to implement and more competitive). In other words, these models were too simple and did not manifest the problems that the algorithm was trying to solve. But when I imagined extensions of the models, rich enough that the problems could arise there, the resulting models were always too complicated for me to analyse the algorithm formally. And instead of doing formal math, I was back to vaguely gesturing at things like "but the your goal specification will have loopholes".[11]

This experience leads me to conjecture that perhaps there is something more fundamental at play. That is, I imagine that for every problem one might wish to study, there is some lower bound on the complexity of models that are still informative enough that studying them can tell us interesting things about reality. And perhaps the topic of AI risk is such that models that can teach us something about it are necessarily very complicated? Perhaps even complicated enough that formally analysing them is practically infeasible for us, at least at this point?

I am not fully sold on this point. For example, I expect that we might be able to, somehow, cleverly decompose the AI-risk question into mostly-independent parts, and study each of them separately, using sufficiently simple models. But I think that the possibility --- that some problems are real, yet too complex to be amenable to formal analysis --- is worth taking seriously.


Acknowledgments: Most importantly, many thanks to @Chris van Merwijk [LW · GW], who contributed to many of the ideas here (but wasn't able to look over this post, so I am not sure if he approves of it or not). I would also like to thank Vince Conitzer, TJ, Caspar Oesterheld, and Cara Selvarajah (and likely others I am forgetting), for discussions at various stages of this work.

  1. ^

    The list of AI-risk dynamics is not meant to be exhaustive or authoritative. The goal is to give an intuition for how mathematical models (used to studying AI risk) could look like.

  2. ^

    One intuition for this is that human values are complex (see for example The Hidden Complexity of Wishes [LW · GW]). However, an independent problem is the difficulty of robustly pointing an optimiser towards anything at all (see Diamond maximizer).

  3. ^

    For some intuitions, see Edge instantiation.

  4. ^

    For an intuition, see the Catastrophic Convergence Conjecture [LW · GW].

  5. ^

    For some intuitions, see Patch resistance and Nearest unblocked strategy.

  6. ^

    Note that the issue [of human extinction as a by-product of achieving terminal goals] is not something that can be easily solved by approaches such as "just use virtue ethics". That is, the issue arises for any system that optimises a goal. And this can include goals such as "follow this operationalisation of virtue ethics", as long as there is (for example) a strong selection pressure towards policies which result in fewer cases where the system fails to follow the rules. (After all, following virtue ethics is likely to be much easier if all the humans and other moral patients are dead.)

  7. ^

    However, perhaps it could be interesting to add entropy-like dynamics into environments, in order to give agents convergently instrumental drives for resource acquisition. (For example, in some computer games, constant monster attacks can serve as pressure for accruing resources to use for defense.)

  8. ^

    Another way to view (2) is that humans are power-seeking, and power seems zero-sum (beyond some level). So any powerful power-seeking agent might find it instrumental to get rid of humanity preemptively.

  9. ^

    Specifying the restrictions without any loopholes can be difficult for the same reasons that it can be difficult to specify our "primary" preferences.

  10. ^

    If we imagine that some patterns can be viewed as agents.

  11. ^

    I was able to come up with models where, for example, my favourite algorithm works, RLHF doesn't work, and the model is still simple enough to analyse formally. However, this always came at the cost of realism --- I had to give up any aspirations for the model to be a faithful (though extremely simplified, and thus inaccurate) representation of reality.

0 comments

Comments sorted by top scores.