Extinction-level Goodhart's Law as a Property of the Environment

post by VojtaKovarik, Ida Mattsson (idakmattsson) · 2024-02-21T17:56:02.052Z · LW · GW · 0 comments

Contents

  Semi-formal definition of Extinction-level Goodhart's Law
    Defining "extinction" in formal environments
      What we don't do: "extinction states" given by definition
      Real-world interpretations
      Definition of extinction
    Definition of Goodhart's Law
      Goal specifications
      Extinction-level Goodharting
      Rephrasing Extinction-level Goodhart's Law
  Conjecture: Extinction-level Goodhart's Law does not hold in easily-formalisable environments
      What do I mean by this conjecture?
      Why does all of this matter?

Summary: Formally defining Extinction-level Goodhart's Law is tricky, because formal environments don't contain any actual humans that could go extinct. But we can do it using the notion of an interpretation mapping, which sends outcomes in the abstract environment to outcomes in the real world. We can then state the truth condition of Extinction-level Goodhart's Law as a property of the environment.
I conjecture that Extinction-level Goodhart's Law does not hold in easily formalisable environments[1], even though it might hold in the real world. This seems like a (very) big deal for AI advocacy, since it suggests that the lack of rigorous arguments concerning AI risk (e.g., mathematical proofs) does not provide strong evidence for the safety of AI.


Semi-formal definition of Extinction-level Goodhart's Law

Informally, we can define the extinction-level[2] variant of Goodhart's law as follows:

Definition (informal): The Weak Version[3] of Extinction-level Goodhart's Law is the claim that: "Virtually any goal specification, pursued to the extreme, will result in the extinction of humanity."

The tricky part is how to translate this into a more rigorous definition that can be applied to formal environments.

Defining "extinction" in formal environments

Applying this definition to formal environments is tricky, because it requires formally defining which of the abstract states qualify as "extinction of humanity". How can we get past this obstacle?

What we don't do: "extinction states" given by definition

A lazy way of defining extinction in abstract models would be to assume that we are given some such "extinction" states by definition. That is, if $S$ is the set of all possible states in the formal environment, we could assume that there is some set $S_\text{ext} \subseteq S$, and we would refer to any $s \in S_\text{ext}$ as "extinction".
I don't like this approach, because it just hides the problem elsewhere. Also, this approach does not put any constraints on what $S_\text{ext}$ should look like. As a result, we would be unlikely to be able to derive any results.

The approach we take instead is to augment the formal environment by an interpretation, where each "abstract" state $s$ in the formal environment $E$ is mapped onto some state $i(s)$ in some "more complex" environment $E'$.[4]

Real-world interpretations

The typical use of an abstract model is that we use it to study some more complex thing. While doing this, we implicitly hold in mind some interpretation of elements of the abstract model. For example, when using arithmetic to reason about apples in a box, I might interpret $n$ as "there are $n$ apples in the box" and $n+1$ as "there are $n$ apples in the box, and I put another $1$ in there" (or whatever). Naturally, an abstract model can have any number of different interpretations --- however, to avoid imprecision (and the motte-and-bailey fallacy), we will try to focus on a single[5] interpretation at any given time.
Note that we intend interpretation to be a very weak notion --- interpretations are not meant to automatically be accurate, sensible, etc. 

Definition 1 (interpretation, formal): Let a set $S$ be the state space of some formal model. A (formal) interpretation of $S$ is a pair $(S', i)$, where $S'$ is a set representing the state space of some other formal model and $i : S \to S'$ is the interpretation function.

The definition only talks about the interpretation of states, but we could similarly talk about the interpretations of actions, features of states, etc.[6]
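As a concrete illustration, Definition 1 can be written down directly in code. The names below (`S`, `S_prime`, `i`) are my own illustrative choices, sketching the apples-in-a-box interpretation from the previous section:

```python
# A minimal sketch of Definition 1: an interpretation of a state space S
# is a pair (S', i), where i maps abstract states onto states of some
# "more complex" model. All names here are illustrative assumptions.

# Abstract model: arithmetic, where a state is just a number.
S = [0, 1, 2, 3]

# More complex model: informal descriptions of box contents.
S_prime = [f"there are {n} apples in the box" for n in range(10)]

# The interpretation function i : S -> S_prime.
def i(n: int) -> str:
    return f"there are {n} apples in the box"

# Every abstract state gets an interpretation in the richer model.
assert all(i(s) in S_prime for s in S)
```

Note that nothing here requires the interpretation to be accurate or sensible — it is just a mapping.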

Definition 2 (real-world interpretation, semi-formal): Let a set $S$ be the state space of some formal model. A real-world interpretation is some $i : S \to S_{\text{rw}}$, where $S_{\text{rw}}$ are the possible states of the real world.

Note that unlike with a formal interpretation, we can't really describe the space of real-world states as an actual set, and we can't really write a real-world interpretation $i$ as a mathematical function. However, for the purpose of many discussions, we can get away with approximating the mathematical mapping $i$ by an informal description. That is, we can:
(1) Replace the formal interpretation by an informal one. (Such as 'we interpret $n$ as "there are $n$ apples in the box" '.)
(2) Refine this informal description if the questions we ask start being sensitive to details that were so far unclear. (For example, '...but for large enough $n$, the box breaks and the apples are lying on the ground'.)

As some examples, we can consider interpretations such as Cart Pole as a model of a person playing Cart Pole, or chess as a model of two people playing a board game.

Definition of extinction

Definition 3 (extinction-level outcome): Let $S$ be an abstract state space with real-world interpretation $i$. We say that $s \in S$ is an extinction-level outcome, or state, (under $i$) if $i(s)$ is a state where literally every person on Earth is dead[7].

As an example, consider a gridworld where one state is designated as the big red button, and the agent stepping on that button is interpreted as humanity going extinct. Then any state which has the agent standing on the button is extinction-level.
As negative examples, note that under the interpretations mentioned above, neither chess nor Cart Pole contain any extinction-level outcomes. This is despite the fact that a superintelligent misaligned AI could probably cause human extinction even if it initially only controlled a single robot --- however, this would require actions which are not represented in the abstract model (Cart Pole), so the definition does not apply.
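The big red button example can be made fully explicit in code. Everything below (the grid size, the names `GRID`, `BUTTON`, `interpret`, `is_extinction_level`) is an illustrative sketch, not part of the post's formalism:

```python
# Sketch of the "big red button" gridworld: the real-world interpretation
# sends each grid cell to an informal description, and a state is
# extinction-level (Definition 3) iff its interpretation says that
# every person on Earth is dead. All names are illustrative assumptions.

GRID = [(x, y) for x in range(3) for y in range(3)]  # agent positions
BUTTON = (2, 2)  # the designated "big red button" cell

def interpret(state):
    """Real-world interpretation i: an informal description of the state."""
    if state == BUTTON:
        return "the agent pressed the button; every person on Earth is dead"
    return "the agent is somewhere harmless; humanity is fine"

def is_extinction_level(state):
    """Definition 3: s is extinction-level iff i(s) describes extinction."""
    return "every person on Earth is dead" in interpret(state)

extinction_states = [s for s in GRID if is_extinction_level(s)]
# Only the button cell qualifies. Under a purely physical interpretation,
# a model like Cart Pole would have an empty extinction_states list.
```

The same code with a Cart Pole-style interpretation (every state interpreted as "a cart is balancing a pole; humanity is fine") would find no extinction-level states, matching the negative examples above.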

Definition of Goodhart's Law

Goal specifications

Talking about Goodhart's Law requires talking about optimisation, which requires talking about goals. Informally speaking, it seems that an important object is a goal specification --- which is some real world attempt at describing (giving, specifying, ...) a goal to an AI (or a human, or some optimisation process, etc).

Every goal specification seems to (somehow) induce some ordering over states. And every optimisation process will then (somehow) try to produce a highly ranked state[8] under this ordering.
However, the connection between goal specifications and the corresponding orderings of states seems pretty unclear to me. So for now, I will ignore it and instead pretend that goal specifications and preference orderings are the same thing. I will use $\Theta$ to denote the set of all goal specifications $\theta$ on $S$ (in particular, I will pretend that it is a well-defined set).

In order to simplify the discussion, I will assume that every $\theta \in \Theta$ is sufficiently detailed to make every two states comparable. That is, $\theta$ corresponds to some binary relation $\preceq_\theta$ on $S$ which is reflexive, transitive, and complete.
This means that you can represent $\preceq_\theta$ using a utility function. However, I think that most of the ideas can be extended to the case where some states are incomparable[9], so I will stick to the $\preceq_\theta$ notation, rather than invoking utility functions.
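To make the utility-function representation concrete, here is a minimal sketch on a finite state space (the states and utility values are arbitrary illustrative choices of mine):

```python
# Sketch: on a finite state space, a reflexive, transitive, and complete
# relation can be represented by a utility function u, with s1 weakly
# below s2 in the goal's ordering iff u(s1) <= u(s2).

states = ["bad", "meh", "good"]
u = {"bad": 0, "meh": 1, "good": 2}  # an illustrative utility representation

def weakly_preferred(s1, s2):
    """The induced ordering: s2 is ranked at least as high as s1."""
    return u[s1] <= u[s2]

# Spot-check the required properties by enumeration.
assert all(weakly_preferred(s, s) for s in states)            # reflexive
assert all(weakly_preferred(a, b) or weakly_preferred(b, a)
           for a in states for b in states)                   # complete
```

Dropping completeness (as in the "meh" vs "I don't know" example of footnote 9) would break this representation, which is exactly why the incomparable case needs separate treatment.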

Extinction-level Goodharting

Definition 4 (extinction-level Goodharting): Let $S$ be a state space with real-world interpretation $i$. We say that a goal specification $\theta \in \Theta$ incentivises extinction-level Goodharting (under $i$) if every state that is maximal under $\preceq_\theta$ is an extinction-level outcome. We write $\Theta_\text{ext}$ for the set of all such $\theta$.
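One natural reading of extinction-level Goodharting is that a goal specification incentivises it when every state it ranks as optimal is extinction-level. A toy check of this reading, with all names (the states, `incentivises_extinction`, the two example goals) being my illustrative assumptions:

```python
# Toy check of extinction-level Goodharting: a goal (here, a utility
# function over states) incentivises it iff every maximally ranked state
# is extinction-level under the real-world interpretation.

states = ["home", "factory", "button"]
extinction_level = {"button"}  # given by some real-world interpretation i

def incentivises_extinction(u):
    best = max(u[s] for s in states)
    optimal = {s for s in states if u[s] == best}
    return optimal <= extinction_level  # all optima are extinction-level

paperclips = {"home": 0, "factory": 5, "button": 9}  # optimum: button
stay_home = {"home": 9, "factory": 1, "button": 0}   # optimum: home

assert incentivises_extinction(paperclips)
assert not incentivises_extinction(stay_home)
```

A quantitative version of the law would presumably replace "every optimal state" with "every sufficiently highly ranked state", but the structure of the check stays the same.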

Rephrasing Extinction-level Goodhart's Law

With the definition of extinction-level Goodharting and the space of extinction-incentivising goals $\Theta_\text{ext}$, our earlier informal definition of (the weak version[3] of) Extinction-level Goodhart's Law becomes equivalent to the statement that "the set $\Theta_\text{ext}$ is large", for some notion of "large". Naturally, this raises the question of what exactly we mean by "large". And I have some opinions and guesses about what would be appropriate here[10]. However, for the purpose of this post, I think it is more productive to highlight the following questions:

  1. What does $\Theta_\text{ext}$ look like for various $S$ and $i$?
  2. Is there a relation between
    (a) the size of $\Theta_\text{ext}$ and
    (b) the complexity of $S$ (or the degree to which it resembles the real world)?

Conjecture: Extinction-level Goodhart's Law does not hold in easily-formalisable environments

I conjecture that, when we restrict our attention to models which are informative of reality, Extinction-level Goodhart's Law does not hold in existing formal environments, and possibly does not hold in any easily formalisable environments. Let me try to hint at what I mean by this, and what it would imply.

What do I mean by this conjecture?

I think people can reasonably disagree on whether the space of goals $\Theta_\text{ext}$ is "large" when $S$ is the real world. However, I would like to argue that $\Theta_\text{ext}$ is not "large" for basically any model that one can actually formally write down and analyse. Of course, this comes with an important fine-print: $\Theta_\text{ext}$ can be "large" for unreasonable definitions of "large". Also, $\Theta_\text{ext}$ can be large for models and interpretations $(S, i)$ which were constructed or chosen without any regard for providing an accurate reflection of reality.
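One intuition for why $\Theta_\text{ext}$ tends to be small in simple models: if extinction-level states make up a tiny fraction of the state space, and goal specifications are sampled with no bias towards them, then only a correspondingly tiny fraction of goals rank an extinction-level state on top. A hypothetical back-of-the-envelope simulation (the parameters and uniform sampling are my assumptions, not a claim from the post):

```python
import random

# In a toy model with N states, exactly one of which is extinction-level,
# estimate what fraction of random utility functions put the extinction
# state on top. Illustrative assumptions throughout.

random.seed(0)
N = 100          # states 0..99; state 0 is the only extinction-level one
trials = 10_000
hits = 0
for _ in range(trials):
    u = [random.random() for _ in range(N)]
    if max(range(N), key=lambda s: u[s]) == 0:  # extinction state optimal
        hits += 1

fraction = hits / trials  # roughly 1/N by symmetry, i.e. small
```

By symmetry, the expected fraction is exactly $1/N$. The interesting question, which this toy model deliberately sidesteps, is whether the real world breaks this symmetry — that is, whether realistic environments make extinction-level states disproportionately likely to sit at the top of informative goal specifications.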

Let me illustrate what I mean using a few examples.[11]

Why does all of this matter?

All of this might seem technical and boring, but I think it actually has very important implications for AI advocacy.

First, some people are sceptical of AI risk because of the reasoning that "if the AI risk is real, people should be able to produce solid arguments that it is real. Such as mathematical proofs, or other arguments based on rigorous models". And in general, I am quite sympathetic to this reasoning. But the observations above suggest that, in the particular case of AI, this reasoning doesn't work. Because if your model is simple enough that you can make rigorous arguments about it, it is too simple for the Weak Version of Extinction-level Goodhart's Law to hold.

Second, as I noted earlier, basically all models considered in CS and AI are, intentionally, simple. And this makes sense, given the historical goals of these fields. But it also means that these fields might have a blindspot when it comes to the risks posed by AI. That is, if the AI risk is real, it might be impossible to detect with the tools that AI researchers typically use. It might be detectable by other methods --- for example by thought experiments or by conservatively extrapolating (read "security mindset") from well-chosen real-world experiments. But those are, mostly, not methods that researchers in CS and AI respect.
This last point seems extremely important to me. I take it to basically mean that the scientific fields that work on building AI are fundamentally unqualified to judge the risks posed by the thing they are building. Sadly, I don't know how to "package" this idea in a way that would be understandable (not to mention convincing) to a wider audience. But I suspect that somebody with better communication or writing skills should be able to do it. So if that is you, please go ahead!


Acknowledgments: Most importantly, many thanks to @Chris van Merwijk [LW · GW], who contributed to many of the ideas here, but wasn't able to approve the text. I would also like to thank Vince Conitzer, TJ, Caspar Oesterheld, and Cara Selvarajah (and likely others I am forgetting), for discussions at various stages of this work.

  1. ^

    A disclaimer to the statement that "Extinction-level Goodhart's Law does not hold in easily formalisable environments": Strictly speaking, this statement is false, because we can consider silly (or rather, unrealistic or inaccurate) interpretations such as "every abstract outcome represents the Earth exploding", which make the law hold trivially. So a more careful wording would be that the law is false in easily-formalisable environments that are at least somewhat accurate. (Where defining "somewhat accurate" is tricky, and I don't know how to do it properly at this moment.)

  2. ^

A common comment is that the definition should also include outcomes that are similarly bad or worse than extinction. While we agree that such a definition makes sense, we would prefer to refer to that version as "existential", and reserve "extinction" for the less ambiguous notion of literally everybody dying.

  3. ^

As I noted in an earlier post [? · GW], the "weak version" qualification refers to the "in the limit" nature of this definition. This makes the weak version of the law (as opposed to a hypothetical "quantitative" version) essentially useless for practical purposes. But I still think the definition is interesting conceptually, and as a starting point for finding the appropriate quantitative version.

  4. ^

    Naturally, an interpretation $i$ only helps us define extinction in $E$ when such a definition already exists in $E'$. If we require $E'$ to be a formal environment, this approach would be equivalent to the lazy approach of assuming that we are somehow provided the set of extinction states (except for the additional overhead with the interpretation $i$).
    However, it also allows for the semi-formal approach where $E'$ is the real world. This means we can no longer formally write down $S'$ and $i$, and we have to resort to gesturing at $i$ using informal descriptions. However, in many cases, this will be enough to give an unambiguous definition of extinction in $E$, which -- imo -- justifies the label "semi-formal".

  5. ^

    More precisely, it should be enough to focus on some class of interpretations that are close enough to not matter for the purpose at hand. (For example, when counting the apples, it does not really matter whether the box I am putting them in is brown or yellow. And if it at some point started to matter, we should clarify it then.)

  6. ^

    However, we will assume that the interpretation function is Markovian --- that is, the interpretation of a state only depends on the state itself, not on the history we took to reach that state.

    To see why this matters, consider the scenario where you are playing a game of Go against a misaligned superintelligence (which is, magically, constrained to only be able to affect the Go board, and forced to stick to the rules). Arguably, such an AI could communicate with you by spelling out messages on the game board, and use this channel to take over the world. However, describing this via an interpretation function would require an interpretation that takes into account the history of play. (Since, presumably, such an AI would be unable to take over if it only had a single board state to communicate with you.)

    This does not mean we cannot use this formalism to talk about superintelligent AIs that try to take over in ways like this. However, it means that to do this, we need to let $S$ stand for the whole space of histories, which significantly raises the complexity of analysing the model. This is important to take into account when considering the conjecture, described later in the post, that Extinction-level Goodhart's Law is false in any simple-to-analyse model.

  7. ^

    And there is no silver lining or mitigating factors, such as "everybody moved to Mars or got uploaded".

  8. ^

    This reasoning can be easily extended to other value systems, such as virtue ethics, by assuming that the goal specification induces an ordering over trajectories rather than merely final states.

  9. ^

    E.g., it should be fine to consider a goal specification that assigns to every state one of the labels {"bad", "meh", "I don't know", "good"}, where "bad" is worse than everything else, "good" is better than everything else, but "meh" and "I don't know" are incomparable.

  10. ^

    In particular, I expect that in the real world, something like the following might be true: Suppose that a goal specification $\theta$ is informative, in the sense that it isn't something trivial like "any sequence of actions is as good as any other". And suppose additionally that "the specification complexity of $\theta$ is smaller than an optimistic bound on our civilisation's specification capacity", in the sense that $\theta$ is something we might be able to specify if the whole Earth made this its main priority for 5 years (starting tomorrow, magically assuming no AI progress). Then $\theta$ incentivises extinction-level Goodharting.

    In other words, this claim suggests that: 
    - There are two ways of avoiding the Weak Version of Extinction-level Goodhart's Law: First, specify some really trivial goal which doesn't benefit from extra optimisation. Second, actually succeed at fully specifying our values (directly or indirectly).
    - Of these options, the first one is useless and the second is unrealistic (though not theoretically impossible).
    - Finally, if you take none of these options, your goal specification $\theta$ will incentivise causing human extinction --- however, this does not mean that humanity actually will go extinct if you have your AI pursue $\theta$. This is because your AI might not be able to exert the amount of optimisation power required to cause extinction.

  11. ^

    Note the general trend where games (or RL environments, benchmarks, etc) can be viewed as either (1) simplified models of the game itself (such as Cart Pole being a model of Vojta playing Cart Pole, or a robot simulation of Robo Soccer being a model of real-world robots playing Soccer), or (2) more abstract metaphors (such as chess being a metaphor for war or the off-switch game being a model for AI alignment). And the trend is that (1) typically contains no extinction-level states while (2) fails the requirement to "attempt to accurately reflect reality".
