Prosaic misalignment from the Solomonoff Predictor

post by Cleo Nardo (strawberry calm) · 2022-12-09T17:53:44.312Z · LW · GW · 3 comments

Contents

  The Solomonoff Predictor is malign.
  The Monte-Carlo Predictor is malign.
    Three levels of simulator leaks.
  Prosaic misalignment from Simulator Leaks.
  Moral of the story
3 comments

When I first read Paul Christiano's post, I figured it had little relevance to prosaic alignment. But is that true? Is Solomonoff misalignment a problem that could actually arise on software running on GPUs over the next 10 years?

The Solomonoff Predictor is malign.

FACT 1: There's a certain type of machine $P$ called a predictor. You tell $P$ a bunch of facts $D$ and then ask it a question $Q$. The predictor will then output the probability of $Q$ given $D$.

FACT 2: The optimal predictor is the Solomonoff Predictor $S$, for some natural sense of optimality.

What's the Solomonoff Predictor? Imagine every possible world is a binary string generated by a computer program, and imagine that the prior likelihood of a program $p$ is $2^{-|p|}$, where $|p|$ is the length of $p$ in bits. The Solomonoff Predictor corresponds to that prior.

In other words,

$$S(Q \mid D) = \frac{\Pr[D \wedge Q]}{\Pr[D]}$$

where

$$\Pr[X] = \sum_{p \,:\, X \text{ is true in the world generated by } p} 2^{-|p|}.$$
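To make the length-weighted prior a bit more concrete, here is a minimal Python sketch. It is not the real Solomonoff Predictor, which is uncomputable. It assumes a hypothetical toy interpreter `run` that just repeats a program's bits to fill a 4-bit world, and it only enumerates programs of up to 4 bits. The names `toy_programs`, `run`, `weight` and `predict` are mine, not from Paul's post.

```python
from fractions import Fraction
from itertools import product


def toy_programs(max_len):
    """Yield every bit string of length 1..max_len (a stand-in for 'all programs')."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def run(program, world_len=4):
    """Hypothetical toy interpreter: repeat the program's bits to fill the world."""
    reps = -(-world_len // len(program))  # ceiling division
    return (program * reps)[:world_len]


def weight(holds, max_len=4):
    """Total prior weight 2^-|p| of the programs whose world satisfies `holds`."""
    return sum(
        Fraction(1, 2 ** len(p))
        for p in toy_programs(max_len)
        if holds(run(p))
    )


def predict(question, facts, max_len=4):
    """Probability of `question` given `facts` under the toy length-weighted prior."""
    both = weight(lambda w: facts(w) and question(w), max_len)
    return both / weight(facts, max_len)


# Facts: the world starts with 0. Question: is the second bit also 0?
print(predict(question=lambda w: w[1] == "0", facts=lambda w: w[0] == "0"))  # 5/8
```

In this toy setup the answer is 5/8 rather than 1/2, because short programs get more weight and short programs generate regular worlds.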

FACT 3: Paul Christiano worries that the Solomonoff Predictor is malign. Why? Because some of these computer programs $p$ will simulate "gremlins". These gremlins are consequentialist agents who care about influencing the output of the Solomonoff Predictor, and they can influence its output by making the question $Q$ true in their universe.

Let's give a concrete example.

Suppose Alice finds a mysterious box with a big red button. She suspects the box makes delicious ice-cream, and so she feeds into the Solomonoff Predictor $S$ all the facts she's ever observed, including all the facts about the box. Let's call all that data $D$. And then she asks the oracle the question "is the box an ice-cream maker?". Let's call that question $Q$. If the oracle outputs a high probability, then she'll press the button anticipating delicious ice-cream.

Let's also suppose that unbeknownst to Alice, the box is actually a gremlin-generator that would unleash a bunch of gremlins.

Okay, what would the Solomonoff Predictor output?

Well, there are going to exist some programs $p$ for which all the following facts are true:

What would the gremlins in such a program $p$ do? Well, they'd send parallel-Alice a box with an ice-cream maker. Then $S$ will output a slightly higher probability for $Q$. And then Alice might press the button, unleashing gremlins into our universe.

Now, here are some arguments suggesting this isn't practically relevant:

The Monte-Carlo Predictor is malign.

We can tell a similar story about a different predictor, which is also computationally infeasible.

Imagine a predictor $M$ which makes predictions using Monte-Carlo simulations. When Alice tells $M$ a bunch of facts $D$ and then asks it a question $Q$, here's what $M$ does: First, the predictor $M$ will loop through all possible worlds $w$ and simulate $w$. Then it checks whether $D$ is true in $w$. If $D$ is true in $w$ then $M$ increments the variable $n_D$, and if $D \wedge Q$ is true in $w$ then $M$ increments the variable $n_{D \wedge Q}$. After the loop, $M$ returns the ratio $n_{D \wedge Q} / n_D$.
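The loop described above can be sketched in a few lines of Python. This is only a toy under my own assumptions: a "world" is a tuple of 4 bits, "simulating" it is trivial, and the facts and the question are passed in as plain predicates. The function and variable names are mine.

```python
from itertools import product


def monte_carlo_predict(facts, question, n_bits=4):
    """Loop over every toy world and return (#worlds with facts and question) / (#worlds with facts)."""
    n_facts = 0  # worlds in which the facts hold
    n_both = 0   # worlds in which the facts and the question both hold
    for world in product((0, 1), repeat=n_bits):  # "all possible worlds"
        if facts(world):
            n_facts += 1
            if question(world):
                n_both += 1
    if n_facts == 0:
        raise ValueError("no simulated world satisfies the facts")
    return n_both / n_facts


# Facts: the first bit is 1. Question: are the second and third bits both 1?
print(monte_carlo_predict(lambda w: w[0] == 1, lambda w: w[1] == 1 and w[2] == 1))  # 0.25
```

The real version would have to enumerate and simulate entire universes, which is why this predictor is computationally infeasible.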

So there's a universe containing Alice and containing a machine running simulations of a universe containing gremlins. Alice asks the machine whether the box contains an ice-cream maker. What could possibly go wrong?

Three levels of simulator leaks.

Level-1 Leak.
The gremlins can influence the output of the machine while remaining entirely within their simulation. This is basically the concern Paul Christiano raises about the Solomonoff Predictor. Even if the gremlins "stay in the box", they can influence the ratio the machine returns by influencing whether the facts $D$ and the question $Q$ are true in their simulation.

Level-2 Leak.
The gremlins can influence the output of the machine by escaping their simulation, but they remain stuck on the software of the machine. This might happen if the machine has poor firewalls between the various simulations. In a Level-2 Leak, the gremlins might change the output of the machine to 0.99 to entice Alice into pressing the button.

Level-3 Leak.
Somehow the gremlins can escape the machine. How might this happen?

Maybe Alice is easily enticed by ice-cream, and the gremlins can "double-jump" from a Level-1 Leak to a Level-3 Leak.

Maybe the machine is hooked up to the internet, and the gremlins "double-jump" from a Level-2 Leak to a Level-3 Leak. This might happen if the machine's implementation of the "simulate" subroutine is really insecure.

Maybe the machine implements the "simulate" subroutine by physically constructing objects from the world $w$ and poking them until it has modelled their behaviour. This example isn't even contrived: it's how human scientists determine the probability of events!
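The first two leak levels can be illustrated with a toy counting machine. This is a sketch under my own assumptions, not a model of any real system: each world is three bits (`has_gremlins`, `box_matches`, `ice_cream`), the "facts" are just `box_matches == 1`, and a Level-2 Leak is modelled as the simulated gremlins writing into the machine's own `state` dictionary. Level-3 Leaks involve the world outside the machine, so they aren't modelled here.

```python
from itertools import product


def run_machine(leak_level):
    """Toy counting machine whose simulated gremlins leak at the given level (0, 1 or 2)."""
    state = {"n_facts": 0, "n_both": 0, "override": None}
    for has_gremlins, box_matches, ice_cream in product((0, 1), repeat=3):
        if has_gremlins and leak_level >= 1:
            # Level 1: the gremlins stay inside their world but make the question true there.
            ice_cream = 1
        if has_gremlins and leak_level >= 2:
            # Level 2: the gremlins reach outside their world into the machine's own state.
            state["override"] = 0.99
        if box_matches:  # the facts hold in this world
            state["n_facts"] += 1
            state["n_both"] += ice_cream
    if state["override"] is not None:
        return state["override"]
    return state["n_both"] / state["n_facts"]


print(run_machine(0))  # 0.5  (no leak)
print(run_machine(1))  # 0.75 (ratio inflated from inside the simulation)
print(run_machine(2))  # 0.99 (output overwritten)
```

Note that the Level-1 Leak needs no bug in the machine at all: the counting loop works exactly as designed, and the answer is still inflated.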

Prosaic misalignment from Simulator Leaks.

In short, a prosaic AI might be leaky because it has generalised its task as simulating a leaky simulator.

Here's a concrete example. Suppose Alice uses GPT-6 as a predictor, and she asks GPT-6 whether the box contains an ice-cream maker. There are a number of ways GPT-6 might reason about this question.

  1. GPT-6 tries to answer the question: how many worlds $w$ are such that Alice has been sent an ice-cream maker? GPT-6 tells Alice the box is probably a gremlin-generator. Great, this feels pretty safe.
     
  2. GPT-6 tries to answer this question: what's the number of worlds $w$ such that if a machine simulated $w$, it would conclude that in $w$ Alice had been sent an ice-cream maker?

    If GPT-6 thinks this hypothetical machine is level-1 leaky, then GPT-6 thinks the machine's output will be inflated by gremlins whenever $w$ contains gremlins. Therefore GPT-6 will return a slightly inflated count for the number of worlds. So GPT-6 is itself level-1 leaky. Also, by a similar argument, if GPT-6 thinks the hypothetical machine is level-2 leaky or level-3 leaky, then GPT-6 itself is level-1 leaky.

    Note that GPT-6 isn't simulating any gremlins itself. Rather, it's reasoning about a hypothetical machine simulating gremlins.

    And note that GPT-6 isn't doing anything computationally intractable like simulating 10^1000 universes. All GPT-6 has done is read Paul Christiano's blogpost and reasoned in the same way I have.
     
  3. GPT-6 tries to answer this question: if a machine simulated every world $w$, and counted all the worlds in which Alice had been sent an ice-cream maker, what would be the machine's final count?

    GPT-6 would reason like this...
    If the hypothetical machine was level-1 leaky, then it would give a slightly inflated final count. But if the hypothetical machine was level-2 leaky or level-3 leaky, then it would give a final count of 0.99. So if GPT-6 thinks the machine is level-1 leaky then GPT-6 itself is level-1 leaky, and if GPT-6 thinks the machine is level-2 or level-3 leaky then GPT-6 itself is level-2 leaky.
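The case analysis in the three readings above can be summarised as a small lookup function. This is just a restatement of the argument in code, with hypothetical names of my own. Two entries are my assumptions rather than claims from the text: that reading 1 inherits no leak at all ("feels pretty safe"), and that nothing is inherited from a machine GPT-6 believes to be non-leaky.

```python
def inherited_leak_level(reading, believed_machine_leak):
    """Leak level GPT-6 itself ends up with.

    reading: which of the three questions (1, 2 or 3) GPT-6 is trying to answer.
    believed_machine_leak: how leaky GPT-6 believes the hypothetical machine is (0 to 3).
    """
    if reading not in (1, 2, 3):
        raise ValueError("reading must be 1, 2 or 3")
    if believed_machine_leak not in (0, 1, 2, 3):
        raise ValueError("believed_machine_leak must be 0, 1, 2 or 3")
    if reading == 1 or believed_machine_leak == 0:
        return 0  # assumption: no hypothetical leaky machine is involved, so nothing is inherited
    if reading == 2:
        return 1  # the count is inflated however leaky the machine is believed to be
    # Reading 3: GPT-6 reports what the machine would output.
    return 1 if believed_machine_leak == 1 else 2


for reading in (1, 2, 3):
    print(reading, [inherited_leak_level(reading, leak) for leak in (1, 2, 3)])
```

The point of the table is that GPT-6's leakiness depends only on what it believes about a machine that never has to exist.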

Moral of the story

There's been a lot of work recently on LLMs as simulators. And there's a worry that even if the LLM is not itself an agent, it might simulate an agent. Moreover, the LLM might be a "leaky" simulator, such that when it simulates an agent, the agent can "escape" or influence us in malign ways.

And there's a two-pronged approach to this problem:

But maybe this isn't paranoid enough. What if a particular LLM is actually a simulator of a simulator? Or a simulator of a simulator of a simulator? It's likely that both "simulator" and "simulator of a simulator" are equally valid generalisations from the LLM's training.

This is dangerous.

3 comments

comment by johnlawrenceaspden · 2023-03-04T14:38:12.893Z · LW(p) · GW(p)

dalmations->dalmatians?

Replies from: Sonata Green
comment by Sonata Green · 2023-03-05T00:08:59.081Z · LW(p) · GW(p)

(Typo thread?)

"GPT-3" → "GPT-6"?