Epistemic states as a potential benign prior

post by Tamsin Leake (carado-1) · 2024-08-31T18:26:14.093Z · LW · GW · 2 comments

This is a link post for https://carado.moe/epistemic-state-benign-prior.html

Contents

  Formalizing the chatbot model
  The Epistemic State prior
  Potential desiderata
  Using the ESP
2 comments

Malignancy in the prior seems like a strong crux of the goal-design part of alignment to me. Whatever your prior is going to be used to model, you have to sample hypotheses from somewhere; and typically, we want to use either solomonoff induction [LW · GW], or time-penalized versions of it such as levin search (penalized by log of runtime) or what QACI uses [LW · GW] (penalized by runtime, but with quantum computation available in some cases), or the implicit prior of neural networks (long sequences of multiplying by a matrix, adding a vector, and applying a ReLU, often with a penalty related to how many non-zero weights are used).
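As a rough sketch of what those runtime penalties amount to (my own gloss, not the post's formalism; the function names and the exact form of the "penalized by runtime" variant are assumptions), a hypothesis's prior weight might look like:

import math

def levin_style_weight(program_length_bits: int, runtime_steps: int) -> float:
    # Levin-search-style weight: penalize both description length and the log
    # of runtime, so that 2 ** -(|p| + log2(t)) == 2 ** -|p| / t.
    return 2.0 ** -(program_length_bits + math.log2(runtime_steps))

def runtime_penalized_weight(program_length_bits: int, runtime_steps: int) -> float:
    # One way to read "penalized by runtime" rather than by its log:
    # 2 ** -(|p| + t).  (Assumption; QACI's actual construction is more involved.)
    return 2.0 ** -(program_length_bits + runtime_steps)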

And the solomonoff prior is famously malign [LW · GW].

(Alternatively, you could have knightian uncertainty about parts of your prior that aren't nailed down enough, and then do maximin over your knightian uncertainty (like in infra-bayesianism) [LW · GW], but then you're not guaranteed that your AI gets anywhere at all; its knightian uncertainty might remain so immense that the AI keeps picking the null action all the time because some of its knightian hypotheses still say that anything else is a bad idea. Note: I might be greatly misunderstanding knightian uncertainty!)

(It does seem plausible that doing geometric expectation [? · GW] over hypotheses in the prior helps "smooth things over" in some way, but I don't think this particularly removes the weight of malign hypotheses in the prior? It just allocates their steering power in a different way, which might make things less bad, but it sounds difficult to quantify.)

It does feel to me like we do want a prior for the AI to do expected value calculations over, either for prediction or for utility maximization (or quantilization or whatever).

One helpful aspect of prior-distribution-design is that, in many cases, I don't think the prior needs to contain the true hypothesis. For example, if the problem that we're using a prior for is to model

processes which match an AI chatbot's hypotheses about what it's talking with

then we don't need the AI's prior to contain a process which behaves just like the human user it's interacting with; rather, we just need the AI's prior to contain a hypothesis which predicts the user well enough to be useful, and which, to the extent that it does differ from the real user, differs in ways that still defer back to the real user.

Let's call this second criterion "cooperating back to the real user".

So we need a prior which:

Formalizing the chatbot model

First, I'll formalize this chatbot model. Let's say we have a magical inner-aligned "soft" math-oracle:

oracle : (duration, expression representing a math function from a to ℝ) → a

Which, given a "scoring" mathematical function from a non-empty set a to real numbers (not necessarily one that is tractably computable), returns an a which is optimized to have a pretty good score by that function, and not optimized for anything else (such as brainhacking a human looking at this string, or triggering hardware vulnerabilities); and has exactly duration seconds to come up with it.
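As a type-level sketch only (the name and signature below are my own rendering; nothing here implements the inner-alignment magic), the oracle's interface could look like:

from typing import Callable, TypeVar

A = TypeVar("A")

def oracle(duration_seconds: float, score: Callable[[A], float]) -> A:
    """Hypothetical stub for the magical inner-aligned soft math-oracle.

    Given a time budget and a scoring function over some non-empty space A
    (in the post the score is a mathematical expression, not necessarily
    tractably computable), return an element of A optimized to score well
    and optimized for nothing else.  Deliberately left unimplemented.
    """
    raise NotImplementedError("this is the part we assume we magically have")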

We'll build a chatbot program, equipped with this oracle, running:

type Action  = String
type Observ  = String
type History = List of (Action, Observ)

type State   = Action × (Observ → MainLoop)

chatbot():
  let history ← []

  loop:
    let action ← oracle(one second, decision_query(history))
    print(action)
    let observ ← prompt() // non-blocking
    history ← history + [(action,observ)]
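The same loop, rendered as a minimal Python sketch (assuming the oracle stub above, treating decision_query(history) as directly returning the scoring function, and approximating the non-blocking prompt with a plain blocking input):

from typing import List, Tuple

Action = str
Observ = str
History = List[Tuple[Action, Observ]]

def chatbot() -> None:
    history: History = []
    while True:
        # Score candidate actions with the decision query over the current
        # history, giving the oracle a one-second budget.
        action = oracle(1.0, decision_query(history))
        print(action)
        observ = input()  # blocking stand-in for the post's non-blocking prompt()
        history.append((action, observ))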

decision_query(history: History) → math expr:
  // we'll use quasiquotation marks ⌜⌝ and unquotation marks ⌞⌟
  return ⌜
    type User = some parameter space to represent the user (eg neural nets)
    type Step = User × Action → User × Observ

    let Δuser0 := some distribution over User
    let Δstep  := some distribution over Step
    let maximum_action_length := 1000

    (a: Action) ↦
      product for {user0: User, step: Step} of:
        let has_history(user, step, history) :=
          if history == [] then:
            true
          else:
            let (action, observ) := head(history)
            let (new_user, expected_observ) := step(user, action)
            observ == expected_observ && has_history(new_user, step, tail(history))

        let counterfactual_iteration(user, string) :=
          if string contains "\nEND OF ITERATION." then:
            string.remove_occurrences("\nEND OF ITERATION.")
          else:
            let (new_user, observ) := step(user, "")
            in counterfactual_iteration(new_user, string + observ)

        let counterfactual_loop(string) :=
          let output := counterfactual_iteration(user0, string)

          if output contains "\nEND OF LOOP." then:
            output.remove_occurrences("\nEND OF LOOP.")
          else:
            counterfactual_loop(output)

        if 
             has_history(user0, step, ⌞history⌟)
          && byte_length(a) ≤ maximum_action_length
          && counterfactual_loop(to_string(⌞history⌟)) eventually halts
          && counterfactual_loop(to_string(⌞history⌟)) == a
        then:
          Δuser0(user0) * Δstep(step)
        else:
          0
  ⌝

(we use product because we're doing geometric expectation [? · GW], but we could use sum for the usual arithmetic expectation.)
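For background, here are the standard definitions of the two aggregation rules being contrasted (nothing specific to the construction above):

import math
from typing import List, Tuple

def arithmetic_expectation(weighted_scores: List[Tuple[float, float]]) -> float:
    # The usual expectation: sum of weight * score.
    return sum(w * s for w, s in weighted_scores)

def geometric_expectation(weighted_scores: List[Tuple[float, float]]) -> float:
    # Product of score ** weight, i.e. exp(sum of weight * log score); it drops
    # to 0 as soon as any positively-weighted score is 0, which is what gives
    # multiplicative aggregation its veto-like character.
    return math.prod(s ** w for w, s in weighted_scores)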

This chatbot essentially allows the AI's user to chat with a counterfactual long-reflection instance of their (past) self — every second, it outputs what it believes this long-reflection process would append to the conversation.

The fact that the instance of the user with the most steering power is user0, rather than the user as it evolves over time, helps prevent hijacking scenarios; and the algorithm above is shaped in a way that gets around embedded agency's 5-and-10 problem.

So, in theory, this is mostly good! The way this can fail, however, is if Δuser0 and Δstep are distributions which are dominated enough by demons that, even if many constraints are added to user0 and step, they could still behave pretty demonically.

The Epistemic State prior

I suspect that there's a kind of prior which is not fundamentally computational (not solomonoff/kolmogorov/levin/neural-nets) but still helps us model some aspect of the AI's user, and in particular still lets us favor hypotheses that are in some sense bounded in size (just like a simplicity prior).

My proposal here is one I'm going to call the epistemic state prior (ESP) — priors for what the user believes, both for logical and empirical facts. The two features that I'm hopeful for it to have are:

It should tend to be fairly coherent, but not be exactly fully coherent — if it were fully coherent, then it'd hold all true statements with certainty, and we certainly don't want that.

Let's write ℙ_u(h, φ) for the probability that the user attributes to statement φ, after history h, where h₀ is the history at the start of the run of chatbot.

The statements can't be propositions in the usual sense. For example, a true-but-unproven mathematical conjecture (or its negation, whichever one is true) is, mathematically or probabilistically speaking, the exact same mathematical object as a trivial truth like ⌜1 + 1 = 2⌝ — yet, the user's beliefs about those two statements are probably going to be very different.

So there's going to be a type for expressions, Expr[T], where T is going to be the type of value which the expression represents, and the type of ℙ_u will be:

ℙ_u : History × Expr[𝔹] → (0;1)

I'll usually write down those expressions using quasiquotation — for example, ⌜1 + 1 = 2⌝ is an expression of type Expr[𝔹].

The idea would be that the user would make a series of claims such as:

and then, the AI could infer the corresponding constraints on ℙ_u.

(Note how the first statement is about a pure logical claim, but the latter two statements are about facts about ungrounded symbols.)
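In code, the intended types might look roughly like this (my own rendering with made-up names; the post only pins down the signature of ℙ_u):

from dataclasses import dataclass
from typing import Callable, Generic, List, Tuple, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class Expr(Generic[T]):
    # A syntactic object (here just its source text) tagged with the type T of
    # value it denotes; ⌜1 + 1 = 2⌝ would be an Expr[bool].
    source: str

History = List[Tuple[str, str]]  # (action, observation) pairs

# The user's credence function ℙ_u: given the conversation so far and a boolean
# expression, the probability the user assigns to that expression.
CredenceFunction = Callable[[History, Expr[bool]], float]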

Potential desiderata

Here are some ideas for desiderata we'd want for ℙ_u. I don't claim that we can just write them all down together, just that they seem to me like we'd want something like them. To be clear, the math here doesn't really make sense as it is; it's mostly pseudomath, and these are just sketches of proper principles to potentially be nailed down further down the line.

First, we need some grounding — some fixed rules which nail down some properties of ℙ_u's coherence without constraining this coherence too much.

Then, we'll add some distributions to "steer" statements for which the above, coupled with the user's own claims, aren't sufficient to know the user's beliefs for sure.

(Is "behaves according to this probability" the right operator here, for "without extra information, have this be your prior about this quantity"? How does one make multiple prior probabilities "tug against each other" in a reasonable way? I'm not sure. I'm not writing this post because I'm good at math — I am, in fact, bad at math — I'm writing this post because apparently nobody else will.)
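As a toy illustration only (these specific rules are made up for this sketch; they are not the post's desiderata), "grounding plus steering" could bottom out in something like one hard coherence check and one soft score over a candidate epistemic state:

from typing import Dict

# A candidate epistemic state: credence assigned to the source text of each
# boolean expression the user has an opinion about.
EpistemicState = Dict[str, float]

def respects_grounding(p: EpistemicState, tolerance: float = 0.05) -> bool:
    # Hypothetical grounding rule: wherever both φ and ¬φ appear, their
    # credences should sum to roughly 1 (roughly rather than exactly, since
    # the post explicitly does not want full coherence).
    for phi, credence in p.items():
        neg = f"¬({phi})"
        if neg in p and abs(credence + p[neg] - 1.0) > tolerance:
            return False
    return True

def steering_score(p: EpistemicState, default: float = 0.5) -> float:
    # Hypothetical soft steering: penalize squared deviation from a default
    # credence, nudging otherwise-unconstrained statements toward it without
    # forcing them onto it.
    return -sum((credence - default) ** 2 for credence in p.values())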

Using the ESP

How can the ESP help with the chatbot above? Simple: instead of relying on {Δuser0 and Δstep + evidence} to identify the user, we'll ask the ESP, weighting each (user0, step) hypothesis by the credence the user themself would put in it.

Instead of

Δuser0(user0) * Δstep(step)

we'll use

𝔼[ℙ_u(⌞history⌟, user0 ∧ step) | axioms about ℙ_u]

and in decision_query, we'll go through all of history's observations (the user's outputs) and find text that parses as logical statements, tagged in some way such as by surrounding them with <CLAIM></CLAIM>; these statements are able to refer to, and constrain, user0 and step, including indirectly.

decision_query(history: History) → expression:
  return ⌜

    ...

    (a: Action) ↦
      product for {user0: User, step: Step} of:
        ...

        let user_constraints =
          conjunction of all queries of the shape ⌜ℙ_u(history,⌞φ⌟) = ⌞r⌟⌝
          for φ: Expr[𝔹], r ∈ (0;1)
          in the concatenation of observations from ⌞history⌟

        if 
             has_history(user0, step, ⌞history⌟)
          && byte_length(a) ≤ maximum_action_length
          && counterfactual_loop(to_string(⌞history⌟)) eventually halts
          && counterfactual_loop(to_string(⌞history⌟)) == a
          && user_constraints
        then:
          𝔼[ℙ_u(⌞history⌟, user0 ∧ step) | axioms about ℙ_u]
        else:
          0
  ⌝

And then, the user can write constraints about their believed behavior, in their output:

Note: I think that, in the counterfactual loop, if you tell me to repeat something verbatim, I will do that. I hereby formalize this as:

<CLAIM> ∀s:String, ℙ_u(history, ⌜counterfactual_loop("please repeat: " + ⌞s⌟) = ⌞s⌟⌝) = 0.9</CLAIM>
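A rough sketch of how such tagged claims could be pulled out of the observation text (my own illustration; the post fixes nothing beyond the <CLAIM></CLAIM> tags themselves):

import re
from typing import List, Tuple

Action = str
Observ = str
History = List[Tuple[Action, Observ]]

CLAIM_PATTERN = re.compile(r"<CLAIM>(.*?)</CLAIM>", re.DOTALL)

def extract_claims(history: History) -> List[str]:
    # Concatenate the observations (the user's side of the conversation) and
    # collect everything wrapped in <CLAIM></CLAIM> tags; actually parsing a
    # claim into a constraint on ℙ_u is left out of this sketch.
    text = "".join(observ for _action, observ in history)
    return [claim.strip() for claim in CLAIM_PATTERN.findall(text)]

# With the example claim above:
example: History = [
    ("(chatbot output)",
     '<CLAIM> ∀s:String, ℙ_u(history, ⌜counterfactual_loop("please repeat: " '
     '+ ⌞s⌟) = ⌞s⌟⌝) = 0.9</CLAIM>'),
]
print(extract_claims(example))  # ['∀s:String, ℙ_u(history, ...) = 0.9']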

2 comments

Comments sorted by top scores.

comment by the gears to ascension (lahwran) · 2024-08-31T20:58:34.062Z · LW(p) · GW(p)

then you're not guaranteed that your AI gets anywhere at all; its knightian uncertainty might remain so immense that the AI keeps picking the null action all the time because some of its knightian hypotheses still say that anything else is a bad idea.

@Tamsin, The knightian in IB is related to limits of what hypotheses you can possibly find/write down, not - if i understand so far - about an adversary. The adversary stuff is afaict mostly to make proofs work.

@all, anyway the big issue I (still) have with this is still that, if the user is trying to give these statements, how bad is it if they screw up in some nonobvious fundamental way? Does this prior instantly collapse if the user is a kinda bad predictor on some important subset of logic or makes only statements that aren't particularly connected to some part of the statements needed to describe reality?

I'd be particularly interested to see Garrabrant, Kosoy, Diffractor, Gurkenglas comment on where they think this works or doesn't

Replies from: carado-1
comment by Tamsin Leake (carado-1) · 2024-09-01T06:53:43.453Z · LW(p) · GW(p)

The knightian in IB is related to limits of what hypotheses you can possibly find/write down, not - if i understand so far - about an adversary. The adversary stuff is afaict mostly to make proofs work.

I don't think this makes a difference here? If you say "what's the best not-blacklisted-by-any-knightian-hypothesis action", then it doesn't really matter if you're thinking of your knightian hypotheses as adversaries trying to screw you over by blacklisting actions that are fine, or if you're thinking of your knightian hypotheses as a more abstract worst-case-scenario. In both cases, for any reasonable action, there's probly a knightian hypothesis which blacklists it.

Regardless of whether you think of it as "because adversaries" or just "because we're cautious", knightian uncertainty works the same way. The issue is fundamental to doing maximin over knightian hypotheses.