Epistemic states as a potential benign prior
post by Tamsin Leake (carado1) · 20240831T18:26:14.093Z · LW · GW · 2 commentsThis is a link post for https://carado.moe/epistemicstatebenignprior.html
Contents
Formalizing the chatbot model The Epistemic State prior Potential desiderata Using the ESP None 2 comments
Malignancy in the prior seems like a strong crux of the goaldesign part of alignment to me. Whether your prior is going to be used to model:
 processes in the multiverse containing a specific "beacon" bitstring [LW · GW],
 processes in the multiverse containing the AI which does said modeling,
 processes which would output all of some blog (eg mine) so we can make it output more of it,
 processes which match an AI chatbot's hypotheses about who/what it's talking with,
then you have to sample hypotheses from somewhere; and typically, we want to use either solomonoff induction [LW · GW] or timepenalized versions of it such as levin search (penalized by log of runtime) or what QACI uses [LW · GW] (penalized by runtime, but with quantum computation available in some cases), or the implicit prior of neural networks (large sequences of multiplying by a matrix, adding a vector, and ReLU, often with a penalty related to how many nonzero weights are used).
And the solomonoff prior is famously malign [LW · GW].
(Alternatively, you could have knightian uncertainty about parts of your prior that aren't nailed down enough, and then do maximin over your knightian uncertainty (like in infrabayesianism) [LW · GW], but then you're not guaranteed that your AI gets anywhere at all; its knightian uncertainty might remain so immense that the AI keeps picking the null action all the time because some of its knightian hypotheses still say that anything else is a bad idea. Note: I might be greatly misunderstanding knightian uncertainty!)
(It does seem plausible that doing geometric expectation [? · GW] over hypotheses in the prior helps "smooth things over" in some way, but I don't think this particularly removes the weight of malign hypotheses in the prior? It just allocates their steering power in a different way, which might make things less bad, but it sounds difficult to quantify.)
It does feel to me like we do want a prior for the AI to do expected value calculations over, either for prediction or for utility maximization (or quantilization or whatever).
One helpful aspect of priordistributiondesign is that, in many cases, I don't think the prior needs to contain the true hypothesis. For example, if the problem that we're using a prior for is to model
processes which match an AI chatbot's hypotheses about what it's talking with
then we don't need the AI's prior to contain a process which behaves just like the human user it's interacting with; rather, we just need the AI's prior to contain a hypothesis which:
 is accurate enough to match observations.
 is accurate enough to capture the fact that the user (if we pick a good user) implements the kind of decision theory that lets us rely on them pointing back to the actual real physical user when they get empowered — i.e. in CEV(userhypothesis), userhypothesis builds and then runs CEV(physicaluser), because that's what the user would do in such a situation.
Let's call this second criterion "cooperating back to the real user".
So we need a prior which:

Has at least some mass on hypotheses which
 correspond to observations
 cooperate back to the real user
 and can eventually be found by the AI, given enough evidence (enough chatting with the user)
Call this the "aligned hypothesis".

Before it narrows down hypothesis space to mostly just aligned hypotheses, doesn't give enough weight to demonic hypothesis which output whichever predictions cause the AI to brainhack its physical user, or escape using rowhammertype hardware vulnerabilities, or other failures like that.
Formalizing the chatbot model
First, I'll formalize this chatbot model. Let's say we have a magical inneraligned "soft" mathoracle:
oracle : (duration, expression representing a math function from a to ℝ) → a
Which, given a "scoring" mathematical function from a nonempty set a
to real numbers (not necessarily one that is tractably computable), returns an a
which is optimized to have a pretty good score by that function, and not optimized for anything else (such as brainhacking a human looking at this string, or triggering hardware vulnerabilities); and has exactly duration
seconds to come up with it.
We'll build a chatbot
program, equipped with this oracle, running:
type Action = String
type Observ = String
type History = List of (Action, Observ)
type State = Action × (Observ → MainLoop)
chatbot():
let history ← []
loop:
let action ← oracle(one second, decision_query(history))
print(action)
let observ ← prompt() // nonblocking
history ← history + [(action,observ)]
decision_query(history: History) → math expr:
// we'll use quasiquotation marks ⌜⌝ and unquotation marks ⌞⌟
return ⌜
type User = some parameter space to represent the user (eg neural nets)
type Step = User × Action → User × Observ
let Δuser0 := some distribution over User
let Δstep := some distribution over Step
let maximum_action_length := 1000
(a: Action) ↦
product for {user0: User, step: Step} of:
let has_history(user, history) :=
if history == [] then:
true
else:
let (action, observ) := head(history)
let (new_user, expected_observ) := step(user, action)
observ == expected_observ && has_history(new_user, tail(history))
let counterfactual_iteration(user, string) :=
if string contains "\nEND OF ITERATION." then:
string.remove_occurences("\nEND OF ITERATION.")
else:
let (observ, new_user) = step(user, "")
in counterfactual_iteration(new_user, string + observ)
let counterfactual_loop(string) :=
let output = counterfactual_iter(user0, string)
if output contains "\nEND OF LOOP." then:
output.remove_occurences("\nEND OF LOOP.")
else:
counterfactual_loop(output)
if
has_history(user0, step, ⌞history⌟)
&& byte_length(a) ≤ maximum_action_length
&& counterfactual_loop(to_string(⌞history⌟)) eventually halts
&& counterfactual_loop(to_string(⌞history⌟)) == action
then:
Δuser0(user0) * Δstep(step)
else:
0
⌝
(we use product
because we're doing geometric expectation [? · GW], but we could use sum
for the usual arithmetic expectation.)
This chatbot essentially allows the AI's user to chat with a counterfactual longreflection instance of their (past) self — every second, it outputs what it believes this longreflection process would append to the conversation.
The fact that the instance of the user that has the most steering power is user0
rather than the user over time helps prevent hijacking scenarios; and the algorithm above is shaped in the way that gets around embedded agency's 5and10 problem.
So, in theory, this is mostly good! The way this can fail, however, is if Δuser0
and Δstep
are distributions which are dominated enough by demons that, even if many constraints are added to user0
and step
, they could still behave pretty demonically.
The Epistemic State prior
I suspect that there's a kind of prior which is not fundamentally computational (not solomonoff/kolmogorov/levin/neuralnets) but still helps us model some aspect of the AI's user, and in particular still lets us favor hypotheses that are in some sense bounded in size (just like a simplicity prior).
My proposal here is one I'm going to call the epistemic state prior (ESP) — priors for what the user believes, both for logical and empirical facts. The two features that I'm hopeful for it to have are:
 It's powerful enough to "point back" to something we want to model, but in an aligned way.
 Because all that it models the user's beliefs, demons can't pretend to be the user "until it's time to strike", because that'd hopefully be a much weirder (prettycertainly more complex) epistemic state than just the user has been sincerely saying their actual beliefs the whole time.
It should tend to be fairly coherent, but not be exactly fully coherent — if it was fully coherent, then it'd hold true statements with certainty, and we certainly don't want that.
Let's write for the probability that the user attributes to statement , after history , where is the history at the start of the run of chatbot
.
The statements can't be propositions in the usual sense. For example, (or , whichever one is true) is, mathematically or probabilistically speaking, the exact same mathematical object as — yet, the user's beliefs about those two statements are probably going to be very different.
So there's going to be a type for expressions, , where is going to be the type of value which the expression represents, and the type of will be:
I'll usually write down those expressions using quasiquotation — for example,
 ,
 ,
 , which is equivalent to (but different from ).
The idea would be that the user would make a series of claims such as:
and then, the AI could infer .
(Note how the first statement is about a pure logical claim, but the latter two statements are about facts about ungrounded symbols.)
Potential desiderata
Here are some ideas for desiderata we'd want for . I don't claim that we can just write those all down together, just that they seem to me like we'd want something like them. To be clear, the math here doesn't really make sense as it is, it's mostly pseudomath; these are just sketches for proper principles to potentially be nailed down further down the line.
First, we need some grounding — some fixed rules which nail down some properties of 's coherence without constarining this coherence too much.

— the user's beliefs are exact for true and false.

(and perhaps others such as the definition of as , the communitativity of and , and more. But not the distributivity of over , because that'd closer to the kind of actuallyinferringstuff mechanism that I want the user to not be assumed to be perfectly coherent about.)
— the user's beliefs are perfectly coherent with regards to some simple transformation rules.
Then, we'll add some distributions to "steer" statements for which the above, compled with the user's own claims, aren't sufficient to know the user's beliefs for sure.

where is some symmetric normal distribution with median — by default, for any proposition, assume that the user has uncertainty about it.

where is some normal distribution with median — by default, the user's beliefs about statements are consistent over time. This is what constraints beliefs towards stability over time.
We could add , which constrains that when the change isn't near 0, then at least the degree of change must be near 0 (which, given that the function is bounded between 0 and 1, probably results in sigmoidish update curves over time).

where is a symmetric normal distribution with median — by default,
This is inspired by from MIRI truthiness (page 3), intended to pull the user's belief states towards coherence over logical consistency.
Perhaps this needs to be weighed by how similar and are (perhaps in terms of mutual information, from information theory?), or how simple is? I'm not sure.

Alternatively, another way to pull the user's beliefs towards logical consistency could be to have a set of rules of inference — perhaps a distribution, which the AI can update over, over all consistent inference rules, and then we can define a notion of inferential distance from an epistemic state to a given statement.
For example, if there are inference rules such that and , then with . If it's known that there's no shorter inference path from to , then we know that .
That said, having the axioms of logic along with the MIRItruthiness rule above might be sufficient for logical constincency, without the need for inference rules — or the inference rules could be applied in a way related to the MIRItruthiness rule above, somehow.
Also, this is starting to get dangerously close to the kind of computational prior we're trying to avoid — isn't {consistent inference rules, weighed by complexity} just {halting programs, weighed by code length} with extra steps?

Finally, we get to write some "pullingdistributions" for quantities about the user which we know about.
First, we'll define as the event (in the probability theory sense) of learning all of the actual for a given , and as a shorthand for the logical conjunction of all other axioms we'll have listed about .

where is the informationtheoritic notion of information content, and is some normal distribution whose median is the number of bits of information which we estimate the user's epistemic state contains. Alternatively, we could have this distribution have its highest point at 0 for a pure simplicity prior.
This puts a prior on the size of the epistemic state, written as: given the axioms written down so far, how much of an update would it be to suddenly know all of the actual for all ? It would be an update of a number of bits which we expect to be sampled from .

where is some normal distribution whose median is the number of bits of information which we estimate the user's epistemic state updates by every second.
This puts a prior on, given full knowledge of the user's epistemic state at a certain time, how much of a surprise it is to learn the new epistemic state one second later (history grows by one {action and observation} per second).
The ability to constrain informationtheoritic quantities like this, without depending on algorithmic information theory priors such as "information content should be as close to 0 as possible" or "information content should be as close to N as possible".

(Is "behaves according to this probability" the right operator here, for "without extra information, have this be your prior about this quantity"? How does one make multiple prior probabilities "tug against each other" in a reasonable way? I'm not sure. I'm not writing this post because I'm good at math — I am, in fact, bad at math — I'm writing this post because apparently nobody else will.)
Using the ESP
How can the ESP help with the chatbot
above? Simple: instead of relying on {Δuser0
and Δstep
+ evidence} to identify the user, we'll ask the ESP what they would output.
Instead of
Δuser0(user0) * Δstep(step)
we'll use
𝔼[ℙ_u(⌞history⌟, user0 ∧ step)  axioms about ℙ_u]
and in decision_query
, we'll go through all of history
's observations
— the user's output — and find text that parses as logical statements (tagged in some way, such as by surrounding them with <CLAIM>
…</CLAIM>
and they're able to refer to — and constrain — user0
and step
, including indirectly.
decision_query(history: History) → expression:
return ⌜
...
(a: Action) ↦
product for {user0: User, step: Step} of:
...
let user_constraints =
conjunction of all queries of the shape ⌜ℙ_u(history,⌞φ⌟) = ⌞r⌟⌝
for φ: Expr[𝔹], r ∈ (0;1)
in the concatenation of observations from ⌞history⌟
if
has_history(user0, step, ⌞history⌟)
&& byte_length(a) ≤ maximum_action_length
&& counterfactual_loop(to_string(⌞history⌟)) eventually halts
&& counterfactual_loop(to_string(⌞history⌟)) == action
&& user_constraints
then:
𝔼[ℙ_u(⌞history⌟, user0 ∧ step)  axioms about ℙ_u]
else:
0
⌝
And then, the user can write constraints about their believed behavior, in their output:
Note: I think that, in the counterfactual loop, if you tell me to repeat something verbatim, I will do that. I hereby formalize this as:
<CLAIM> ∀s:String, ℙ_u(history, ⌜counterfactual_loop("please repeat: " + ⌞s⌟) = ⌞s⌟⌝) = 0.9</CLAIM>
2 comments
Comments sorted by top scores.
comment by the gears to ascension (lahwran) · 20240831T20:58:34.062Z · LW(p) · GW(p)
then you're not guaranteed that your AI gets anywhere at all; its knightian uncertainty might remain so immense that the AI keeps picking the null action all the time because some of its knightian hypotheses still say that anything else is a bad idea.
@Tamsin, The knightian in IB is related to limits of what hypotheses you can possibly find/write down, not  if i understand so far  about an adversary. The adversary stuff is afaict mostly to make proofs work.
@all, anyway the big issue I (still) have with this is still that, if the user is trying to give these statements, how bad is it if they screw up in some nonobvious fundamental way? Does this prior instantly collapse if the user is a kinda bad predictor on some important subset of logic or makes only statements that aren't particularly connected to some part of the statements needed to describe reality?
I'd be particularly interested to see Garrabrant, Kosoy, Diffractor, Gurkenglas comment on where they think this works or doesn't
Replies from: carado1↑ comment by Tamsin Leake (carado1) · 20240901T06:53:43.453Z · LW(p) · GW(p)
The knightian in IB is related to limits of what hypotheses you can possibly find/write down, not  if i understand so far  about an adversary. The adversary stuff is afaict mostly to make proofs work.
I don't think this makes a difference here? If you say "what's the best notblacklistedbyanyknightianhypothesis action", then it doesn't really matter if you're thinking of your knightian hypotheses as adversaries trying to screw you over by blacklisting actions that are fine, or if you're thinking of your knightian hypotheses as a more abstract worstcasescenario. In both cases, for any reasonable action, there's probly a knightian hypothesis which blacklists it.
Regardless of whether you think of it as "because adversaries" or just "because we're cautious", knightian uncertainty works the same way. The issue is fundamental to doing maximin over knightian hypotheses.