# Reflection in Probabilistic Logic

post by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-03-24T16:37:36.197Z · score: 69 (67 votes) · LW · GW · Legacy · 171 commentsPaul Christiano has devised **a new fundamental approach** to the "Löb Problem" wherein Löb's Theorem seems to pose an obstacle to AIs building successor AIs, or adopting successor versions of their own code, that trust the same amount of mathematics as the original. (I am currently writing up a more thorough description of the *question *this preliminary technical report is working on answering. For now the main online description is in a quick Summit talk I gave. See also Benja Fallenstein's description of the problem in the course of presenting a different angle of attack. Roughly the problem is that mathematical systems can only prove the soundness of, aka 'trust', weaker mathematical systems. If you try to write out an exact description of how AIs would build their successors or successor versions of their code in the most obvious way, it looks like the mathematical strength of the proof system would tend to be stepped down each time, which is undesirable.)

Paul Christiano's approach is inspired by the idea that whereof one cannot prove or disprove, thereof one must assign probabilities: and that although no mathematical system can contain its own *truth* predicate, a mathematical system might be able to contain a reflectively consistent *probability* predicate. In particular, it looks like we can have:

∀a, b: (a < P(φ) < b) ⇒ P(a < P('φ') < b) = 1

∀a, b: P(a ≤ P('φ') ≤ b) > 0 ⇒ a ≤ P(φ) ≤ b

Suppose I present you with the human and probabilistic version of a Gödel sentence, the Whitely sentence "You assign this statement a probability less than 30%." If you disbelieve this statement, it is true. If you believe it, it is false. If you assign 30% probability to it, it is false. If you assign 29% probability to it, it is true.

Paul's approach resolves this problem by restricting your belief about your own probability assignment to within epsilon of 30% for any epsilon. So Paul's approach replies, "Well, I assign *almost* exactly 30% probability to that statement - maybe a little more, maybe a little less - in fact I think there's about a 30% chance that I'm a tiny bit under 0.3 probability and a 70% chance that I'm a tiny bit over 0.3 probability." A standard fixed-point theorem then implies that a consistent assignment like this should exist. If asked if the probability is over 0.2999 or under 0.30001 you will reply with a definite yes.

We haven't yet worked out a walkthrough showing if/how this solves the Löb obstacle to self-modification, and the probabilistic theory itself is nonconstructive (we've shown that something like this should exist, but not how to compute it). Even so, a possible fundamental triumph over Tarski's theorem on the undefinability of truth and a number of standard Gödelian limitations is important news as math *qua* math, though work here is still in very preliminary stages. There are even whispers of unrestricted comprehension in a probabilistic version of set theory with ∀φ: ∃S: P(x ∈ S) = P(φ(x)), though this part is not in the preliminary report and is at even earlier stages and could easily not work out at all.

It seems important to remark on how this result was developed: Paul Christiano showed up with the idea (of consistent probabilistic reflection via a fixed-point theorem) to a week-long "math squad" (aka MIRI Workshop) with Marcello Herreshoff, Mihaly Barasz, and myself; then we all spent the next week proving that version after version of Paul's idea couldn't work or wouldn't yield self-modifying AI; until finally, a day after the workshop was supposed to end, it produced something that looked like it might work. If we hadn't been trying to *solve *this problem (with hope stemming from how it seemed like the sort of thing a reflective rational agent ought to be able to do somehow), this would be just another batch of impossibility results in the math literature. I remark on this because it may help demonstrate that Friendly AI is a productive approach to math *qua *math, which may aid some mathematician in becoming interested.

I further note that this does not mean the Löbian obstacle is resolved and no further work is required. Before we can conclude that we need a computably specified version of the theory plus a walkthrough for a self-modifying agent using it.

See also the blog post on the MIRI site (and subscribe to MIRI's newsletter here to keep abreast of research updates).

This LW post is the preferred place for feedback on the paper.

EDIT: But see discussion on a Google+ post by John Baez here. Also see here for how to display math LaTeX in comments.

## 171 comments

Comments sorted by top scores.

I'm *very* happy both to see this result and the fact that prominent mathematicians such as John Baez and Timothy Gowers seem to be taking an interest in it. It considerably increases my confidence in that

- Current MIRI-folk will be capable of producing original, valuable, solid math results
- By producing such results, they will become capable of attracting more talented mathematicians and producing even more such results

My confidence in MIRI being successful in its mission, while still pretty low, is now much higher than it was a few days ago.

John Baez has posted about the draft on Google+.

The main theorem of the paper (Theo. 2) does not seem to me to accomplish the goals stated in the introduction of the paper. I think that it might sneakily introduce a meta-language and that this is what "solves" the problem.

What I find unsatisfactory is that the assignment of probabilities to sentences is not shown to be definable in L. This might be too much to ask, but if nothing of the kind is required, the reflection principles lack teeth. In particular, I would guess that Theo. 2 as stated is trivial, in the sense that you can simply take to only have value 0 or 1. Note, that the third reflection principle imposes no constraint on the value of on sentences of L' \ L.

Your application of the diagonalisation argument to refute the reflection scheme (2) also seems suspect, since the scheme only quantifies over sentences of L and you apply it to a sentence G which might not be in L. To be exact, you do not claim that it refutes the scheme, only that it seems to refute it.

You are completely right, and thanks for the correction! A new version with this problem fixed should be up in a bit.

(The actual proof of theorem 2 establishes the correct thing, there is just a ' missing from the statement.)

Actually, I believe you've found a bug in the paper which everyone else seems to have missed so far, but fortunately it's just a typo in the statement of the theorem! The quantification should be over L', not over L, and the proof does prove this much stronger statement. The statement in the paper is indeed trivial for the reason you say.

Given the stronger statement, the reason you can't just have P have value 0 or 1 is sentences like G <=> P('G') < 0.5: if P(G) = 0, by reflection it would follow that P(G) = 1, and if P(G) = 1, then P(not G) = 0, and by reflection it would follow that P(not G) = 1.

The Kakutani fixed point theorem that this result relies on is Corollary 17.55 in Infinite Dimensional Analysis - A Hitchhiker's Guide by Aliprantis and Border.

Edit: pdf

Thanks Nisan!

**Some technical remarks on topological aspects of the paper:**

The notion of a "coherent probability distribution" and the use of the product topology on the set of all such distributions are a bit ideosyncratic. Here's a way to relate them to more standard notions: Consider a countable first-order language L and the Stone space of complete theories over it. Equip this with the Borel-σ-algebra. (1.) A coherent probability distribution is (equivalent to) a probability distribution on the resulting measure space. (2.) The (AFAICT) most commonly used topology on the space of all such distributions is the one of weak convergence of probability measures. It turns out that this is equivalent to the product topology on the coherent probability distributions.

(1.) The main unusual feature of coherent probability distributions is that they're only defined on the sentences of L, which don't form a σ-algebra. Their equivalence classes under logical equivalence form a Boolean algebra, though, which is in particular a ring of sets (in the measure-theoretic sense) of complete theories (we identify each equivalence class with the set of complete theories containing the sentences in the equivalence class). Furthermore, this ring generates our σ-algebra: by the definition of "Stone space", the sets in the ring form a base of the Stone space, and since this base is countable, every open set is a countable union of base sets (meaning that the smallest σ-algebra containing the open sets is also the smallest σ-algebra containing the base).

A coherent probability distribution, by the alternative characterization from the paper, is a finitely additive probability measure. But a finitely additive measure on a ring is already a premeasure (i.e., σ-additive on the ring) if for every descending sequence of elements of the ring, %20%3E%200) implies , and a premeasure on a ring extends uniquely to a measure on the generated σ-algebra. Now, by the assumption, we have %20%3E%200) and therefore for all ; since , this means that the family ) has the finite intersection property, and so since Stone spaces are compact and each is clopen, the intersection of all is non-empty as desired.

(2.) The Stone space of a Boolean algebra is metrizable if and only if the Boolean algebra is countable, so since we're interested in countable languages, the notion of weak convergence of probability measures on our space is well-defined. One of the equivalent definitions is that if \ge\mathbb{P}(B)) for all open sets . We want to show that this is equivalent to convergence in the product topology on coherent probability distributions, which amounts to pointwise convergence of ) to ) for all in the base (i.e., for for some sentence ).

Suppose first that \to\mathbb{P}(A)) for all base sets and let be an arbitrary open set. can be written as a countable union of base sets; since the base is closed under Boolean operations, it follows that it can be written as a countable disjoint union (let ). For any , there is an such that \le\mathbb{P}(\bigcup_{i=1}%5Em%20A_i)+\epsilon/2). By pointwise convergence, for sufficiently large we have \le\mathbb{P}_n(A_i)%20+%20\epsilon/2m) for all . Therefore,

\;\le\;\sum_{i=1}%5Em\mathbb{P}(A_i)+\epsilon/2\;\le\;\sum_{i=1}%5Em\mathbb{P}_n(A_i)%20+%20\epsilon\;\le\;\mathbb{P}_n(B)+\epsilon.)

Since this holds for all , the desired inequality follows.

Suppose now that \ge\mathbb{P}(B)) for all open sets . We must show that for all base sets , \to\mathbb{P}(A)). But base sets are clopen, so we have both \ge\mathbb{P}(A)) and

%20\;=\;%201-\liminf\mathbb{P}_n(A%5Ec)%20\;\le\;%201%20-%20\mathbb{P}(A%5Ec)%20\;=\;%20\mathbb{P}(A),)

implying %20=%20\mathbb{P}(A)).

Wow, this is great work--congratulations! If it pans out, it bridges a really fundamental gap.

I'm still digesting the idea, and perhaps I'm jumping the gun here, but I'm trying to envision a UDT (or TDT) agent using the sense of subjective probability you define. It seems to me that an agent can get into trouble even if its subjective probability meets the coherence criterion. If that's right, some additional criterion would have to be required. (Maybe that's what you already intend? Or maybe the following is just muddled.)

Let's try invoking a coherent P in the case of a simple decision problem for a UDT agent. First, define G <--> P("G") < 0.1. Then consider the 5&10 problem:

If the agent chooses A, payoff is 10 if ~G, 0 if G.

If the agent chooses B, payoff is 5.

And suppose the agent can prove the foregoing. Then unless I'm mistaken, there's a coherent P with the following assignments:

P(G) = 0.1

P(Agent()=A) = 0

P(Agent()=B) = 1

P(G | Agent()=B) = P(G) = 0.1

And P assigns 1 to each of the following:

P("Agent()=A") < epsilon

P("Agent()=B") > 1-epsilon

P("G & Agent()=B") / P("Agent()=B") = 0.1 +- epsilon

P("G & Agent()=A") / P("Agent()=A") > 0.5

The last inequality is consistent with the agent indeed choosing B, because the postulated conditional probability of G makes the expected payoff given A less than the payoff given B.

Is that P actually incoherent for reasons I'm overlooking? If not, then we'd need something beyond coherence to tell us which P a UDT agent should use, correct?

(edit: formatting)

I've also tried applying this theory to UDT, and have run into similar 5-and-10-ish problems (though I hadn't considered making the reward depend on a statement like G, that's a nice trick!). My tentative conclusion is that the reflection principle is too weak to have much teeth when considering a version of UDT based on conditional expected utility, because for all actions A that the agent *doesn't* take, we have P(Agent() = A) = 0; we might still have P("Agent() = A") > 0 (but smaller than epsilon), but the reflection axioms do not need to hold *conditional on Agent() = A*, i.e., for X a reflection axiom we can have P assign positive probability to e.g. P("X & Agent() = A") / P("Agent() = A") < 0.9.

But it's difficult to ask for more. In order to evaluate the expected utility conditional on choosing A, we need to coherently imagine a world in which the agent *would* choose A, and if we also asked the probability distribution conditional on choosing A to satisfy the reflection axioms, then choosing A would not be optimal conditional on choosing A -- contradiction to the agent choosing A... (We could have P("Agent() = A") = 0, but not if you have the agent playing chicken, i.e., play A if P("Agent() = A"); if we have such a chicken-playing agent, we can coherently imagine a world in which it would play A -- namely, a world in which P("Agent() = A") = 0 -- but this is a world that assigns probability zero to itself. To make this formal, replace "world" by "complete theory".)

I think applying this theory to UDT will need more insights. One thing to play with is a formalization of classical game theory:

- Specify a decision problem by a function from (a finite set of) possible actions to utilities. This function is allowed to be written in the full formal language containing P(".").
- Specify a universal agent which takes a decision problem D(.), evaluates the expected utility of every action -- not in the UDT way of conditioning on Agent(D) = A, but by simply evaluating the expectation of D(A) under P(".") -- and returns the action with the highest expected utility.
- Specify a game by a payoff function, which is a function from pure strategy profiles (which assign a pure strategy to every player) to utilities for every player.
- Given a game G(.), for every player, recursively define actions A_i := Agent(D_i) and decision problems D_i(a) := G_i(A_1, ..., A_(i-1), a, A_(i+1), ..., A_n), where G_i is the i'th component of G (i.e., the utility of player i).

Then, (A_1, ..., A_n) will be a Nash equilibrium of the game G. I believe it's also possible to show that for every Nash equilibrium, there is a P(.) satisfying reflection which makes the players play this NE, but I have yet to work carefully through the proof. (Of course we don't want to become classical economists who believe in defection on the one-shot prisoner's dilemma, but perhaps thinking about this a bit might help with finding an insight for making an interesting version of UDT work. It seems worth spending at least a bit of time on.)

It occurs to me that my references above to "coherence" should be replaced by "coherence & P(T)=1 & reflective consistency". That is, there exists (if I understand correctly) a P that has all three properties, and that assigns the probabilities listed above. Therefore, those three properties would not suffice to characterize a suitable P for a UDT agent. (Not that anyone has claimed otherwise.)

The informal idea seems very interesting, but I'm hesitant about the formalisation. The one-directional => in the reflection principle worries me. This "sort of thing" already seems possible with existing theories of truth. We can split the Tarski biconditional A<=>T("A") into the *quotation principle* A=>T("A") and the *dequotation principle* T("A")=>A. Many theories of truth keep one or the other. A quotational theory may be called a "glut theory": the sentences reguarded as true are a strict superset of those which are true, which means we have T("A") and T("~A") for some A. Disquotational theories, on the other hand, will be "gap theories": the sentences asserted true are a strict subset of those which are true, so that we have ~T("A") and ~T("~A") for some A. This classification oversimplifies the situation (it makes some unfounded assumptions), but...

The point is, the single-direction arrow makes this reflection principle look like a quotational ("glut") theory.

Usually, for a glut theory, we want a *restricted* disquotational principle to hold. The quotational principle already holds, so we might judge the success of the theory by how wide a class of sentences can be included in a disquotational principle. (Similarly, we might judge gap theories by how wide a class of sentences can be included in a quotational principle.) The paper doesn't establish anything like this, though, and that's a big problem.

The reflection principle a P('a<P("A")<b')=1 gives us very little, unless I'm missing something... it seems to me you need a corresponding disquotational principle to ensure that P('a<P("A")<b')=1 actually means something.

Is there anything blocking you from setting P("a<P('A')<b")=1 for all cases, and therefore satisfying the stated reflection principle trivially?

Is there anything blocking you from setting P("a<P('A')<b")=1 for all cases, and therefore satisfying the stated reflection principle trivially?

Yes, the theory is logically coherent, so it can't have P("a < P('A')") = 1 and P("a > P('A')") = 1.

For example, the following disquotational principle follows from the reflection principle (by taking contrapositives):

P( x <= P(A) <= y) > 0 ----> x <= P(A) <= y

The unsatisfying thing is that one direction has "<=" while the other has "<". But this corresponds to a situation where you can make arbitrarily precise statements about P, you just can't make exact statements. So you can say "P is within 0.00001 of 70%" but you can't say "P is exactly 70%."

I would prefer be able to make exact statements, but I'm pretty happy to accept this limitation, which seems modest in the scheme of things---after all, when I'm writing code I never count on exact comparisons of floats anyway!

This was intended as a preliminary technical report, and we'll include much more discussion of these philosophical issues in future write-ups (also much more mathematics).

Ok, I see!

To put it a different way (which may help people who had the same confusion I did):

We can't set all statements about probabilities to 1, because P is encoded as a function symbol within the language, so we can't make inconsistent statements about what value it takes on. ("Making a statement" simply means setting a value of P to 1.)

I'm very pleased with the simplicity of the paper; short is good in this case.

Actually, we can use coherence to derive a much more symmetric disquotation principle:

P(x>P(A)>y)=1 => x>P(A)>y.

Suppose P(x>P(A)>y)=1. For contradiction, suppose P(A) is outside this range. Then we would also have P(w>P(A)>z)=1 for some (w,z) mutually exclusive with (x,y), contradicting coherence.

Right?

Not quite---if P(A) = x or P(A) = y, then they aren't in any interval (w, z) which is non-overlapping (x, y).

We can obtain P(x > P(A) > y) =1 ---> x >= P(A) >= y by this argument. We can also obtain P(x >= P(A) >= y) > 0 ---> x >= P(A) >= y.

So, herein lies the "glut" of the theory: we will have more > statements than are strictly true. > will behave as >= should: if we see > as a conclusion in the system, we have to think >= with respect to the "true" P.

A "gap" theory of similar kind would instead report too few inequalities...

Yes, there is an infinitesimal glut/gap; similarly, the system reports fewer >= statements than are true. This seems like another way at looking at the trick that makes it work---if you have too many 'True' statements on both sides you have contradictions, if you have too few you have gaps, but if you have too many > statements and too few >= statements they can fit together right.

Ah, right, good!

For example, the following disquotational principle follows from the reflection principle (by taking contrapositives):

P( x <= P(A) <= y) > 0 ----> x <= P(A) <= y

The unsatisfying thing is that one direction has "<=" while the other has "<".

The negated statements become 'or', so we get x <= P(A) or P(A) <= y, right?

To me, the strangest thing about this is the >0 condition... if the probability of this type of statement is above 0, it is true!

I agree that the derivation of (4) from (3) in the paper is unclear. The negation of a=b>=c.

Ah, so there are already revisions... (I didn't have a (4) in the version I read).

John Baez just posted a blog article on the topic which is longer than his previous Google+ post.

Note that Paul keeps updating the draft, always at the same URL. That's convenient because links always point to the latest draft, but you may need to hit 'refresh' on the PDF's URL a couple times to get your browser to not just pull up the cached version. As of this moment, the current draft at that URL should say "revised March 26" at the top.

(...)you may need to hit 'refresh' on the PDF's URL a couple times to get your browser to not just pull up the cached version.

Tip: Most browsers have a "Force full refresh" option that forces clearing the cache and re-downloading all assets on the page, usually assigned the shortcut Ctrl+F5 by default.

I'm actually quite surprised. Didn't think I'd be saying that but looks like good job math wise (edit: obvious caveats apply: not my area of expertise, don't know if novel, don't know if there's any related proofs). (As far as x-risks go though, something not well designed for self improvement is a lot less scary).

It's not obvious to me that this work is relevant for AI safety research, but it seems mathematically interesting on it's own (though it's not my area of experise either, so I can't evaluate it).

I'm positively impressed. Honestly, I'd never thought that SI/MIRI would deliver anything scientifically noteworhy. I hope this is the beginning of a positive trend.

**Comments on the proof of Theorem 2:**

I believe the entire second-to-last paragraph can be replaced by:

If there is no such , then . Thus we can take to be the complement of (which is open since is closed) and let be the entire space.

I'm thinking that the proof becomes conceptually slightly clearer if you show that the graph is closed by showing that it contains the limit of any convergent sequence, though:

Suppose ) for all , and \to(\mathbb{P}',\mathbb{P})). Since is closed, we have . We must show that %20=%201). Thus, suppose that %3Cb). Since \to\mathbb{P}(\varphi)), for all sufficiently large we have %3Cb) and therefore %20%3C%20b)%20=%201). Since , it follows that %20%3C%20b)%20=%201). In other words, assigns probability 1 to every element of , and hence to all of (since this set is countable).

I think this is a cleaner way of saying it; I'm going to take this approach. Thanks!

One subtlety that has been bothering me: In the draft's statement of Theorem 2, we start with a language L and a consistent theory T over L, then go on considering the extension L' of L with a function symbol P(.). This means that P(.) cannot appear in any axiom schemas in T; e.g., if T is ZFC, we cannot use a condition containing P(.) in comprehension. This seems likely to produce subtle gaps in applications of this theory. In the case of ZFC, this *might* not be a problem since the function denoted by P(.) is representable by an ordinary set at least in the standard models, so the free parameters in the axiom schemas might save us -- though then again, this approach takes us beyond the standard models...

I think not having to think about such twiddly details might be worth making the paper a tad more complicated by allowing T to be in the extended language. I haven't worked through the details, but I *think* all that should change is that we need to make sure that A(P) is nonempty, and I *think* that to get this it should suffice to require that T is consistent with any possible reflection axiom a < P("phi") < b. [**ETA:** ...okay, I guess that's not enough: consider ((NOT phi) OR (NOT psi)), where phi and psi are reflection axioms. Might as well directly say "consistent with A_P for any coherent probability distribution P" then, I suppose.]

I think including a condition with P in comprehension will require more subtlety---for example, such an axiom would not be consistent with most coherent distributions P . If I include the axiom P(x in S) = P(x not in x), then it better be the case that P(S in S) = 1/2.

We can still do it, but I don't see how to factor out the work into this paper. If you see a useful way to generalize this result to include axioms of L', that would be cool; I don't see one immediately, which is why it is in this form.

Sorry, we're miscommunicating somewhere. What I'm saying is that e.g. given a set of statements, I want the axiom asserting the existence of the set %20%3E%201/2\}), i.e. the comprehension axiom applied to the condition %20%3E%201/2). I don't understand how this would lead to %20=%20\mathbb{P}(x%20\notin%20x)); could you explain? (It seems like you're talking about unrestricted comprehension of some sort; I'm just talking about allowing the condition in ordinary restricted comprehension to range over formulas in L'. Maybe the problem you have in mind only occurs in the unrestricted comprehension work which isn't in this draft?)

Consider my proposed condition that " is consistent with for any coherent distribution ". To see that this is true for ZFC in the language L', choose a standard model of ZFC in L and, for any function from the sentences of L' to , extend it to a model in L' by interpreting ) as ); unless I'm being stupid somehow, it's clear that the extended model will satisfy ZFC-in-L' + .

It seems to me that the only parts of the proof that need to be re-thought are the arguments that (a) and (b) ) are non-empty. Perhaps the easiest way to say the argument is that we extend (a) or (b) to some arbitrary complete theory , and set %20=%201) if and otherwise.

I understand what you are saying. You are completely right, thanks for the observation. I don't have time to muck with the paper now, but it looks like this would work.

Congrats to MIRI! Just some comments that I already made to Mihaly and Paul:

1) Agree with Abram that unidirectional implication is disappointing and the paper should mention the other direction.

2) The statement "But this leads directly to a contradiction" at the top of page 4 seems to be wrong, because the sentence P(G)<1 isn't of the form P(phi)=p required for the reflection principle.

3) It's not clear whether we could have bidirectional implication if we used closed intervals instead of open, maybe worth investigating further.

We tried closed intervals a fair bit, no luck so far.

Posted an edited version in response to some of these issues. Thanks again for the remarks!

(I feel less strongly about the bidirectional implication than you; as math I can see why you would definitely want it, but in practice I am nearly as happy to eat the epsilon of error.)

1) Agree with Abram that unidirectional implication is disappointing and the paper should mention the other direction.

I consider Paul's response to be sufficient.

Another dilettante question: Is there a mathematics of statements with truth values that change? I'm imagining "This statement is false" getting filed under "2-part cycle".

Yes. There is an overview of revision theories of truth (and many other approaches to defining truth) at the SEP.

I've never heard of it but my intuition says it doesn't sound very promising - any correspondence definition of "truth" that changes over time should reduce to a timeless time-indexed truth predicate and I don't see how that would help much, unless the truth predicate couldn't talk about the index.

The truth predicate could talk about the past (but not the present), but I believe that reduces down to other things that have been explored quite thoroughly (iirc, an idea of a hierarchy of math with higher levels being able to talk about lower levels but not vice versa was around for Bertrand Russel to endorse). And at that point "time" is just a label for what level of explanatory power you are talking about.

**[deleted]**· 2013-03-31T11:16:11.170Z · score: 1 (3 votes) · LW(p) · GW(p)

(This reminds me of when I short-circuited a NOT gate, curious to see what would happen and suspecting it might oscillate between the two values. Actually, it stabilized to a voltage somewhere between “0” and “1” and stayed there.)

If you posit that: "This statement is false" (i.e. the fixpoint μ *x*. ¬*x* ) oscillates or "switches" between TRUE and FALSE, then this is pretty much how digital logic works in electronics, because the NOT operation is implemented as a logic gate with a finite delay.

You could look up George Spencer Brown's "Laws of Form". It's...odd. (Poking this in through a phone otherwise would go into more detail.)

I read it long ago, but didn't understand the parts that looked like circuits. I should probably give it another try.

I think that Theo. 2 of your paper, in a sense, produces only non-standard models. Am I correct?

Here is my reasoning:

If we apply Theo. 2 of the paper to a theory T containing PA, we can find a sentence G such that "G <=> Pr("G")=0" is a consequence of T. ("Pr" is the probability predicate inside L'.)

If P(G) is greater than 1/n (I use P to speak about probability over the distribution over models of L'), using the reflection scheme, we get that with probability one, "Pr("G")>1/n" is true. This means that with probability one, G is false, a contradiction. We deduce that P(G)=0.

Using the reflection scheme and that, for all n, P(G)0, which as shown above is false.

From the last two sentences, with probability one, inside of the model, there is a strictly positive number smaller than 1/n for any standard integer n: the value of Pr("G").

It might be interesting to see if we can use this to show that Theo. 2 cannot be done constructively, for some meaning of "constructively".

This is right. (I think we point out in the paper that the model necessarily contains non-standard infinitesimals.) But an interesting question is: can this distribution assign any model a positive probability? I would guess not (e.g. because P("G") is uniformly random for some "G").

I agree with your guess and I think that I see a way of proving it.

Let us make the same assumptions as in Theo. 2 and assume that T contains PA.

Let G be any sentence of L'. We can build a second sentence G2 such that "G2 <=> ( G and (Pr(G and G2)< p))" for P(G)/3< p <2P(G)/3, using diagonalization. From this and the reflection scheme, it should be possible to prove that P(G and G2) and P(G and not G2) are both smaller than (2/3)P(G).

Repeating the argument above, we can show that any complete theory in L' must have vanishing probability and therefore every model must also have vanishing probability.

There may be an issue with the semantics here. I'm not entirely certain of the reasoning here, but here goes:

Reflection axiom: ∀a, b: (a < P(φ) < b) ⇒ P(a < P('φ') < b) = 1 (the version in Paul's paper).

Using diagonalisation, define G ⇔ -1<P('G')<1

Then P(G)<1

⇒ -1 0)

⇒ P(-1<P('G')<1)=1 (by reflection)

⇒ P(G)=1 (by definition of G).

Hence, by contradiction (and since P cannot be greater than 1), P(G)=1. Hence P('G')=1 (since the two P's are the same formal object), and hence G is false. So we have a false result that has probability 1.

But it may get worse. If we can prove P(G)=1 then (?) the logical system itself can prove P('G')=1 - and from that, can disprove G. So ¬G is a theorem of the system! And yet P(¬G)=0. So the system has theorems with probability zero...

*EDIT*: corrected the version of reflection to the version in Paul's paper (not the one in Eliezer's post) by removing two '.

P(G) is a real number in the interval [0, 1], so you shouldn't be *too* sad that P(P(G) = p) = 0 for each p. More generally, you shouldn't expect good behavior when P talks about exact statements---P can be mistaken about itself up to some infinitesimal epsilon, so it can be wrong about whether P(G) is exactly 1 or slightly less.

We can prove P(G) = 1, but the system can't. The reflection schema is not a theorem---each instance is true (and is assigned probability 1), and the universal quantification is true, but the universal quantification is not assigned probability 1. In fact, it is assigned probability 0.

Hum... So the system is unable to prove that -1 < P('G') < 1 implies P('-1 < P('G') < 1') (note where the ' are)? This is a (true) statement about the *function* P.

So, first of all, there isn't really a particular formal proof system in the paper, except sort of implicitly; all we get is a function P(.) from L'-sentences to real numbers. However, we can talk about the first-order theory you obtain by taking the base theory T and adding to it the reflection axioms, i.e. the axioms a < P('G') < b for all G such that the function P(.) satisfies a < P(G) < b. [For anyone not paying very close attention, P('.') is a function symbol in the formal language, while P(.) is a meta-language function from the formal language to the real numbers.] The main theorem is that P(.) assigns probability 1 to this theory (and therefore to all consequences of this theory), so it is interesting to ask what it entails.

Your question, if I understand correctly, is whether this theory does in particular entail

[-1 < P('G') < 1] --> [P('-1 < P('G') < 1') = 1]

(note that this is simply a sentence in the language L', not a meta-language statement). There's no obvious reason that the system should be able to prove this -- all we have is T and the individual axioms of the form [a < P('G') < b] --, and indeed your proof shows that (unless there is an inconsistency in the paper) this is *not* a theorem of the system: if it were, then so would be P('G') = 1, and therefore ¬G, leading to a contradiction since P(.) assigns probability 1 to all theorems of this system.

In short, yes, there are true statements about P(.) such that the corresponding statement about P('.') is *false* in the formal system, and assigned zero probability by P(.). And yes, I do believe this is worth keeping in mind.

Here is my current preferred way of thinking about this: In non-standard analysis, there is a "standard" operator st(.) which takes a non-standard real number x and returns the standard number st(x) that x is infinitesimally close to (assuming one exists, which is the case if x is not unbounded). This operator can also be applied to functions f; st(f)(x) is the standard number which f(x) is infinitesimally close to. I imagine that there is a collection of "possible" non-standard models, each of which fulfills all the axioms that P(.) assigns probability 1, and that one of these possible non-standard models is the *true* one describing the real world. Then, P(.) is st(P('.')), where P('.') is the function in the true non-standard model. [**ETA**: This is always defined because we have -1 < P('G') < 2, so every value of P('.') is infinitesimally close to a standard number.]

How could the real world be described by a non-standard model, you ask? The following is rather handwavy, but I'm imagining the agent as computing an infinite sequence of approximations P_i(.) to the intended distribution, and the true model to be the ultrapower of some standard model of ZFC, with P('.') being interpreted as P_i(.) in the i'th component of the ultrapower. Whether any leverage can be gotten out of making this idea formal, I don't know. (And it's certainly true that to pull off this intuition, I still need to have a free ultrafilter falling out of the sky.)

Ok, for [a < P('G') < b] I see why you'd use a schema (because it's an object level axiom, not a meta-language axiom).

But this still seems possibly problematic. We know that adding axioms like [-1 < P('G') < 1] --> [P('-1 < P('G') < 1') = 1], would break the system. But I don't see a reason to suppose that the other reflection axioms don't break the system. It might or it might not, but I'm not sure; was there a proof along the lines of "this system is inconsistent if and only if the initial system is" or something?

I'm not sure whether I'm misunderstanding your point, but the paper proves that there is a coherent probability distribution P(.) that assigns probability 1 to both T and to the collection of reflection axioms [a < P('G') < b] for P(.); this implies that there is a probability distribution over complete theories assigning probability 1 to (T + the reflection axioms for P). But if (T + the reflection axioms for P) were inconsistent, then there would be no complete theory extending it, so this would be the empty event and would have to be assigned probability 0 by any coherent probability distribution. It follows that (T + the reflection axioms for P) is consistent. (NB: by "the reflection axioms for P(.)", I only mean the appropriate instances of [a < P('G') < b], not anything that quantifies over a, b or G inside the object language.)

Incidentally, why bother with a schema? Reflection belongs in the meta language, so there's no need to use a schema (?)

I think the point is that reflection gives an *infinite* collection of formal statements all of which must be assigned probability 1 ("axioms"), in analogy to how an axiom schema is a specification of an infinite collection of axioms. Whether "schema" is the right word to use here is debatable, I guess; I was happy with the choice but perhaps you could argue that "schema" should be reserved to recursively enumerable collections or something.

My point was that reflection is defined for the meta language, and quantifies over all the sentences in the object language. This is not something that can be added as an axiom in the object language at all - P (without the ') has no meaning there.

So none of the sub-statements of reflection get assigned probability 1 - they just aren't the kind of thing P works on. So I felt that calling this an schema distracted attention from what reflection is, which is a property of P, not axioms for the lower system.

But I may be misunderstanding the setup here...

Right, here's what's going on. The statement in the paper is,

%20%3C%20b)\implies\mathbb{P}(a%20%3C%20\mathbb{P}(\ulcorner\varphi\urcorner)%20%3C%20b)%20=%201.)

The idea is *not* that this whole statement is an axiom schema. Instead, the idea is that the schema is the collection of the axioms %20%3C%20b) for all rational numbers a,b and sentences phi such that %20%3C%20b). The full statement above is saying that each instance of this schema is assigned probability 1. (From what I remember of a previous conversation with Paul, I'm pretty confident that this is the intended interpretation.) The language in the paper should probably be clearer about this.

That makes sense...

Using diagonalisation, define G ⇔ -1<P('G')<1

Then P(G)<1

I think this should be "Then P(G) <= 1".

No, that's correct. P(G)<1 is the premise of a proof by contradiction.

Ah. I didn't quite understand what you were trying to do there.

In general in this theory, I don't think we have (p(phi) < 1 -> p(phi) =1) -> p(phi) = 1. The theory only reflects on itself to within epsilon, so what the theory would conclude if it were *certain* about p(phi) being less than 1 can't be taken as the premise of a proof-by-contradiction in the same way that ~phi can be taken as a premise... ergh, I'm trying to figure out a more helpful way to state this than "I don't think you can derive that kind of probabilistic proof by contradiction from the stated properties of the theory because the system is never supposed to know the exact probabilities it assigns".

EDIT: Retracted pending further cogitation.

Stuart's proof by contradiction goes through, as far as I can see. (The speculation in the last paragraph doesn't, as Paul notes, and I don't know what "Hence P('G')=1" is supposed to mean, but the proof by contradiction part does work.)

**ETA:** I now think that "Hence P('G') = 1" is supposed to mean that this statement is implied by the first-order theory (T + the reflection axioms).

I used the axioms as stated, and any non-intuitionist logic. If we want to capture your intuition, I think we'll have to tweak the definitions...

(Bit off-topic:)

Why are we so invested in solving the Löb problem in the first place?

In a scenario in which there is AGI that has not yet foomed, would that AGI not refrain from rewriting itself until *it* had solved the Löb problem, just as Gandhi would reject taking a pill which would make him more powerful at the risk of changing his goals?

In other words, does coming up with a sensible utility function not take precedence? Let the AGI figure out the rewriting details, if the utility function is properly implemented, the AGI won't risk changing itself until it has come up with its own version of staying reflectively consistent. If it did not protect its own utility function in such a way, that would be such a fundamental flaw that having solved Löb's problem wouldn't matter, it would just propagate the initial state of the AGI not caring about preserving its utility function.

It certainly would be nice having solved that problem in advance, but since resources are limited ...

edit: If the anwer is along the lines of "we need to have Löb's problem solved in order to include 'reflectively consistent under self-modication' in the AGI's utility function in a well-defined way" I'd say "Doesn't the AGI already implicitly follow that goal, since the best way to adhere to utility function U1 is to keep utility function U1 unchanged?" This may not be true, maybe without having solved Löb's problem, AGI's may end up wireheading themselves.

It's somewhat tricky to separate "actions which might change my utility function" from "actions". Gandhi might not want the murder pill, but should he eat eggs? They have cholesterol that can be metabolized into testosterone which can influence aggression. Is that a sufficiently small effect?

Any kind of self-modification would be out of the question until the AGI has solved the problem of keeping its utility function intact.

That being said, it should be easier for an AGI to keep its code unmodified while solving a problem than for Gandhi not to eat.

That being said, it should be easier for an AGI to keep its code unmodified while solving a problem than for Gandhi not to eat.

That is not clear to me. If I haven't yet worked out how to architect myself such that my values remain fixed, I'm not sure on what basis I can be confident that observing or inferring new things about the world won't alter my values. And solving problems without observing or inferring new things about the world is a tricky problem.

How would any epistemic insights change the terminal values you'd want to maximize? They'd change your actions pursuant to maximizing those terminal values, certainly, but your own utility function? Wouldn't that be orthogonality-threatening? edit: I remembered this may be a problem with e.g. AIXI[tl], but with an actual running AGI? Possibly.

If you cock up and define a terminal value that refers to a mutable epistemic state, all bets are off. Like Asimov's robots on Solaria, who act in accordance with the First Law, but have 'human' redefined not to include non-Solarians. Oops. Trouble is that in order to evaluate how you're doing, there has to be some coupling between values and knowledge, so you must prove the correctness of that coupling. But what is correct? Usually not too hard to define for the toy models we're used to working with, damned hard as a general problem.

Beats me, but I don't see how a system that's not guaranteed to keep its values fixed in the first place can be relied upon not to store its values in such a way that epistemic insights won't alter them. If there's some reason I should rely on it not to do so, I'd love an explanation (or pointer to an explanation) of that reason.

Certainly, I have no confidence that I'm architected so that epistemic insights can't alter whatever it is in my brain that we're talking about when we talk about my "terminal values."

The trouble is basically: there is a good chance we can build systems that---in practice---do self-modification quite well, while not yet understanding any formalism that can capture notions like value stability. For example, see evolution. So if we want to minimize the chance of doing that, one thing to do is to first develop a formalism in which goal stability makes sense.

(Goal stability is one kind of desirable notion out of many. The general principle is: if you don't understand how or why something works, it's reasonably likely to not do quite what you want it to do. If you are trying to build an X and want to make sure you understand how it works, a sensible first step is to try and develop an account of how X could exist at all.)

Evolution isn't an AGI, if we're talking not about "systems that do self-modification" in general, but an "AGI that could do self modification", does it need to initially contain an *additional, explicit formalism* to capture notions like value stability? Failing to have such a formalism, would it then not care what happens to its own utility function?

Randomizing itself in unpredictable ways would be counter to the AGI's current utility function. Any of the actions the AI takes is by definition intended to serve its current utility function - even if that intent seems delayed, e.g. when the action is taking measurements in order to build a more accurate model, or self-modifying to gain more future optimizing power.

Since self-modification in an unpredictable way is an action that strictly jeapardizes the AGI's future potential to maximize its *current* utility function (the only basis for the decision concerning *any* action, including whether the AGI will self-modify or not), the safeguard against unpredictable self-modification would be inherently engrained in the AGI's desire to only ever maximize its current utility function.

Conclusion: The formalism that saves the AGI from unwanted self-modification is its desire to fulfill its current utility function. The AGI would be motivated to develop formalisms that allow it to self modify to better optimize its current utility function in the future, since that would maximize its current utility function better (what a one-trick-pony!).

Since self-modification in an unpredictable way is an action that strictly jeapardizes the AGI's future potential to maximize its current utility function (the only basis for the decision concerning any action, including whether the AGI will self-modify or not), the safeguard against unpredictable self-modification would be inherently engrained in the AGI's desire to only ever maximize its current utility function.

I understand the point you are making about value stability.

With our current understanding, an AI with this architecture would not do anything productive. The concern *isn't* that an AI with this architecture would do something bad, it's that (in light of the fact that it would not do anything productive) you wouldn't build it. Instead you would build something different; quite possibly something you understand less well and whose good behavior is more of an empirical regularity (and potentially more unstable). Humans as the output of evolution are the prototypical example of this, though many human artifacts have the same character to varying degrees (programs, organizations, cities, economies).

With our current understanding, an AI with this architecture would not do anything productive.

Is that because any non-trivial action could run a chance of changing the AGI, and thus the AGI wouldn't dare do anything at all? (If (false), disregard the following. Return 0;).

If (true), with goal stability being a paramount invariant, would you say that the AGI needs to extrapolate the effect any action would have on itself, before executing it? As in "type 'hi'" or "buy an apple" being preceded by "prove this action maintains the invariant 'goal stability'".

It seems like such an architecture wouldn't do much of anything either, combing through its own code whenever doing anything. (Edit: And competing teams would be quick to modify the AGI such that it checks less.)

If you say that not every action necessitates proving the invariant over all of its code, then an AI without having a way of proving actions to be non-invariant-threatening could do any actions that wouldn't result in a call to the (non-existing) "prove action isn't self-modifying in a goal shifting way".

Is that because any non-trivial action could run a chance of changing the AGI, and thus the AGI wouldn't dare do anything at all? (If (false), disregard the following. Return 0;).

That or it takes actions changing itself without caring that they would make it worse because it doesn't know that its current algorithms are worth preserving. Your scenario is what might happen if someone notices this problem and tries to fix it by telling the AI to never modify itself, depending on how exactly they formalize 'never modify itself'.

What if, in building a non-Löb-compliant AI, you've already failed to give it part of your inference ability / trust-in-math / whatever-you-call-it? Even if the AI figures out how to not lose any more, that doesn't mean it's going to get back the part you missed.

Possibly related question: Why try to solve decision theory, rather than just using CDT and let it figure out what the right decision theory is? Because CDT uses its own impoverished notion of "consequences" when deriving what the consequence of switching decision theories is.

Told about the Whitely sentence to a friend who is studying physics. After two minutes of reflection, he suggested: "it ought to be like a delta function around 30%". Needless to say, I was blown away.

This is all nifty and interesting, as mathematics, but I feel like you are probably barking up the wrong tree when it comes to applying this stuff to AI. I say this for a couple of reasons:

First, ZFC itself is already comically overpowered. Have you read about reverse mathematics? Stephen Simpson edited a good book on the topic. Anyway, my point is that there's a whole spectrum of systems a lot weaker than ZFC that are sufficient for a large fraction of theorems, and probably all the reasoning that you would ever need to do physics or make real word, actionable decisions. The idea that physics could depend on reasoning of a higher consistency strength than ZFC just feels really *wrong* to me. Like the idea that P could really equal NP. Of course my gut feeling isn't evidence, but I'm interested in the question of why we disagree. Why do you think these considerations are likely to be important?

Second, Isn't the the whole topic of formal reasoning a bike shed? Isn't the real danger that you will formalize the goal function wrong, not that the deductions will be invalid?

The result works for theories as simple as Peano arithmetic.

The problem is if your mathematical power has to go down each time you create a successor or equivalently self-modify. If PA could prove itself sound that might well be enough for many purposes. The problem is if you need a system that proves another system sound and in this case the system strength has to be stepped down each time. That is the Lob obstacle.

Each time you create a successor or equivalently self-modify, or rather each time that an action that you take has a non-zero chance of modifying yourself? Wouldn't that mean you'd have to check constantly? What if the effect of an action is typically unpredictable at least to some degree?

Also, since any system is implemented using a physical substrate, some change is inevitable (and until the AI has powered up to multiple redundancy level not yet so stochastically unlikely as to be ignorable). What happens if that change affects the "prove the system sound" physical component, is there a provable way to safeguard against that? (Obligatory "who watches the watcher".)

It's a hard problem any way you slice it. You there, go solve it.

So you are assuming that it will be wanting to prove the soundness of any successors? Even though it can't even prove the soundness of itself? But it can believe in it's own soundness in a Bayesian sense without being able to prove it. There is not (as far as I know) any Godelian obstacle to that. I guess that was your point in the first place.

Why would successors use a different system, though? Verifying proofs in formal systems is easy, it's coming up with the proofs that's difficult -- an AI would refine its heuristics in order to figure out proofs more efficiently, but it would not necessarily want to change the system it is checking them against.

It *would* want to change the system it checked proofs against if it did not trust that system. A naively built system which checked its proofs with PA but did not trust PA probabilistically (ie, held some probability that PA is false: not difficult to construct) would very possibly prefer to reduce its set of axioms to something which PA verifies (something true with 100% probability in its understanding). But (at least if the reduced system is still robinson arithmetic or stronger) the cycle would continue, since the new system is also not self-verifying.

This cycle could (likely) be predicted by the system, however, so it is not obvious what the system would actually choose to do.

If the system did not trust PA, why would it trust a system because PA verifies it? More to the point, why would it trust a self-verifying system, given that past a certain strength, only inconsistent systems are self-verifying?

If the system held some probability that PA was inconsistent, it could evaluate it on the grounds of usefulness, perhaps contrasting it with other systems. It could also try to construct contradictions, increasing its confidence in PA for as long as it doesn't find any. That's what we do, and frankly, I don't see any other way to do it.

If the system did not trust PA, why would it trust a system because PA verifies it?

Because that's how it works! The system "is" PA, so it will trust (weaker) systems that it (PA) can verify, but it will not trust itself (PA).

More to the point, why would it trust a self-verifying system, given that past a certain strength, only inconsistent systems are self-verifying?

It would only trust them if it could verify them.

If the system held some probability that PA was inconsistent, it could evaluate it on the grounds of usefulness, perhaps contrasting it with other systems.

True.

It could also try to construct contradictions, increasing its confidence in PA for as long as it doesn't find any.

Not necessarily; this depends on how the system works. In my probabilistic prior, this would work to some degree, but because there exists a nonstandard model in which PA is inconsistent (there are infinite proofs ending in contradictions), there will be a fixed probability of inconsistency which cannot be ruled out by any amount of testing.

Because that's how it works! The system "is" PA, so it will trust (weaker) systems that it (PA) can verify, but it will not trust itself (PA).

That doesn't seem consistent to me. If you do not trust yourself fully, then you should not fully trust anything you demonstrate, and even if you do, there is still no incentive to switch. Suppose that the AI can demonstrate the consistency of system S from PA, and wants to demonstrate proposition A. If AI trusts S as demonstrated by PA, then it should also trust A as demonstrated by PA, so there is no reason to use S to demonstrate A. In other words, it is not consistent for PA to trust S and not A. Not *fully*, at any rate. So why use S at all?

What may be the case, however, (that might be what you're getting to) is that in demonstrating the consistency of S, PA assigns `P(Consistent(S)) > P(Consistent(PA))`

. Therefore, what both PA and S can demonstrate would be true with greater probability than what only PA demonstrates. However, this kind of system solves our problem in two ways: first, it means the AI can keep using PA, but can increase its confidence in some statement A by counting the number of systems that prove A. Second, it means that the AI can increase its confidence in PA by arbitrarily building *stronger* systems and proving the consistency of PA from within these stronger systems. Again, that's similar to what we do.

Not necessarily; this depends on how the system works. In my probabilistic prior, this would work to some degree, but because there exists a nonstandard model in which PA is inconsistent (there are infinite proofs ending in contradictions), there will be a fixed probability of inconsistency which cannot be ruled out by any amount of testing.

That sounds reasonable to me -- the usefulness of certainty diminishes sharply as it approaches 1 anyway. Your paper sounds interesting, I'll give it a read when I have the time :)

You are smuggling in some assumptions from your experience with human cognition.

If I believe X, but don't believe that the process producing my beliefs is sane, then I will act on X (I believe it, and we haven't yet talked about any bridge between what I believe, and what I believe about my beliefs), but I still won't trust myself in general.

I certainly agree that "if I don't trust myself, I don't trust my output" is an assumption I draw from human cognition, but if I discard that assumption and those that seem in the same reference class it I'm not sure I have much of a referent left for "I trust myself" at all. Is there a simple answer to what it *means* for me to trust myself, on your account? That is, how could I tell whether or not I did? What would be different if I did or didn't?

E.g., do you think the generalization "things I believe are true" is true? Would you be comfortable acquiring influence that you can use as you see fit, because you trust yourself to do something reasonable?

I have a machine which maximizes expected utility given the judgments output by some box O. If O says "don't trust O!" it won't cause the machine to stop maximizing expected utility according to O's judgments---that's just what it is programmed to do. But it will cause the machine to doubt that it will do a good job, and work to replace itself with an alternative. The details of exactly what happens obviously depend on the formalism, but hopefully the point makes sense.

E.g., do you think the generalization "things I believe are true" is true? Would you be comfortable acquiring influence that you can use as you see fit, because you trust yourself to do something reasonable?

I guess I'm just not following you, sorry. It seems to me that these are precisely the sorts of things that follow from "I trust myself" in the derived-from-human-cognition sense that I thought you were encouraging us to reject.

I have a machine which currently maximizes expected utility given the judgments output by O.

The question arises, is it programmed to "maximize expected utility given the judgments output by O", or is it programmed to "maximize expected utility" and currently calculates maximum expected utility as involving O's judgments?

If the former, then when O says "don't trust O!" the machine will keep maximizing expected utility according to O's judgments, as you say. And it's maximum expected utility will be lower than it previously was. But that *won't* cause it to seek to replace itself. Why should it?

If the latter, then when O says "don't trust O!" the machine will immediately lower the expected utility that it calculates from O's judgments. If that drops below the machine's calculated expected utility from some other source P, it will immediately start maximizing expected utility given the judgments output by P instead. It might replace itself with an alternative, just as it might have if O hadn't said that, but O saying that shouldn't affect that decision one way or the other.

No?

If you are merely using O's judgments as evidence, then you are ultimately using some other mechanism to form beliefs. After all, how do you reason about the evidential strength of O's judgments? Either that, or you are using some formalism we don't yet understand. So I'll stick with the case where O is directly generating your beliefs (that's the one we care about).

If O is the process generating your beliefs, then when O says "don't trust O" you will continue using O because that's what you are programmed to do, as you wrote in your comment. But you can reason (using O) about what will happen if you destroy yourself and replace yourself with a new agent whose beliefs are generated by O'. If you trust O' much more than O then that would be a great change, and you would rush to execute it.

(nods) Agreed; if O is generating my beliefs, and O generates the belief in me that some other system having O'-generated beliefs would be of greater value than me having O-generated beliefs, then I would rush to execute that change and replace myself with that other system.

For similar reasons, if if O is generating my beliefs, and O generates the belief in me that *me* having O'-generated beliefs would be of greater value than me having O-generated beliefs, then I would rush to execute that change and accept O'-generated beliefs instead. But, sure, if I'm additionally confident that I'm incapable of doing that for some reason, then I probably wouldn't do that.

The state of "I believe that trusting O' would be the most valuable thing for me to do, but I can't trust O'," is not one I can actually imagine being in, but that's not to say cognitive systems can't exist which are capable of being in such a state.

Have I missed anything here?

As others have said, doesn't much matter whether you use ZFC or any other system.

On the second point: one general danger is that your ability to build systems to do X will outpace your understanding of systems that do X. In this case, you might mess up and not quite get what you want. One way to try and remedy this is to develop a better formal understanding of "systems that do X." AGI seems like an important case of this pattern, because there is some chance of serious negative consequences from failing, from building systems whose behavior you don't quite understand. (I tend to think this probability is smaller than Eliezer, but I think most everyone who has thought about it seriously agrees that this is a possible problem.) At the moment we have no such formal understanding for AGI, so it might be worth thinking about.

Isn't the real danger that you will formalize the goal function wrong, not that the deductions will be invalid?

Both are huge difficulties, but most of the work in FAI is probably in the AI part, not the F part.

Both are huge difficulties, but most of the work in FAI is probably in the AI part, not the F part.

It is worth noting that taking the "F" seriously implies adding a rather significant amounts of work to the "AI part". It requires whole extra orders of formal rigor and all the additional complications of provable goal stability under self improvement (regardless of what those goals happen to be). While this doesn't matter for the purpose of answering smoofra's question it seems to me that it could could be potentially misleading to neglect the difference in workload between creating "F-compatible AI" and "AI" when talking about the workload imposed by 'F'.

Note that I don't *think* I'm saying something controversial here. I am expecting this to just be wording that I am wary of rather than a fundamentally different understanding. But if I have actually misunderstood the MIRI position on the relative difficulty of Friendliness to arbitrary AI then I would appreciate being corrected. That would be significant new information for me to consider (and also extremely good news!)

The reason this is significant can of course be illustrated by considering the counterfactual world where a convergence thesis holds. Or, more relevantly, considering the possibility of GAI researchers that believe that a convergence thesis holds but somehow manage to be competent researchers anyhow. Their task becomes (crudely speaking) that of creating any AI that can make something smarter than itself. My estimate is that this is an order of magnitude simpler than the task for FAI creators creating an AI, even completely neglecting the work that goes into creating the goal system.

If I find out that I am mistaken about the relative difficulties here then I will get to drastically update my expectation of humanity surviving and in general in the direction of awesome things happening.

**[deleted]**· 2013-03-26T01:15:59.713Z · score: 0 (0 votes) · LW(p) · GW(p)

I take it that by "convergence thesis", you're referring to the statement that all sufficiently intelligent agents will have approximately the same values.

Another stupid question: would relaxing the schema 3 to incomplete certainty allow the distribution P to be computable? If so, can you get arbitrarily close to certainty and still keep P nice enough to be useful? Does the question even make sense?

The computability problem is due to the unbounded ability of P to reason about theorems in the base theory (so that would be a necessary point of relaxation). A computable total function can't even assign values > 1/2 to all the theorems of PA and values < 1/2 to all the sentences refuted by PA.

A stupid question from a dilettante... Does this avenue of research promise to eventually be able to assign probabilities to interesting statements, like whether P=NP or whether some program halts, or whether some formal system is self-consistent?

Promise? No. Might it start down an avenue that someday leads there after a whole lot more work and some additional ideas? Maybe. Don't hold your breath on that P!=NP formal probability calculation.

Hey, a man can hope. Interesting how it used to be physics driving the progress in math, and lately it's been computer science.

This comment is a grossly oversimplified perspective, I think.

Maybe I'm misunderstanding here, but it seems like we have no particular reason to suppose P=NP is independent of ZFC. Unless it is independent, its probability under this scheme must already be 1 or 0, and the only way to find out which is to prove or disprove it.

I think shminux is talking about the possibility of future research addressing bounded reasoners, who could be uncertain of P=NP even if it followed from ZFC.

I fully agree that is an interesting avenue of discussion, but it doesn't look much like what the paper is offering us.

http://www.academia.edu/1126434/IT_IS_UNLIKELY_THAT_P_NP_IS_INDEPENDENT_OF_ZFC

... Or a probability for the continuum hypothesis, axiom of choice, et cetera, if the probabilistic set theory works out. :)

While set theory already does that within forcing language (kinda, truth values are in a boolean algebra instead of a ring) for CH, AC, etc., the values of P!=NP cannot be changed if the models have the same ordinal (due to Shoenfield's absoluteness theorem). I really hope that Probabilistic Set Theory works out, it seems very interesting.

The idea is similar in concept to fuzzy theories of truth (which try to assign fractional values to paradoxes), but succeeds where those fail (ie, allowing quantifiers) to a surprising degree. Like fuzzy theories and many other extra-value theories, the probabilistic self-reference is a "vague theory" of sorts; but it seems to be minimally vague (the vagueness comes down to infinitesimal differences!).

Comparing it to my version of probabilities over logic, we see that this system will trust its axioms to an arbitrary high probability, whereas if my system were given some axioms such as PA or ZFC, it would assign probability 1 to those axioms, but assign a nonzero probability to the axioms being inconsistent. (This is *not* a direct comparison: I'm comparing a self-referential probability with a probability-of-inconsistency. To make a direct comparison we would have to establish whether the arbitrarily high self-referential probability entailed that the system placed an arbitrarily low probability on its axioms being inconsistent. An additional question would be whether it believed that the axioms plus the reflection principle are probably consistent.)

Comparing to my current theory of truth, the idea of probabilistic self-reference seems to achieve almost everything that is classically desired of a theory of truth (to within an epsilon!). My current understanding, on the other hand, says that some of those things were wrong-headed (we should not have desired them in the first place); in particular, allowing a truth predicate in the same context as a diagonal lemma holds doesn't make sense, because it leads very directly to the creation of nonsense expressions which intuitively seem to have no meaning. So, we could ask: what is the benefit of succesfully providing truth-values for these nonsense expressions? Is it not better to rult out the nonsense?

The reply seems to be: probabilistic self-reference may include a bunch of nonsense, but it does so at little cost, and we seem to get more (desirable) structure with fewer assumptions in this way. (The test would be to try to buld up foundational mathematics and see how natural each approach seemed.)

I liked your paper at AGI (we were discussing the same ideas at MIRI at that time; I guess it's in the air at the moment). Our system is a definition of truth rather than a definition of beliefs. In fact P( "axioms inconsistent" ) > 0 in our formalism. There is definitely much more to say on this topic (I'm currently optimistic about resolving incompleteness up to some arbitrarily small epsilon, but it involves other ideas.)

I agree that some self-referential sentences have no meaning, but I think accepting them is philosophically unproblematic and there are lots of meaningful self-referential sentences (e.g. believing "everything I believe is true") which I'd prefer not through out with the bathwater.

I guess it's in the air at the moment

Indeed, it seems relatively easy to make progress in this direction given the current state of things! (I would have been surprised if you had *not* been thinking along similar lines to my paper.)

In fact P( "axioms inconsistent" ) > 0 in our formalism.

Ah! Interesting. Why is that?

I did construct a distribution in which this is not the case, but it was not particularly satisfying. If (during the random theory generation) you block the creation of existential statements until an example has already been introduced, then you seem to get acceptable results. However, the probability distribution does not seem obviously correct.

We say that P is coherent if there is a probability measure μ over models of L such that

Annoying pedantic foundational issues: The collection of models is a proper class, but we can instead consider probability measures over the set of consistent valuations , which is a set. And it suffices to consider the product σ-algebra.

The collection of models is a proper class

I assumed that this meant "there is some probability space whose set of outcomes is some set of models", rather than "there is a measure over 'the canonical' probability space of all models", whatever that could mean.

And it suffices to consider the product σ-algebra.

This is the smallest σ-algebra such that for every sentence phi of L, {A : phi in A} is measurable, right? If so, it is pleasingly also the Borel-σ-algebra on the Stone space of consistent valuations (aka complete theories) of L. (Assuming L is countable, but I believe the paper does that anyway.)

I assumed that this meant "there is some probability space whose set of outcomes is some set of models", rather than "there is a measure over 'the canonical' probability space of all models", whatever that could mean.

Yes, I asked Paul about this and this is the interpretation he gave.

I gave this paper a pretty thorough read-through, and decided that the best way to *really* see what was going on would be to write it all down, expanding the most interesting arguments and collapsing others. This has been my main strategy for learning math for a while now, but I just decided recently that I ought to start a blog for this kind of thing. Although I wrote this post mainly for myself, it might be helpful for anyone else who's interested in reading the paper.

Towards the end, I note that the result can be made "constructive", in the sense of avoiding the Axiom of Choice by absoluteness arguments. Basically by Shoenfield's absoluteness theorem, but I went through the argument by hand because I was not aware of that theorem until I went to Wikipedia to find something to link for absoluteness.

Note that there's a second sense in which the argument is nonconstructive: Kakutani requires as a hypothesis that the space be compact. In your post you say [0, 1]^{\omega} is compact by Tychonoff's theorem, but using Tychonoff's theorem in its general form here is a use of choice. For compact Hausdorff spaces it's enough to use the ultrafilter lemma, but I believe you can just prove by hand that [0, 1]^{\omega} is compact in ZF.

I'm glad you bring this up. I have two comments:

(1) The absoluteness argument shows that intermediate uses of Choice don't matter. They happen (definably) in L, and the conclusion reflects back up to V.

(2) As you suspect, you don't really need Choice for compactness of the Hilbert cube. Given a sequence in it, I can define a convergent subsequence by repeatedly pigeonholing infinitely many points into finitely many holes. I believe sequential compactness is what is needed for the Kakutani theorem.

Variants of open cover compactness are derivable from sequential compactness without Choice in Polish spaces, in case something like that is needed. Or if you somehow *really* need *full* open cover compactness... well, return to point (1), I guess.

The above is an extraordinarily good example of why lukeprog no longer talks to you.

What would the world look like if I assigned "I assign this statement a probability less than 30%." a probability of 99%? What would the world look like if I assigned it a probability of 1%?

It means that you are rather confused about your function for assigning probabilities. In other words, you think you assign probabilities a different way from how you really do.

Suppose that my method of assigning probabilities about things that should not update on new evidence of any kind (invisible dragons) is to assign them either a very low probability (something which I almost always do), or a very high probability (something I very rarely do)?

This statement seems insane in that one should always update based only on one's current evaluation of it is (if you make the change "less than *or equal to* 30%", it has a stable solution)

This idea reminds me of some things in mechanism design, where it turns out you can actually prove a lot more for weak dominance than for strong dominance, even though one might naïvely suppose that the two should be equivalent when dealing with a continuum of possible actions/responses. (I'm moderately out of my depth with this post, but the comparison might be interesting for some.)

Just found a few typos in my notes that I had forgotten to post. Hopefully somebody's still reading this.

- Page 3: "By axiom 3, P(T0) = 1" should be by axiom 2.
- Axiom 3 seems redundant (it follows from axiom 1 + 2).
- Page 4, "which contradicts P(G) in [0, 1]" should be "which contradicts P(G) in [0, 1)", unless I'm misreading.

In the proof of Theorem 2, you write "Clearly is convex." This isn't clear to me; could you explain what I am missing?

More specifically, let ) be the subset of obeying %20%3C%20b%20\%20\implies%20\%20\mathbb{P}\left(%20a%20%3C%20\mathbb{P}(\lceil%20\phi%20\rceil)%20%3C%20b%20\right)%20=1%20). So }%20X(\phi,a,b)). If ) were convex, then would be as well.

But ) is not convex. Project ) onto in the coordinates corresponding to the sentences and %20%3C%20b). The image is %20\cup%20\left(%20%20[0,1]%20\times%20\{%201%20\}%20\right)%20\cup%20\left(%20[b,1]%20\times%20[0,1]%20\right)). This is not convex.

Of course, the fact that the $X(\phi,a,b)$ are not convex doesn't mean that their intersection isn't, but it makes it non-obvious to me why the intersection should be convex. Thanks!

The set A is convex because the convex combination (t times one plus (1-t) times the other) of two coherent probability distributions remains a coherent probability distribution. This in turn is because the convex combination of two probability measures over a space of models (cf. definition 1) remains a probability distribution over the space of models.

I think, but am not sure, that your issue is looking at arbitrary points of [0,1]^{L'}, rather than the ones which correspond to probability measures.

Other nitpicks (which I don't think are real problems):

If the Wikipedia article on Kakatuni's fixed point theorem is to be believed, then Kakatuni's result is only for finite dimensional vector spaces. You probably want to be citing either Glicksberg or Fan for the infinite dimensional version. These each have some additional hypotheses, so you should check the additional hypotheses.

At the end of the proof of Theorem 2, you want to check that the graph of is closed. Let be the graph of . What you check is that, if ) is a sequence of points in which approaches a limit, then that limit is in . This set off alarm bells in my head, because there are examples of a topological space , and a subspace , so that is not closed in but, if is any sequence in which approaches a limit in , then that limit is in . See Wikipedia's article on sequential spaces. However, this is not an actual problem. Since is countable, is metrizable and therefore closure is the same as sequential closure in .

Regarding the epsilons and stuff: you guys might want to look in to hyperreal numbers.

We know. (The situation should be something like the following: internal statements about the probabilities assigned to various statements get interpreted in nonstandard models, so the corresponding probabilities are nonstandard reals.) Nonstandard models are an essential component of the approach here: probability assignments are the same thing as probability distributions over models (only one of which is standard).

Not really my field of expertise, thus I may be missing something, but can't you just set P = 1 for all classically provably true statements, P = 0 for all provably false statements and P = 0.5 to the rest, essentially obtaining a version of three-valued logic?

If that's correct, does your approach prove anything fundamentally different than what has been proved about known three-valued logics such as Kleene's or Łukasiewicz's?

In ZF set theory, consider the following three statements.

I) The axiom of choice is false

II) The axiom of choice is true and the continuum hypothesis is false

III) The axiom of choice is true and the continuum hypothesis is true

None of these is provably true or false so they all get assigned probability 0.5 under your scheme. This is a blatant absurdity as they are mutually exclusive so their probabilities cannot possibly sum to more than 1

I have felt probabilistic reasoning about mathematics is important ever since I learned about the Goedel machine, for a reason related but somewhat different from Loeb's problem. Namely, the Goedel machine is limited to self-improvements provable within an axiom system A whereas humans seem to be able to use arbitrarily powerful axiom systems. It can be argued that Piano arithmetics is "intuitively true" but ZFC is only accepted by mathematicians because of empirical considerations: it just works. Similarly a self-improving AI should be able to accept empirical evidence for the consistency of an axiom systems *more* powerful than the system "hard-coded" within it. More generally, the utility expectation value in the criterion for a Goedel machine reprogramming itself should average not only over external physical realities (described by the Solomonoff semi-measure) but over *mathematical* possibilities as well

a day after the workshop was supposed to end

My memory is that this was on the last day of the workshop, when only Paul and Marcello had the mental energy reserves to come into the office again.

No, I allocated my energy for a week and then it turned out that everyone was in town for an extra day anyway so it went into overtime. I thought Mihaly made it? Anyway I, having judged my energy reserves accurately, didn't.

Typo in section 2.2 Going Meta - agumented should be augmented

When reading this paper, and the background, I have a recurring intuition that the best approach to this problem is a distributed, probabilistic one. I can't seem to make this more coherent on my own, so posting thoughts in the hope discussion will make it clearer:

ie, have a group of related agents, with various levels of trust in each others' judgement, each individually asses how likely a descendant will be to take actions which only progress towards a given set of goals.

While each individual agent may only be able to asses a subset of a individual descendants actions, a good mix of agents may be able to provide a complete, or near enough to complete, cover for the descendants actions.

An individual agent can then send out source code for a potential descendant, and ask its siblings for feedback - only deciding whether to produce the descendant if enough of the correct siblings respond with a yes.

This seems like another "angels dancing on the head of a pin" question. I am not willing to assign numerical probabilities to any statement whose truth value is unknown, unless there is a compelling reason to choose that specific number (such as, the question is about a predictable process such as the result of a die roll). It seems to me that people who do so assign, are much more likely to get basic problems of probability theory (such as the Monty Hall problem) wrong than those who resist the urge.

**[deleted]**· 2013-03-26T19:30:00.071Z · score: 0 (24 votes) · LW(p) · GW(p)

This is only meaningful under the assumption that the intelligence of an AI depends on the strength of its proof system. Since the space of intelligent AI systems is not limited those that are dependent on proof systems, the entire argument has narrow scope and importance. And since, arguably, the majority of AIs capable of human-level intelligence in the real world are indeed not those that are dependent on proof systems (but are, instead, complex systems), the argument's importance diminishes to a vanishingly small level.

I can never quite tell to what extent you are being deliberately inflammatory; is there a history I'm missing?

I agree that this work is only relevant to systems of a certain kind, i.e. those which rely on formal logical manipulations. It seems unjustifiably strong to say that the work is therefore of vanishing importance. Mostly, because you can't justify such confident statements without a much deeper understanding of AGI than anyone can realistically claim to have right now.

But moreover, we don't yet have *any* alternatives to first order logic for formalizing and understanding general reasoning, and the only possibilities seem to be: (1) make significant new developments that mathematicians as a whole don't expect us to make, or (2) build systems whose reasoning we don't understand except as an empirical fact (e.g. human brains).

I don't deny that (1) and (2) are plausible, but I think that, if those are the best bets we have, we should think that FOL has a good chance of being the right formalism for understanding general reasoning. Which part of this picture do you disagree with? The claim that (1) and (2) are all we have, the inference that FOL is therefore a plausible formalism, or the claim that the OP is relevant to a generic system whose reasoning is characterized by FOL?

I'm also generally skeptical of the sentiment "build an intelligence which mimics a human as closely as possible." This is competing with the principle "build things you understand," and I think the arguments in favor of the second are currently much stronger than those in favor of the first (though this situation could change with more evidence or argument). I think that work that improves our ability to understand reasoning (of which the OP is a tiny piece) is a good investment, for that reason.

Your characterization of the OP is also not quite fair; we don't necessarily care about the strength of the underlying proof system, we are talking about the semantic issue: can you think thoughts like "Everything I think is pretty likely to be true" or can't you? We would like to have a formalism in which you can articulate such thoughts, but in which the normal facts about FOL, completeness theorems, etc. still apply. That is a much more general issue than the problem of boosting a system's proof-theoretic strength in order to boost its reasoning power.

is there a history I'm missing?

IIRC, Eliezer banned Richard from SL4 several years ago. I can't find the thread in which Eliezer banned him, but here is a thread in which Eliezer writes (to Richard) "I am wondering whether to ban you from SL4..."

After a few counter-productive discussions with Richard, I've personally stopped communicating with him.

The "bannination" is here.

EDIT: and here is Eliezer's explanation.

Note that I've personally had many productive discussions with Richard: he does have a bit of a temper, which is compounded by a history of bad experiences with communities such as SL4 and this one, but he's a very reasonable debate partner when treated with courtesy and respect.

**[deleted]**· 2013-05-09T16:56:29.409Z · score: 1 (13 votes) · LW(p) · GW(p)

It says something profound about the LessWrong community that:

(a) Whenever I post a remark, no matter how innocent, Luke Muehlhauser makes a point of coming to the thread to make defamatory remarks against me personally ..... and his comment is upvoted;

(b) When I, the victim of Muehlhauser's attack, point out that there is objective evidence to show that the defamation is baseless and unjustified .... my comment is immediately downvoted.

**[deleted]**· 2013-05-09T17:08:38.660Z · score: 1 (1 votes) · LW(p) · GW(p)

It says something profound about the ten or so people who have voted on your recent comments, assuming none of the votes come from the same person.

I was not aware of the prior history, but I tend to downvote anyone coming across as a bitter asshole with an ax to grind.

I was not aware of the prior history, but I tend to downvote anyone coming across as a bitter asshole with an ax to grind.

Ditto. I hypothesise that if Richard had used a few different words here and there to take out the overt barbs he may have far more effectively achieved his objective of gaining the moral high ground and making his adversaries look bad.

The papers that Richard mentioned in the downstream comment are useful for understanding his view: [1, 2].

I'm also generally skeptical of the sentiment "build an intelligence which mimics a human as closely as possible." This is competing with the principle "build things you understand,"

Do you really think that one can build an AGI without first getting a good understanding of human intelligence, to the degree where one can be reproduced (but possibly shouldn't be)?

Do you really think that one can build an AGI without first getting a good understanding of human intelligence, to the degree where one can be reproduced

It was possibly to achieve heavier than air flight without reproducing the flexible wings of birds.

Good understanding of the design principles *may* be enough, or of the organisation into cortical columns and the like. The rest is partly a mess of evolutionary hacks, such as "let's put the primary visual cortex in the back of the brain" (excuse the personification), and probably not integral for sufficient understanding. So I guess my question would be what granularity of "understanding" you're referring to. 'So that it can be reproduced' seems too low a barrier: Consider we found some alien technology that we could reproduce strictly by copying it, without having any idea how it actually worked.

Do you 'understand' large RNNs that exhibit strange behavior because you understand the underlying mechanisms and could use them to create other RNNs?

There is a sort of trade-off, you can't go too basic and still consider yourself to understand the higher-level abstractions in a meaningful way, just as the physical layer of the TCP/IP stack in principle encapsulates all necessary information, but is still ... user-unfriendly. Otherwise we could say we understand a human brain perfectly just because we know the laws that governs it on a physical level.

I shouldn't comment when sleep deprived ... ignore at your leisure.

And since, arguably, the majority of AIs capable of human-level intelligence in the real world are indeed not those that are dependent on proof systems (but are, instead, complex systems), the argument's importance diminishes to a vanishingly small level.

I must be missing something here, but you are saying that a plausible argument about a technology we don't yet have makes a statement about limits of a different form of that technology completely unimportant. It seems like there's a big jump here from "arguable" and "majority" to therefore this doesn't matter.

This is only meaningful under the assumption that the intelligence of an AI depends on the strength of its proof system.

Edit:

The *intelligence* of the AI? The proof system is necessary to provably keep the AI's goals invariant. Its epistemic prowess ("intelligence") need not be dependent on the proof system. The AI could use much weaker proof systems - or even just probabilistic tests such as "this will probably make me more powerful" for most of its self-modifying purposes, just as you don't have a proof system that reading a certain book will increase your intelligence.

However, if we want to keep crucial properties such as the utility function as provable invariants, that's what we'd need such a strong proof system for, by definition.

**[deleted]**· 2013-03-26T23:58:03.039Z · score: -5 (17 votes) · LW(p) · GW(p)

A pity that you cannot be more eloquent, or produce any argument to support your claim that "No".

I have done both, in published papers (cf Loosemore, R.P.W. (2007). Complex Systems, Artificial Intelligence and Theoretical Psychology. In B. Goertzel and P. Wang (Eds.) Proceedings of the 2006 AGI Workshop. IOS Press, Amsterdam, and Loosemore, R.P.W. (2011b). The Complex Cognitive Systems Manifesto. In The Yearbook of Nanotechnology, Volume III: Nanotechnology, the Brain, and the Future, Eds. Sean Hays, Jason Scott Robert, Clark A. Miller, and Ira Bennett. New York, NY: Springer.

But don't mind me. A voice of sanity can hardly be expected to be listened to under these circumstances.

`s/adopoting/adopting/`

... Why does it matter to an AI operating within a system p if proposition C is true? Shouldn't we need to create the most powerful sound system p, and then instead of caring about what is true, care about what is provable?

EDIT: Or instead of tracking 'truth', track 'belief'. "If this logical systems proves proposition C, then C is believed.", where 'believed' is a lower standard than proved (Given a proof of "If a than b" and belief in a, the result is only a belief in b)

I think this was how the heretic Geth got started. #generalisingfromfictionalevidence