# Logical uncertainty, kind of. A proposal, at least.

post by Manfred · 2013-01-13T09:26:15.670Z · LW · GW · Legacy · 35 comments

If you want context and are fast at seeing the implications of math, see Benja's post. This post is much lighter on the math, though it may take more background reading and more laborious interpolation, since it's, well, lighter on the math.

Imagine I introduced my pet robot to a game. The robot has 10 seconds to pick a digit, and if the trillionth prime number ends with that digit, the robot gets a cookie (it likes peanut butter cookies the best). 10 seconds is not enough time for my robot to calculate the answer deductively. And yet, guessing an answer is superior to running out of time quietly. What sort of general logic should my robot follow in under 10 seconds to figure out that it should be indifferent between answering 1, 3, 7 or 9? Does it even make sense to be indifferent between the real answer and an impossible answer, even if you don't know which is which?

As you might expect from context, the proposed solution will involve assigning every true or false math statement a probability-esque degree of plausibility, with numbers other than 0 or 1 indicating logical uncertainty. Why is this a good idea?

To explain logical uncertainty, let's first take a step back and reframe logical *certainty* in terms of rules for reasoning that apply to both deductive logic and probabilistic logic. An important resource here is E.T. Jaynes' Probability Theory (pdf) - the most relevant part being page 31 of the book. The key idea is that each of the probability axioms applies just fine no matter what kind of Boolean statement you want to find the probability of. Which is to say probability already applies to arithmetic - the laws of probability are also laws of arithmetic, just in the limit that probabilities go to 1 or 0. Our robot starts with a collection of definitions labeled with probability 1 (like "0 is a number" or "S(0)+0=S(0)" [if this S(0) stuff needs context, see wikipedia]), and then applies deductive rules according to the universal rules of probability. We translate "A implies B" into the language of probabilities as P(AB|C) = P(A|C), and then apply the always-true product rule P(B|AC)=P(AB|C) / P(A|C). If P(A|C)=1, that is, A|C is deductively true, and A implies B, then P(B|AC)=P(B|C)=1. The machinery that underlies deduction is in fact the same machinery that underlies probabilistic reasoning. And we're just going to exploit that a little.
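As a quick sketch (not from the original post), here's that chain written out in Python, using exact fractions: translate "A implies B" as P(AB|C) = P(A|C), then let the product rule do the work.

```python
from fractions import Fraction

# Deduction as a limiting case of the product rule, a minimal sketch.
# "A implies B" is translated as P(AB|C) = P(A|C); the product rule
# P(B|AC) = P(AB|C) / P(A|C) then forces P(B|AC) = 1.

def p_b_given_ac(p_ab_given_c, p_a_given_c):
    """Product rule: P(B|AC) = P(AB|C) / P(A|C)."""
    return p_ab_given_c / p_a_given_c

# Suppose A|C is deductively certain and A implies B:
p_a = Fraction(1)      # P(A|C) = 1
p_ab = p_a             # the implication: P(AB|C) = P(A|C)
assert p_b_given_ac(p_ab, p_a) == 1   # so P(B|AC) = 1: B is deduced

# The implication step works even when A|C is uncertain:
assert p_b_given_ac(Fraction(3, 4), Fraction(3, 4)) == 1
```

Note that the same two lines of algebra cover both cases: deduction is just what the product rule looks like when the probabilities involved are 0 or 1.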

An alternate axiomatization due to Savage (hat tip to articles by Sniffoy and fool) is based just on actions - it doesn't seem necessary for every agent to store numerical plausibilities, but every agent has to act, and if our agent is to act as if it had consistent preferences when presented with bets, it must act *as if* it calculated probabilities. Just like the conditions of Cox's theorem as used by E.T. Jaynes, the conditions of Savage's theorem apply to bets on arithmetic just fine. So our robot always behaves as if it assigns some probabilities over the last digit of the trillionth prime number - it's just that when our robot's allowed to run long enough, all but one of those probabilities is 0.

So how do we take the basic laws of belief-manipulation, like the product rule or the sum rule, and apply them to cases where we run out of time and can't deduce all the things? If we still want to take actions, we still want to assign probabilities, but we can't use deduction more than a set number of times...

Okay fine I'll just say it. The proposal outlined here is to treat a computationally limited agent's "correct beliefs" as the correct beliefs of a computationally *unlimited* agent with a limited definition of what deduction can do. So this weakened-deduction agent has a limitation, in that starting from axioms it can only prove some small pool of theorems, but it's unlimited in that it can take the pool of proven theorems, and then assign probabilities to *all* the unproven true or false statements. After we flesh out this agent, we can find a computationally limited algorithm that finds correct (i.e. equal to the ones from a sentence ago) probabilities for *specific* statements, rather than all of them. And finally, we have to take this and make a decision procedure - our robot. After all, it's no good for our robot to assign probabilities if it proceeds to get stuck because it tries to compare the utilities of the world if the trillionth prime number ended in 1 versus 7, and doesn't even know what it *means* to calculate the utility of the impossible. We have to make a bit of a modification to the whole decision procedure; we can't just throw in probabilities and expect utility to keep up.

So, formally, what's going on when we limit deduction? Well, remember the process of deduction outlined earlier?

> We translate "A implies B" into the language of probabilities as P(AB|C) = P(A|C), and then apply the always-true product rule P(B|AC)=P(AB|C) / P(A|C). If P(A|C)=1, that is, A|C is deductively true, and A implies B, then P(B|AC)=P(B|C)=1.

There is a chain here, and if we want to limit deduction to some small pool of provable theorems, we need one of the links to be broken outside that pool. As implied, I don't want to mess with the product rule, or else we violate a desideratum of belief. Instead, we'll mess with implication itself - we translate "A implies B" into "P(AB|C)=P(A|C) only if we've spent less than 10 seconds doing deduction." Or "P(AB|C)=P(A|C) only if it's been less than 10^{6} steps from the basic axioms." These limitations are ugly and nonlocal because they represent the intrusion of our agent's resource limits into a system that previously ran forever.

Note that the weakening of implication does not necessarily determine the shape of our pool of deduced theorems. A weakened-deduction agent could spiral outward from shortest to longest theorems, or it could search more cleverly to advance on some specific theorems before time runs out.

If a weakened-deduction agent just had the product rule and this new way of translating the axioms into probabilities, it would accumulate some pool of known probabilities - it could work out from the probability-1 axioms to show that some short statements had probability 1 and some other short statements had probability 0. It could also prove some more abstract things like P(AB)=0 without proving anything else about A or B, as long as it followed the right search pattern. But it can't assign probabilities outside of deduction - it doesn't have the rules. So it just ends up with a pool of deduced stuff in the middle of a blank plain of "undefined."

Okay, back to referring to E.T. Jaynes (specifically, the bottom of page 32). When deriving the laws of probability from Cox's desiderata, the axioms fall into different groups - there are the "laws of thought" parts, and the "interface" parts. The laws of thought are things like Bayes' theorem, or the product rule. They tell you how probabilities have to fit with *other* probabilities. But they don't give you probabilities *ex nihilo*, you have to start with probability-1 axioms or known probabilities and build out from them. The parts that tell you how to get *new* probabilities are the interface parts, ideas like "if you have equivalent information about two things, they should have the same probability."

So what does our limited-deduction agent do once it reaches its limits of deduction? Well, to put it simply, it uses deduction as much as it can, and then it uses the principle of maximum entropy for the probability of everything else. Maximum entropy corresponds to minimum information, so it satisfies a desideratum like "don't make stuff up."

The agent is assigning probabilities to true or false logical statements, statements like S(0)+S(0)=S(S(0)). If it had an unrestricted translation of "A implies B," it could prove this statement quickly. But suppose it can't. Then this statement is really just a string of symbols. The agent no longer "understands" the symbols, which is to say it can only use facts about the probability of these symbols that were previously proved and are within the pool of theorems - it's only a *part* of an algorithm, and doesn't have the resources to prove everything, so we have to design the agent to assign probabilities based just on what it proved deductively.

So the design of our unlimited-computation, limited-deduction agent is that it does all the deduction it can according to some search algorithm and within some limit, and this can be specified to take any amount of time. Then, to fill up the infinity of un-deduced probabilities, the agent just assigns the maximum-entropy probability distribution consistent with what's proven. For clever search strategies that figure out things like P(AB)=0 without figuring out P(A), doing this assignment requires interpretation of AND, OR, and NOT operations - that is, we still need a Boolean algebra for statements. But our robot no longer proves new statements about probabilities of these symbol strings, in the sense that P(S(0)+0=S(0))=P(S(0)+S(0)=S(S(0))) is a new statement. An example of a non-new statement would be P((S(0)+0=S(0)) AND (S(0)+S(0)=S(S(0)))) = P(S(0)+0=S(0)) * P(S(0)+S(0)=S(S(0)) | S(0)+0=S(0)) - that's just the product rule, it hasn't actually changed any of the *equations*.
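Here's a minimal sketch of that assignment rule (my own toy code, not anything from the post): treat atomic statements as opaque labels, treat each proven theorem as a constraint that zeroes out some joint possibilities, and put a uniform distribution over whatever survives.

```python
from itertools import product

# Max-entropy assignment over everything not ruled out by the proven
# pool. Atoms are opaque; "proven" theorems are predicates on truth
# assignments (tuples of bools) that have been shown to have probability 0.

def maxent(n_atoms, impossible):
    """Uniform distribution over the 2^n assignments no theorem rules out."""
    worlds = [w for w in product([False, True], repeat=n_atoms)
              if not any(bad(w) for bad in impossible)]
    return {w: 1 / len(worlds) for w in worlds}

def marginal(dist, i):
    """P(atom i) under the distribution."""
    return sum(p for w, p in dist.items() if w[i])

# With nothing proven, every statement gets probability 1/2:
dist = maxent(3, [])
assert marginal(dist, 0) == 0.5

# Prove P(AB)=0 without proving anything else about A or B individually
# (atom 0 is A, atom 1 is B): A's probability drops to 2/6.
dist = maxent(3, [lambda w: w[0] and w[1]])
assert abs(marginal(dist, 0) - 2 / 6) < 1e-12
```

The second example is exactly the "clever search" case from the text: a theorem mentioning only the conjunction AB still moves the marginal probability of A.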

End of part 1 exercise: Can deducing an additional theorem lead to our agent assigning less probability to the right answer under certain situations? (Reading part 2 may help)

Okay, now on to doing this with actual bounded resources. And back to the trillionth prime number! You almost forgot about that, didn't you. The plan is to break up the strict deduction -> max entropy procedure, and do it in such a way that our robot can get better results (higher probability to the correct answer) the longer it runs, up to proving the actual correct answer. It starts with no theorems, and figures out the max entropy probability distribution for the end of the trillionth prime number. Said distribution happens to be one-half to everything, e.g. p(1)=1/2 and p(2)=1/2 and p(3)=1/2. The robot doesn't know yet that the different answers are mutually exclusive and exhaustive, much less what's wrong with the answer of 2. But the important thing is, assigning the same number to everything of interest is *fast*. Later, as it proves relevant theorems, the robot updates the probability distribution, and when it runs out of resources it stops.

Side note: there's also another way of imagining how the robot stores probabilities, used in Benja's post, which is to construct a really big mutually exclusive and exhaustive basis (called "disjunctive normal form"). Instead of storing P(A) and P(B), which are not necessarily mutually exclusive or exhaustive, we store P(AB), P(A¬B) (the hook thingy means "NOT"), P(¬AB), and P(¬A¬B), which are mutually exclusive and exhaustive. These things would then each have probability 1/4, or 1/2^{N}, where N is the number of statements you're assigning probabilities to. This is a pain when N goes to infinity, but can be useful when N is approximately the number of possible last digits of a number.

Back on track: suppose the first thing the robot proves about the last digit of the trillionth prime number is that answers of 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0 are exhaustive. What does that do to the probabilities? In disjunctive normal form, the change is clear - exhaustiveness means that P(¬1¬2¬3¬4¬5¬6¬7¬8¬9¬0)=0, there's no leftover space. Previously there were 2^{10}=1024 of these disjunctive possibilities, now there are 1023, and the remaining ones stay equivalent in terms of what's been proven about them (nothing), so the probability of each went from 1/1024 to 1/1023. Two things to note: figuring this out took a small amount of work and is totally doable for the robot, but we don't want to do this work every time we use modus tollens, so we need to have some way to tell whether our new theorem matters to the trillionth prime number.
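This first update is small enough to check by brute force. A sketch (enumerating all 2^{10} disjunctive possibilities directly, which a real robot would avoid):

```python
from itertools import product
from fractions import Fraction

# The robot's first theorem about the trillionth prime: ten digit-
# statements, and proving them exhaustive removes exactly one of the
# 2^10 disjunctive possibilities.

digits = list(range(10))
worlds = list(product([False, True], repeat=len(digits)))
assert len(worlds) == 1024

# Exhaustiveness: the all-false world ¬1¬2...¬0 gets probability 0.
worlds = [w for w in worlds if any(w)]
assert len(worlds) == 1023

p = Fraction(1, len(worlds))               # each surviving disjunct: 1/1023
p_digit_3 = sum(p for w in worlds if w[3]) # marginal for "ends in 3"
assert p_digit_3 == Fraction(512, 1023)    # nudged just above 1/2
```

Each digit-statement was true in 512 of the original 1024 worlds, and the deleted world asserted it false, so every digit's marginal ticks up from 1/2 to 512/1023.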

For example, imagine we were interested in the statement A. The example is to learn that A, B, and C are mutually exclusive and exhaustive, step by step. First, we could prove that A, B, C are exhaustive - P(¬A¬B¬C)=0. Does this change P(A)? Yes, it changes from 4/8 (N is 3, so 2^{3}=8) to 4/7. Then we learn that P(AB)=0, i.e. A and B are mutually exclusive. This leaves only A¬BC, ¬ABC, A¬B¬C, ¬AB¬C, and ¬A¬BC. P(A) is now 2/5. Now we learn that A and C are mutually exclusive, so the possibilities are ¬ABC, A¬B¬C, ¬AB¬C, and ¬A¬BC. P(A)=1/4. Each of the steps until now has had the statement A right there inside the parentheses - but for the last step, we show that B and C are mutually exclusive, P(BC)=0, and now we just have P(A)=P(B)=P(C)=1/3. We just took a step that didn't mention A, but it changed the probability of A. This is because we'd previously disrupted the balance between ABC and ¬ABC. To tell when to update P(A), we not only need to listen for A to be mentioned, we have to track what A has been entangled with, and what's been entangled with that, and so on in a web of deduced relationships.
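The whole walkthrough can be checked by enumeration over the disjunctive normal form. A sketch (worlds are (A, B, C) truth assignments):

```python
from itertools import product
from fractions import Fraction

# The A, B, C walkthrough, step by step: each proven theorem filters
# out worlds, and P(A) is just A's share of the surviving worlds.

worlds = list(product([False, True], repeat=3))

def p_A(worlds):
    return Fraction(sum(1 for a, b, c in worlds if a), len(worlds))

assert p_A(worlds) == Fraction(4, 8)                         # nothing proven
worlds = [w for w in worlds if any(w)]                       # P(¬A¬B¬C)=0
assert p_A(worlds) == Fraction(4, 7)
worlds = [(a, b, c) for a, b, c in worlds if not (a and b)]  # P(AB)=0
assert p_A(worlds) == Fraction(2, 5)
worlds = [(a, b, c) for a, b, c in worlds if not (a and c)]  # P(AC)=0
assert p_A(worlds) == Fraction(1, 4)
worlds = [(a, b, c) for a, b, c in worlds if not (b and c)]  # P(BC)=0
assert p_A(worlds) == Fraction(1, 3)   # changed without mentioning A
```

The last assertion is the punchline: the theorem P(BC)=0 never mentions A, yet it moves P(A) from 1/4 to 1/3.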

The good news is that that's *it*. The plausibility assigned to any statement A by this finite-computation method is the same plausibility that our computationally-unlimited deductively-limited agent would have assigned to it, given the same pool of deduced theorems. The difference is just that the limited-deduction agent did this for every possible statement, which as mentioned doesn't make as much sense in disjunctive normal form.

So IF we accept that having limited resources is like having a limited ability to do implication, THEN we know how our robot should assign probabilities to a few statements of interest. It should start with the good old "everything gets probability 1/2," which should allow it to win some cookies even if it only has a few milliseconds, and then it should start proving theorems, updating its probabilities when it proves something that should impact those probabilities.

Now onto the last part. The robot's utility function wasn't really designed for U(last digit of trillionth prime number is 1), so what should it do? Well, what *does* our robot like? It likes having a cookie over not having a cookie. C is for cookie, and that's good enough for it. So we want to transform a utility over cookies into an expected utility that will let us order possible actions.

We have to make the exact same transformation in the case of ordinary probabilities, so let's examine that. If I flip a coin and get a cookie if I call it correctly, I don't have a terminal U(heads) or U(tails), I just have U(cookie). My expected utility of different guesses comes from not knowing which guess leads to the cookie.

Similarly, the expected utility of different guesses when betting on the trillionth prime number comes from not knowing which guess leads to the cookie. It is possible to care about the properties of math, or to care about whether coins land heads or tails, but that just means we have to drag in causality - your guess doesn't affect how math works, or flip coins over.

So the standard procedure for our robot looks like this:

Start with some utility function U over the world, specifically cookies.

Now, face a problem. This problem will have some outcomes (possible numbers of cookies), some options (that is, strategies to follow, like choosing one of 10 possible digits), and any amount of information about how options correspond to outcomes (like "iff the trillionth prime ends with this digit, you get the cookie").

Now our robot calculates the limited-resources probability of getting different outcomes given different strategies, and from that calculates an expected utility for each strategy.

Our robot then follows one of the strategies with maximum expected utility.
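The procedure above can be sketched in a few lines. All the numbers here are illustrative assumptions of mine, not from the post - suppose the robot ran out of time right after its first theorem, leaving each digit at its early uniform probability:

```python
# The robot's decision procedure, a toy sketch. Utility is only over
# cookies; probabilities over guesses come from the theorem-proving
# stage (here a hypothetical snapshot where every digit sits at 1/10).

U = {"cookie": 1.0, "no cookie": 0.0}          # utility over the world
p_correct = {str(d): 0.1 for d in range(10)}   # assumed snapshot

def expected_utility(guess):
    p = p_correct[guess]
    return p * U["cookie"] + (1 - p) * U["no cookie"]

best = max(p_correct, key=expected_utility)    # all digits tie here
assert expected_utility(best) == 0.1
```

With more proving time, p_correct would concentrate on {1, 3, 7, 9} and eventually on 3 alone, and the same two-line expected-utility ranking keeps working unchanged.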

Bonus exercises: Does this procedure already handle probabilistic maps from the options to the outcomes, like in the case of the flipped coin? How about if flipping a coin isn't already converted into a probability, but is left as an underdetermined problem a la "a coin (heads XOR tails) is flipped, choose one"?

## 35 comments

Comments sorted by top scores.

## comment by Wei_Dai · 2013-01-14T11:46:55.601Z · LW(p) · GW(p)

Thanks for writing this, I found it helpful for understanding the motivation behind Benja's proposal and to develop some intuitions about it. I'd like to see some explanation of how statements involving quantifiers are supposed to work in the proposal. Here's my understanding, and please let me know if it's correct.

A statement with a "for all" quantifier is also supposed to start with 1/2 probability. Consider "For all X, Q(X)" for some equation Q. This implies Q(0) and Q(1), etc., each of which also starts with 1/2 probability. What happens as the robot proves individual implications of the form "(For all X, Q(X)) implies Q(0)"? The probability of the quantified statement is reduced by half each time. Proving that individual statements like Q(0) are true can only bring the probability of the quantified statement back up to 1/2 but not above that.

Since proving Q(i) tends to be harder (take longer) than proving "(For all X, Q(X)) implies Q(i)", probability of the quantified statement goes to 0 as we increase the time/length limit even if all of the individual Q(i) are provable. This doesn't seem to be desirable, if my understanding is correct.

Replies from: Manfred, Manfred, alex_zag_al

## ↑ comment by Manfred · 2013-08-12T14:17:48.228Z · LW(p) · GW(p)

Oh, since I was recently thinking about this, I figured I'd write down how the robot actually ends up raising probabilities above 1/2 (without just going to 1): if we can make the probability of Q go down, then we just have to make the probability of ¬Q go down the same way. The probability of Q goes down when we learn that Q implies X but we don't see X. The probability of Q goes up when we learn that ¬Q implies Y but we don't see Y.

## ↑ comment by Manfred · 2013-01-14T13:00:54.141Z · LW(p) · GW(p)

Hm. Yeah, you're right. The probability goes like 1/(1+2^number of things implied). You'd think that "3 is prime, 5 is prime, 7 is prime" would at least make the robot *guess* that all odd numbers were prime.
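(Editorial aside: the 1/(1+2^k) behavior can be checked by enumeration, in the same disjunctive-normal-form style as the post. Atoms are F = "for all X, Q(X)" plus Q(0)..Q(k-1); each proven implication "F implies Q(i)" zeroes out the worlds with F true and Q(i) false.)

```python
from itertools import product
from fractions import Fraction

# P(forall) after k proven implications "F implies Q(i)", by brute force.
# Only one surviving world has F true (the one where every Q(i) holds),
# versus 2^k worlds with F false - hence 1/(1 + 2^k).

def p_forall(k):
    worlds = [w for w in product([False, True], repeat=k + 1)
              if not (w[0] and not all(w[1:]))]   # w[0] is F
    return Fraction(sum(1 for w in worlds if w[0]), len(worlds))

assert p_forall(1) == Fraction(1, 3)
assert p_forall(2) == Fraction(1, 5)
assert p_forall(10) == Fraction(1, 1 + 2**10)
```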

On the other hand, if we think of all possible statements "for all X, Q(X)", there are more ways to be false than true. Infinitely more ways, even, and this is directly related to how many things are implied by the statement. So in that way, the robot is functioning exactly as intended and assigning the maximum entropy value.

Which makes this problem sort of like my suspicions about magfrump's idea below - related to the fact that we expect simple things, we don't actually expect maximum entropy - and we're often right.

Replies from: private_messaging, private_messaging

## ↑ comment by private_messaging · 2013-01-17T12:09:10.885Z · LW(p) · GW(p)

I thought of it some more... Don't think of probability as a way to represent ignorance. Probability represents a kind of knowledge. For example, when you say "the probability that the coin fell heads is 1/2", that represents a sort of model of what bounces do to the coin's direction. When you multiply probabilities for a pair of coin throws, that represents independence of the throws. The 1/2 for Q(i) is not representative of that, and even if it were representative of something (statistics on the Q), the product of these numbers is not a probability but a lower bound on the probability, assuming independence; if they aren't independent, the probability is still 1/2.

Probabilistic methods certainly are very useful in time-bounded calculations - they are used very heavily in computer graphics and simulations of all kinds - but that is usually through use of a random number source or a sufficiently good random number generator, as in probabilistic primality tests such as Miller-Rabin. The time-bounded bot, far from evaluating and updating "probabilities", would have to do symbolic mathematics searching for an imperfect solution, and make use of random number sources whenever appropriate, striving for bounds on the probability of being wrong. (E.g. when I implemented Miller-Rabin I just made it run enough rounds that it wouldn't fail till the heat death of the universe, assuming my prng wouldn't somehow return a huge string of bad witnesses - which would be *very* interesting if it happened.)
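(Editorial aside: for readers unfamiliar with the test being described, here is a standard Miller-Rabin sketch - not the commenter's actual implementation. Each random witness that fails to expose compositeness cuts the error probability by at least a factor of 4, so k rounds leave at most a 4^-k chance of calling a composite number "prime".)

```python
import random

def probably_prime(n, k=40):
    """Miller-Rabin with k random witnesses; error probability <= 4**-k."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    d, s = n - 1, 0
    while d % 2 == 0:      # write n-1 as d * 2^s with d odd
        d //= 2
        s += 1
    for _ in range(k):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False   # a witnesses that n is composite
    return True            # no witness found in k rounds

assert probably_prime(29_996_224_275_833)       # the trillionth prime
assert not probably_prime(29_996_224_275_831)   # divisible by 3
```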

## ↑ comment by Manfred · 2013-01-17T14:53:08.701Z · LW(p) · GW(p)

> I thought of it some more... Don't think of probability as a way to represent ignorance. Probability represents a kind of knowledge.

I'll be specific. Shannon entropy represents ignorance. The bigger it is, the more ignorant you are.

Replies from: private_messaging

## ↑ comment by private_messaging · 2013-01-17T16:15:05.597Z · LW(p) · GW(p)

Well, that doesn't seem very useful, as per Wei Dai's example...

Why not look at the methods people actually employ to approximate and guess math when they can't quite compute something? Applied mathematics is an enormously huge field.

## ↑ comment by private_messaging · 2013-01-17T06:35:35.727Z · LW(p) · GW(p)

> On the other hand, if we think of all possible statements "for all X, Q(X)", there are more ways to be false than true. Infinitely more ways, even,

If you consider all syntactically valid Q in some sort of math notation, no matter the length, there's the fraction that is simply 0=0*( ........... X somewhere here........) , or your favourite less trivial statement of choice. Ditto for Turing machine tapes et cetera. There's certainly a nonzero probability of constructing Q(X) that holds for all X.

Replies from: Manfred

## ↑ comment by Manfred · 2013-01-17T14:49:33.158Z · LW(p) · GW(p)

That is a more detailed model than the robot uses. What it means by the ways to be false or true is more like

"true for 1, true for 2, true for 3, false for 4, true for 5..." The robot can't look inside the statements while it's doing probabilistic logic, it can only look at truth values and relationships.

On the other hand, the power of doing that is certainly a good reason to upgrade the robot :)

## ↑ comment by alex_zag_al · 2014-07-31T15:44:11.767Z · LW(p) · GW(p)

This is counterintuitive in an interesting way.

You'd think that since P(Q1|~∀xQx) = 1/2 and P(Q1|∀xQx) = 1, observing Q1 is evidence in favor of ∀xQx.

And it *is*, but the hidden catch is that this depends on the implication that ∀xQx->Q1, and that implication is exactly the same amount of evidence *against* ∀xQx.

It's also an amusing answer to the end of part 1 exercise.

## comment by magfrump · 2013-01-13T23:27:27.912Z · LW(p) · GW(p)

I feel like the example of digits of prime numbers is a little bit leading here, because prime numbers are actually equidistributed in a technical sense, and sufficiently firmly that if you pretend that everything works like probabilities, you end up getting the right answers and producing deep theories.

There are other situations in which there aren't naive probabilistic models which are actually highly predictive, and even if it's clear to me how to translate your reasoning into the specific example presented, it's not clear to me how to translate it to something about, say, orders of exceptional finite simple groups or smooth structures on real vector spaces, which are both amenable to probabilistic answers but have answers that don't look like probability distributions.

Replies from: Manfred

## ↑ comment by Manfred · 2013-01-14T00:14:03.068Z · LW(p) · GW(p)

> prime numbers are actually equidistributed in a technical sense

The last digit of the trillionth prime number is 3, which is about as non-equidistributed as you can get. That's the right answer. Anything else is just a way for the robot to confess its ignorance gracefully.

EDIT: although one trouble might be that this robot isn't equipped to handle improper distributions. So if you hand it an infinitude of finite simple groups and tell it to choose one, it assigns everything a probability of zero and chooses according to whatever algorithm it uses to choose between things that have identical utility.

Replies from: magfrump, AlexMennen

## ↑ comment by magfrump · 2013-01-14T02:41:35.737Z · LW(p) · GW(p)

Yes, but since the trillionth prime isn't computable in the time allowed, the question translates to "some high prime", and among the set of odd primes, the ones digits are equidistributed over the set {1,3,7,9}. I agree with the point that this type of question can be attacked as though it were a probabilistic question.

My point is that heuristic reasoning WORKS on primes, this is a theorem. If you say "I don't know the answer, but I will use heuristic reasoning to make a guess" and you ask for the ones digits of the trillionth, trillion and third, and two-trillion-and-seventeenth primes, you expect to get three numbers picked at random from that set, and guessing this way will serve you well.
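(Editorial aside: the equidistribution being cited is easy to see empirically. A small sieve - my own illustration, not the commenter's - shows the four coprime residues mod 10 each claim close to a quarter of the primes.)

```python
from collections import Counter

def primes_up_to(n):
    """Simple sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [i for i in range(2, n + 1) if sieve[i]]

# Last digits of all primes above 5 up to 100,000:
counts = Counter(p % 10 for p in primes_up_to(100_000) if p > 5)
assert set(counts) == {1, 3, 7, 9}

# Each residue class holds close to a quarter of the primes:
total = sum(counts.values())
assert all(abs(c / total - 0.25) < 0.02 for c in counts.values())
```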

There are other questions where you might think to apply heuristic reasoning, such as the ones digit of the number of isomorphism classes of finite groups of a given integer order, which like the question about primes is a function from the natural numbers to numbers 0-9, but I do not believe that it gives you a function which anything reasonable is known about; it doesn't converge to a distribution on the set of digits, it just acts weird and changes drastically at unpredictable points.

Replies from: private_messaging, Manfred, Manfred

## ↑ comment by private_messaging · 2013-01-14T12:12:19.746Z · LW(p) · GW(p)

One can make a slightly better guess, though. The number of digits in the trillionth prime can be found via some approximate prime-counting rule, then the distribution of the sums of all digits but the last can be estimated, then the divisibility-by-3 rule over the sum of all digits makes the last digits very slightly non-equiprobable with regard to divisibility by 3. If you use algebra cleverly and avoid any unnecessary computations (such as computing actual 'probabilities' whenever those are unnecessary, answering the question of whether A>B algebraically rather than arithmetically), you might be able to do that in time.

edit: or much more simply, if you have an upper bound and a lower bound on the value of the prime, then within this range the last digits are not equidistributed. Albeit it feels to me that the choice of the last digit will depend entirely on which method comes to mind for our agent, and in turn, the agent will appear irrational (priming - the choice of the digit will depend on what ideas the agent has available), and a malicious third party that knows the answer and wants our agent deprived of cookies could talk the agent into making a wrong guess.

edit: this is quite interesting topic, actually. If you know that different approximation methods will yield different last digits, and you assume that the choice of an approximation method is random, then you should further scale down the utility difference to approximate the sum (average) over all methods.

Though, I don't see anything between a very very slightly better guess, and the correct answer. And if the very very slightly better guess doesn't return 3, you are guaranteed to lose.

Replies from: magfrump

## ↑ comment by magfrump · 2013-01-14T18:49:29.222Z · LW(p) · GW(p)

If you try to count primes IN ANY WAY, then you will end up with the answer that YOU HAVE AN EQUAL CHANCE OF GETTING ANY COPRIME RESIDUE MODULO 10; this is a THEOREM. It is a proven fact of nature, or platonic ideals, or whatever it is theorems are facts of. Measure mod 3 will just tell you that the residues are also equidistributed modulo 3; it will reduce to computing residues modulo 30, which will not give you any information whatsoever because there will be an equal number that reduce to 1,3,7, and 9 modulo 10.

If you have an upper and lower bound, obviously primes aren't EXACTLY equidistributed, but in this case your upper and lower bounds will be millions apart, with primes showing up on average every 27 digits. The probabilities will work out, if you think of this as a random question selected from many possible and indistinguishable questions.

In real life if I ask you "how tall is the tallest Sequoia tree?" versus "how tall is the tallest scott's pine?" Even though both questions have specific answers, if you don't have wikipedia handy, you'll need to default to "how tall do evergreens get" and the answer becomes a probability distribution. The same is true of prime numbers, once you blur out the details as much as they already have been blurred out. So the "correct answer" in a Bayesian sense is the well-calibrated answer, where you have a distribution and confidence intervals that are of known accuracy over many similar questions, even if you get the wrong answer sometimes. This sort of reasoning is totally transparent with respect to one specific mathematical phenomenon; the distribution of prime numbers in a range.

It may not be transparent with respect to other mathematical problems.

Replies from: private_messaging

## ↑ comment by private_messaging · 2013-01-15T01:02:11.832Z · LW(p) · GW(p)

Equidistribution, of course, doesn't imply "equal chances" for the Nth prime. It's 3. If you don't have time to calculate 3, there's no theorem saying you can't conclude something (trivially, in edge cases where you have almost enough time you might be able to exclude 1 somehow). Another interesting thing is that 2^n-1 is rarely prime (n has to be prime, and even then usually not), and we know that numbers of the form 2^n-1 never end in 9; the impact of this decreases as n grows, though. All sorts of small subtle things going on. (And yes, the resulting difference in "probabilities" is very small.)

Replies from: magfrump

## ↑ comment by magfrump · 2013-01-15T01:29:15.518Z · LW(p) · GW(p)

If you were able to conclude additional information about the one trillionth prime OTHER THAN that it is a large prime number which is approximately of size 27 trillion, then that information MAY contain information about that prime modulo 10, I agree.

I would be very surprised (p = .01) if there actually are results of the form "the nth prime has such and such characteristics modulo whatever" because of the strong equidistribution results that do exist, and obviously after someone computes the answer it is known. If you look at prime numbers and apply techniques from probability theory, it does work, and it works beautifully well. Adding information beyond "high prime" would allow you to apply techniques from probability theory to deal with logical uncertainty, and it would work well. The examples you give seem to be examples of this.

My core point is that other problems may be less susceptible to approaches from probability theory. Even if looking at digits of prime numbers is not a probability theory problem, using techniques from probability theory accesses information from theorems you may not have proven. Using techniques from probability theory elsewhere does not do this, because those theorems you haven't proven aren't even true. So I am concerned, when someone purports to solve the problem of logical uncertainty, that their example problem is one whose solution looks like normal uncertainty.

I don't think we disagree on any matters of fact; I think we may disagree about the definition of the word "probability."

Replies from: private_messaging

## ↑ comment by private_messaging · 2013-01-15T05:32:01.502Z · LW(p) · GW(p)

Yes, I agree. In other problems, your "probabilities" are going to not be statistically independent from basic facts of mathematics. I myself posted a top-level comment with regard to the ultimate futility of the 'probabilistic' approach. Probability is not just like logic. If you have a graph with loops or cycles, it is incredibly expensive. It isn't some reals flowing through a network of tubes; the sides of a loop are not statistically independent. It doesn't cut down your time at all, except in extreme examples. (I once implemented a cryptographic algorithm dependent on the Miller-Rabin primality test, which is "probabilistic", and my understanding is that this is common in cryptography and is used by your browser any time you establish an SSL connection.)

Replies from: magfrump

## ↑ comment by Manfred · 2013-01-14T05:18:29.111Z · LW(p) · GW(p)

I think you may be mixing probability with frequency here.

If you say "I don't know the answer, but I will use heuristic reasoning to make a guess," and you ask for the ones digits of the trillionth, trillion-and-third, and two-trillion-and-seventeenth primes, you expect to get three numbers picked at random from the set {1, 3, 7, 9}, and guessing this way will serve you well.
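For primes small enough to enumerate, the equidistribution behind this guess is easy to check empirically; a quick sketch using a plain sieve (names are mine, stdlib only):

```python
from collections import Counter

def primes_up_to(n):
    """Sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, is_p in enumerate(sieve) if is_p]

# Past the single digits, every prime ends in 1, 3, 7, or 9, and Dirichlet's
# theorem says each of those residues gets an asymptotic quarter of the primes.
counts = Counter(p % 10 for p in primes_up_to(100_000) if p > 10)
print(counts)  # four counts, all within a few percent of each other
```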

You should rather say that they're independent if you're computationally limited. But that independence doesn't even matter for this post. I think you're assigning too many things to the word "random." Probability is not about random, it's about ignorance.

EDIT: speaking of ignorance, I probably made this comment in ignorance of what you actually intended, see more recent comment.

## ↑ comment by Manfred · 2013-01-14T07:26:04.053Z · LW(p) · GW(p)

Whoops. Well, I may have figured out what you mean. But just in case, would you be willing to try and explain what sort of extra reasoning the robot should be doing in certain cases as if I had no idea what you meant?

Replies from: magfrump

## ↑ comment by magfrump · 2013-01-14T18:55:51.638Z · LW(p) · GW(p)

What I'm saying is that the robot will actually do well in the given example, because when you use Bayesian inference on primes it works for some reason.

Other questions in math have weird answers; for example, for every natural number n, the number of smooth structures on R^n is exactly 1... except for n=4, in which case there are uncountably many.

Probability distributions can continuously approximate this, but only with a lot of difficulty and inaccuracy. It's an answer that isn't going to appear naturally out of Bayesian thinking. Once an agent has seen one dimension with infinitely many smooth structures, it will be hard-pressed to accept that no other dimension has more than one; or, having seen infinitely many dimensions with exactly one, how will it accept one with infinitely many? Basically, your priors have to be weirdly formalized. This may be a problem that has been addressed.

So my point is that just "throw Bayes at it" is a totally reasonable way of making probabilistic estimates for theorems about countable and compact sets. It's just that that method of reasoning will vary wildly in how useful it is, because some mathematical results look like probabilities and some don't.

I don't have ideas for extra reasoning processes; I just want to urge caution that the problem isn't solved merely because the approach can handle some situations.

Replies from: Manfred

## ↑ comment by Manfred · 2013-01-14T19:40:17.072Z · LW(p) · GW(p)

Hm. I think you think the robot works differently than it actually does.

My guess for what you were going to say was that you *wanted* the robot to work in this different way, but nope.

A robot that predicts the next number in a series is something like a minimum message-length predictor. The robot outlined here actually can't do that.

Instead, the robot attacks each problem based on theorems that require or rule out different combinations of relevant possibilities. So, for example, in the question "how many smooth structures are there in n dimensions" (thank you for making that example clear btw), the robot would for each separate case try to prove what was going on, and if it failed, it would likely do something dumb (like run out of time, or pick the first thing it *hadn't* proven a theorem about) because it doesn't handle infinity-option problems very well. What it wouldn't do is try and predict the answer based on the answer for smaller dimensions, unless it could prove theorems connecting the two.

## ↑ comment by magfrump · 2013-01-15T01:08:10.300Z · LW(p) · GW(p)

Okay, but say that in ten seconds the robot manages to prove the theorem "For all dimensions n not equal to 4, there is exactly one smooth structure on R^n," but is unable to produce a result regarding n=4. Or, more realistically, say it is able to construct one smooth structure on R^n for any n, and can prove that no others exist unless n=4. How does it make guesses in the case n=4?

Replies from: magfrump

## ↑ comment by magfrump · 2013-01-15T01:27:31.752Z · LW(p) · GW(p)

If we want to live in the least convenient possible world, assume that in second 1 it constructs the smooth structures; it takes three seconds to prove that there are no more for n≥5, three seconds to prove there are no more for n=1,2, and three more seconds to prove there are no more for n=3, and then it runs out of time. These results are obtained incidentally from inequalities that arise when pursuing a proof for n=4, which is the central value of some equation at the core of the proofs. (So the proofs really say "if another smooth structure exists, it exists for n<5, 2<n<5, 3<n<5.")

Replies from: Manfred

## ↑ comment by AlexMennen · 2013-01-14T17:35:02.271Z · LW(p) · GW(p)

One trouble might be that this robot isn't equipped to handle improper distributions. So if you hand it an infinitude of finite simple groups and tell it to choose one, it assigns everything a probability of zero and chooses according to whatever algorithm it uses to choose between things that have identical utility.

I don't think this is too hard to fix. The robot could have a unit of improper prior (epsilon) that it remembers is larger than zero but smaller than any positive real.

Of course, this doesn't tell you what to do when asked to guess whether the order of the correct finite simple group is even, which might be a pretty big drawback.

Replies from: Manfred

## ↑ comment by Manfred · 2013-01-14T18:18:29.101Z · LW(p) · GW(p)

For the robot as described, this will actually happen (sort of like Wei Dai's comment - I'm learning a lot from discussing with you guys :D ) - it only actually lowers something's probability once it proves something about it *specifically*, so it just lowers the probability of most of its infinite options by some big exponential, and then, er, runs out of time trying to pick the option with highest utility. Okay, so there might be a small flaw.

## comment by private_messaging · 2013-01-13T12:17:27.985Z · LW(p) · GW(p)

A more effective robot only looks for the most probable digit; it doesn't need to know how probable that digit is (edit: that is, in this specific problem where utilities are equal. There is nothing ad hoc about the use of algebra). E.g., if it were to figure out that for some reason large primes are most likely to end in 3 (or rather, that no digit is more likely than 3), it would not even need to know that 0, 2, 4, 5, 6, 8 are not an option.

Furthermore, as you lower the prime, there has to be a seamless transition to the robot actually calculating the last digit in an efficient manner.

A bounded agent has to rationally allocate its computing time. Calculating direct probabilities or huge sets of relational probabilities - rather than the minimum necessary - is among the least rational things that can be done with the computing time. (An even worse thing to do would be a comparison between two estimates of utility which have different errors, such as two incomplete sums.) Outside very simple toy problems, probabilistic reasoning is incredibly computationally expensive, as often every possible combination of variables has to be considered.

Replies from: Manfred

## ↑ comment by Manfred · 2013-01-13T13:44:34.095Z · LW(p) · GW(p)

A more effective robot only looks for the most probable digit

I agree with the rest of your comment, but this seems too ad hoc. It runs into trouble if outcomes differ in utility, so that you can't just look for high probability. And storing a number seems like a much better way of integrating lots of independent pieces of information than storing a list.

Replies from: private_messaging

## ↑ comment by private_messaging · 2013-01-13T16:23:48.155Z · LW(p) · GW(p)

Then you look for the largest probability*utility, which you generally do by trying to find a way to demonstrate A>B; you can do this in many cases where you can't actually evaluate either A or B (and in many cases where you can only evaluate A and B so inaccurately that the outcome of comparing the evaluations is primarily dependent on the inaccuracies).
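As a toy illustration of demonstrating A>B without evaluating either side exactly: enclose A in an interval and compare endpoints. The sketch below (the names and the particular sums are mine) proves that A = Σ 1/k² exceeds B = Σ 1/(k(k+1)) using only ten terms plus an integral-test tail bound, even though A is never computed exactly (B telescopes to exactly 1).

```python
def enclose_basel(n):
    """Interval enclosure of A = sum_{k>=1} 1/k**2 from n terms:
    the integral test bounds the tail by 1/(n+1) < tail < 1/n."""
    partial = sum(1.0 / k**2 for k in range(1, n + 1))
    return partial + 1.0 / (n + 1), partial + 1.0 / n

lo, hi = enclose_basel(10)
# B = sum_{k>=1} 1/(k*(k+1)) telescopes to exactly 1.
print(lo > 1.0)  # True: A > B is demonstrated without evaluating A
```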

Furthermore, a "probability" is a list, due to loss of statistical independence with other variables. edit: the pieces of information are very rarely independent, too. Some reasoning that 3 is more likely than other digits would not be independent of 2 being a bad choice.

edit: also, holy hell, trillionth prime does end with 3.

## comment by AlexMennen · 2013-01-14T19:22:43.429Z · LW(p) · GW(p)

This is cool, but seems underspecified. Could you write a program that carries out this reasoning with respect to a fairly broad class of problems?

If you have an inconsistent probability distribution, the result you get when you resolve it depends on the order in which you resolve the inconsistencies. For example, given P(A)=1/2, P(B)=1/2, P(AB)=1/2, and P(A¬B)=1/2, you could resolve this by deriving P(A¬B)=0 from the first three, or by deriving P(AB)=0 from the first two and the last one. Both of these answers seem obviously wrong. Does your method have a consistent way of resolving that sort of problem?

Replies from: Manfred

## ↑ comment by Manfred · 2013-01-14T20:43:18.392Z · LW(p) · GW(p)

A nice way of thinking about it is that the robot can do unlimited probabilistic logic, but it only takes finite time because it's only working from a finite pool of proven theorems. When doing the probabilistic logic, the statements (e.g. A, B) are treated as atomic. So you *can* have effective inconsistencies, in that you can have an atom that says A, and an atom that says B, and an atom that effectively says 'AB', and unluckily end up with P('AB')>P(A)P(B). But you can't *know* you have inconsistencies in any way that would lead to mathematical problems. Once you prove that P('AB') = P(AB), where removing the quotes means breaking up the atom into an AND statement, then you can do probabilistic logic on it, and the maximum entropy distribution will no longer be effectively inconsistent.
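The claim in the last sentence can be checked numerically in a toy case (my construction, not the robot's actual machinery): once 'AB' is unpacked and only the marginals P(A) and P(B) remain as constraints, the maximum entropy joint distribution makes A and B independent, so P(AB) lands at P(A)P(B) rather than anywhere effectively inconsistent.

```python
import math

def entropy(ps):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in ps if p > 0)

def maxent_joint(p_a, p_b, steps=10_000):
    """Scan P(AB) = x over its feasible range and return the x whose
    joint distribution over (A, B) has maximum entropy."""
    lo = max(0.0, p_a + p_b - 1.0)
    hi = min(p_a, p_b)
    best_x, best_h = lo, -1.0
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        # Probabilities of (AB, A and not-B, not-A and B, neither).
        joint = [x, p_a - x, p_b - x, 1.0 - p_a - p_b + x]
        h = entropy(joint)
        if h > best_h:
            best_x, best_h = x, h
    return best_x

print(maxent_joint(0.5, 0.5))  # ~0.25 = P(A) * P(B)
```

A brute-force scan is used only for transparency; the analytic result (maximum entropy subject to marginals alone yields the product distribution) is standard.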

## ↑ comment by AlexMennen · 2013-01-14T20:53:52.708Z · LW(p) · GW(p)

Oh, I see. Do you know whether you can get different answers by atomizing the statements differently? For instance, will the same information always give the same resulting probabilities if the atoms are A and B as it would if the atoms are A and A-xor-B?

P('AB')>P(A)P(B)

Not a problem if A and B are correlated. I assume you mean P('AB')>min(P(A), P(B))?

Replies from: Manfred

## ↑ comment by Manfred · 2013-01-15T09:45:23.231Z · LW(p) · GW(p)

Ah, right. Or even P('AB')>P(A).

You can't get different probabilities by atomizing things differently, all the atoms "already exist." But if you prove different theorems, or theorems about different things, then you can get different probabilities.