Comments
Keynes, in his "A Treatise on Probability", talks a lot about analogies in the sense you use the term here, particularly in "Part III: Induction and Analogy". You might find it interesting.
Hmm, thanks. Seems similar to my description above, though as far as I can tell it doesn't deal with my criticisms. It is rather evasive when it comes to the question of what status models have in Bayesian calculations.
I am curious; what is the general LessWrong philosophy about what truth "is"? Personally I so far lean towards accepting an operational subjective Bayesian definition, i.e. the truth of a statement is defined only insofar as we agree on some (in principle) operational procedure for determining its truth; that is, we have to agree on what observations make it true or false.
For example "it will rain in Melbourne tomorrow" is true if we see it raining in Melbourne tomorrow (trivial, but also means that the truth of the statement doesn't depend on rain being "real", or just a construction of Descartes' evil demon or the matrix, or a dream, or even a hallucination). It is also a bit disturbing because the truth of "the local speed of light is a constant in all reference frames" can never be determined in such a way. We could go to something like Popper's truthlikeness, but then standard Bayesianism gets very confusing, since we then have to worry about the probability that a statement has a certain level of "truthlikeness", which is a little mysterious. Truthlikeness is nice in how it relates to the map-territory analogy though.
I am inclined to think that standard Bayesian-style statements about operationally-defined things based on our "maps" make sense, i.e. "If I go and measure how long it takes light to travel from the Earth to Mars, the result will be proportional to c" (with this being influenced by the abstraction that is general relativity), but it still remains unclear to me precisely what this means in terms of Bayes' theorem: i.e. the probability P("measure c" | "general relativity") implies that P("general relativity") makes sense somehow, though the operational criteria cannot be where its meaning comes from. In addition we must somehow account for the fact that "general relativity" is strictly false, in the "all models are wrong" sense, so we need to somehow rejig that proposition into something that might actually be true, since it makes no sense to condition our beliefs on things we know to be false.
I suppose we might be able to imagine some kind of super-representation theorem, in the style of de Finetti, in which we show that degrees of belief in operational statements can be represented as the model average of the predictions from all computable theories, hoping to provide an operational basis for Solomonoff induction, but actually I am still not 100% sure what de Finetti's usual representation theorem really means. We can behave "as if" we had degrees of belief in these models weighted by some prior? Huh? Does this mean we don't really have such degrees of belief in models, but that they are a convenient fiction? I am very unclear on the interpretation here.
The map-territory analogy does seem correct to me, but I find it hard to reconstruct ordinary Bayesian-style statements via this kind of thinking...
Lol that is a nice story in that link, but it isn't a Dutch book. The bet in it isn't set up to measure subjective probability either, so I don't really see what the lesson in it is for logical probability.
Say that instead of the digits of pi, we were betting on the contents of some boxes. For concreteness let there be three boxes, one of which contains a prize. Say also that you have looked inside the boxes and know exactly where the prize is. For me, I have some subjective probability P( X_i | I_mine ) that the prize is inside box i. For you, all your subjective probabilities are either zero or one, since you know perfectly well where the prize is. However, if my beliefs about where the prize is follow the probability calculus correctly, you still cannot Dutch book me, even though you know where the prize is and I don't.
So, how is the scenario about the digits of pi different to this? Do you have some example of an actual Dutch book that I would accept if I were to allow logical uncertainty?
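To make the box example concrete, here is a minimal Python sketch of the claim; my probabilities and the random stake sizes are invented for illustration, and the point is just that no portfolio of bets priced at coherent probabilities is guaranteed a profit in every outcome I consider possible, even for a bettor who knows the answer.

```python
# Minimal sketch, assuming made-up subjective probabilities and random stake sizes:
# if my prices are coherent (non-negative, summing to one), no portfolio of bets
# priced at those values yields the bettor a sure profit across all outcomes I
# consider possible, even if the bettor happens to know the true outcome.
import random

p = [0.2, 0.3, 0.5]  # my subjective probabilities that the prize is in box 1, 2, 3

def bettor_profit(stakes, winning_box):
    """Profit to a bettor who pays p[i]*stakes[i] for a ticket paying stakes[i]
    if the prize is in box i (negative stakes mean the bettor sells me that bet)."""
    cost = sum(s * q for s, q in zip(stakes, p))
    return stakes[winning_box] - cost

worst_cases = []
for _ in range(10000):
    stakes = [random.uniform(-10, 10) for _ in p]
    # a Dutch book would need the minimum over all possible outcomes to be positive
    worst_cases.append(min(bettor_profit(stakes, j) for j in range(len(p))))

print(max(worst_cases))  # never positive: no sure win, only wins that exploit knowledge I lack
```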
edit:
Ok well I thought of what seems to be a typical Dutch book scenario, but it has made me yet more confused about what is special about the logical uncertainty case. So, let me present two scenarios, and I wonder if you can tell me what the difference is:
Consider two propositions, A and B. Let it be the case that A->B. However, say that we do not realise this, and say we assign the following probabilities to A and B:
P(A) = 0.5
P(B) = 0.5
P(B|A) = P(B)
P(A & B) = 0.25
indicating that we think A and B are independent. Based on these probabilities, we should accept the following arrangement of bets:
Sell bet for $0.50 that A is false, payoff $1 if correct
Sell bet for $0.25 that A & B are both true, payoff $1 if correct
The expected amount we must pay out is 0.5 × $1 + 0.25 × $1 = $0.75, which is how much we are selling the bets for, so everything seems fair to us.
Someone who understands that A->B will happily buy these bets from us, since they know that "not A" and "A & B" are actually equivalent to "not A" and "A", i.e. they know P(not A) + P(A & B) = 1, so they win $1 from us no matter what is the case, making a profit of $0.25. So that seems to show that we are being incoherent if we don't know that A->B.
But now consider the following scenario; instead of having the logical relation that A->B, say that our opponent just has some extra empirical information D that we do not, so that for him P(B|A,D) = 1. For him, then, he would still say that
P(not A | D) + P(A & B | D) = P(not A | D) + P(B|A,D)*P(A|D) = P(not A|D) + P(A|D) = 1
so that we, who do not know D, could still be screwed by the same kind of trade as in the first example. But then, this is sort of obviously possible, since having more information than your opponent should give you a betting advantage. But both situations seem equivalently bad for us, so why are we being incoherent in the first example, but not in the second? Or am I still missing something?
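For what it's worth, the sure loss in the first scenario is easy to check mechanically; here is a short Python verification of the arithmetic above, enumerating only the worlds consistent with A->B:

```python
# Verifying the first scenario: we sell "not A" for $0.50 and "A & B" for $0.25,
# each paying $1 if correct. Enumerating only the worlds consistent with A->B
# (the fact we failed to notice) shows the buyer collects exactly $1 every time,
# for a guaranteed profit of $0.25.
from itertools import product

price_not_A, price_A_and_B = 0.50, 0.25

for A, B in product([True, False], repeat=2):
    if A and not B:
        continue  # ruled out once A->B holds
    payout = (1.0 if not A else 0.0) + (1.0 if (A and B) else 0.0)
    profit = payout - (price_not_A + price_A_and_B)
    print(f"A={A}, B={B}: buyer's profit = ${profit:+.2f}")  # +$0.25 in every case
```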
That sounds to me more like an argument for needing lower p-values, not higher ones. If there are many confounding factors, you need a higher threshold of evidence for claiming that you are seeing a real effect.
Physicists need low p-values for a different reason, namely that they do very large numbers of statistical tests. If you choose p=0.05 as your threshold then it means that you are going to be claiming a false detection at least one time in twenty (roughly speaking), so if physicists did this they would be claiming false detections every other day and their credibility would plummet like a rock.
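As a rough illustration of the multiple-testing point (the test counts here are arbitrary, and the tests are treated as independent null tests):

```python
# With a p < 0.05 threshold, the chance of at least one false detection among n
# independent tests of true null hypotheses grows quickly with n.
threshold = 0.05
for n_tests in [1, 20, 100, 1000]:
    p_any_false = 1 - (1 - threshold) ** n_tests
    print(f"{n_tests:5d} tests: P(at least one false detection) = {p_any_false:.3f}")
```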
Is there any more straightforward way to see the problem? I argued with you about this for a while and I think you convinced me, but it is still a little foggy. If there is a consistency problem, surely this means that we must be vulnerable to Dutch books, doesn't it? I.e., they would not seem to be Dutch books to us, with our limited resources, but a superior intelligence would know that they were and would use them to con us out of utility. Do you know of some argument like this?
Very well, then I will wait for the next entry. But I thought the fact that we were explicitly discussing things the robot could not compute made it clear that resources were limited. There is clearly no such thing as logical uncertainty to the magic logic god of the idealised case.
No we aren't, we're discussing a robot with finite resources. I obviously agree that an omnipotent god of logic can skip these problems.
It was your example, not mine. But you made the contradictory postulate that P("wet outside"|"rain")=1 follows from the robot's prior knowledge and the probability axioms, and simultaneously that the robot was unable to compute this. To correct this I alter the robot's probabilities such that P("wet outside"|"rain")=0.5 until such time as it has obtained a proof that "rain" correlates 100% with "wet outside". Of course the axioms don't determine this; it is part of the robot's prior, which is not determined by any axioms.
You haven't convinced nor shown me that this violates Cox's theorem. I admit I have not tried to follow the proof of this theorem myself, but my understanding was that the requirement you speak of is that the probabilistic logic reproduces classical logic in the limit of certainty. Here, the robot is not in the limit of certainty because it cannot compute the required proof. So we should not expect to get the classical logic until updating on the proof and achieving said certainty.
You haven't been very specific about what you think I'm doing incorrectly so it is kind of hard to figure out what you are objecting to. I corrected your example to what I think it should be so that it satisfies the product rule; where's the problem? How do you propose that the robot can possibly set P("wet outside"|"rain")=1 when it can't do the calculation?
Ok sure, so you can go through my reasoning leaving out the implication symbol, but retaining the dependence on the proof "p", and it all works out the same. The point is only that the robot doesn't know that A->B, therefore it doesn't set P(B|A)=1 either.
You had "Suppose our robot knows that P(wet outside | raining) = 1. And it observes that it's raining, so P(rain)=1. But it's having trouble figuring out whether it's wet outside within its time limit, so it just gives up and says P(wet outside)=0.5. Has it violated the product rule? Yes. P(wet outside) >= P(wet outside and raining) = P(wet outside | rain) * P(rain) = 1."
But you say it is doing P(wet outside)=0.5 as an approximation. This isn't true though, because it knows that it is raining, so it is setting P(wet outside|rain) = 0.5, which was the crux of my calculation anyway. Therefore when it calculates P(wet outside and raining) = P(wet outside | rain) * P(rain) it gets the answer 0.5, not 1, so it is still being consistent.
Hmm this does not feel the same as what I am suggesting.
Let me map my scenario onto yours:
A = "raining"
B = "wet outside"
A->B = "It will be wet outside if it is raining"
The robot does not know P("wet outside" | "raining") = 1. It only knows P("wet outside" | "raining", "raining->wet outside") = 1. It observes that it is raining, so we'll condition everything on "raining", taking it as true.
We need some priors. Let P("wet outside") = 0.5. We also need a prior for "raining->wet outside", let that be 0.5 as well. From this it follows that
P("wet outside" | "raining") = P("wet outside" | "raining", "raining->wet outside") P("raining->wet outside"|"raining") + P("wet outside" | "raining", not "raining->wet outside") P(not "raining->wet outside"|"raining") = P("raining->wet outside"|"raining") = P("raining->wet outside") = 0.5
according to our priors [first and second equalities are the same as in my first post (the term with not "raining->wet outside" vanishes because the negated implication entails not "wet outside"); the third equality follows since whether or not it is "raining" is not relevant for figuring out whether "raining->wet outside"].
So the product rule is not violated.
P("wet outside") >= P("wet outside" and "raining") = P("wet outside" | "raining") P("raining") = 0.5
Where the inequality is actually an equality because our prior was P("wet outside") = 0.5. Once the proof p that "raining->wet outside" is obtained, we can update this to
P("wet outside" | p) >= P("wet outside" and "raining" | p) = P("wet outside" | "raining", p) P("raining" | p) = 1
But there is still no product rule violation because
P("wet outside" | p) = P("wet outside" | "raining", p) P("raining" | p) + P("wet outside" | not "raining", p) P(not "raining" | p) = P("wet outside" | "raining", p) P("raining" | p) = 1.
In a nutshell: you need three pieces of information to apply this classical chain of reasoning; A, B, and A->B. All three of these propositions should have priors. Then everything seems fine to me. It seems to me you are neglecting the proposition "A->B", or rather assuming its truth value to be known, when we are explicitly saying that the robot does not know this.
edit: I just realised that I was lucky for my first inequality to work out; I assumed I was free to choose any prior for P("wet outside"), but it turns out I am not. My priors for "raining" and "raining->wet outside" determine the corresponding prior for "wet outside", in order to be compatible with the product rule. I just happened to choose the correct one by accident.
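Just to check my own arithmetic, here is the calculation above as a few lines of Python; the numbers are exactly the priors assumed in this comment, with P("wet outside" | "raining", not "raining->wet outside") = 0 since the negated implication entails not "wet outside".

```python
# Numerical check of the calculation above, using the priors assumed in the comment.
P_rain = 1.0                        # "raining" was observed
P_impl = 0.5                        # prior for "raining->wet outside"
P_wet_given_rain_impl = 1.0         # P("wet outside" | "raining", "raining->wet outside")
P_wet_given_rain_not_impl = 0.0     # the negated implication entails not "wet outside"

# marginalise over whether the implication holds
P_wet_given_rain = (P_wet_given_rain_impl * P_impl
                    + P_wet_given_rain_not_impl * (1 - P_impl))
P_wet_and_rain = P_wet_given_rain * P_rain

print(P_wet_given_rain)   # 0.5
print(P_wet_and_rain)     # 0.5, consistent with the prior P("wet outside") = 0.5

# after conditioning on a proof p of the implication, P_impl goes to 1 and the same
# expressions give P("wet outside" | "raining", p) = 1, still with no product-rule violation
P_impl_given_p = 1.0
print(P_wet_given_rain_impl * P_impl_given_p
      + P_wet_given_rain_not_impl * (1 - P_impl_given_p))   # 1.0
```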
"But it turns out that there is one true probability distribution over mathematical statements, given the axioms. The right distribution is obtained by straightforward application of the product rule - never mind that it takes 4^^^3 steps - and if you deviate from the right distribution that means you violate the product rule at some point."
This does not seem right to me. I feel like you are sneakily trying to condition all of the robot's probabilities on mathematical proofs that it does not have a-priori. E.g. consider A, A->B, therefore B. To learn that P(A->B)=1, the robot has to do a big calculation to obtain the proof. After this, it can conclude that P(B|A,A->B)=1. But before it has the proof, it should still have some P(B|A)!=1.
Sure, it seems tempting to call the probabilities you would have after obtaining all the proofs of everything the "true" probabilities, but to me it doesn't actually seem different to the claim that "after I roll my dice an infinity of times, I will know the 'true' probability of rolling a 1". I should still have some beliefs about a one being rolled before I have observed vast numbers of rolls.
In other words I suggest that proof of mathematical relationships should be treated exactly the same as any other data/evidence.
edit: in fact surely one has to consider this so that the robot can incorporate the cost of computing the proof into its loss function, in order to decide if it should bother doing it or not. Knowing the answer for certain may still not be worth the time it takes (not to mention that even after computing the proof the robot may still not have total confidence in it; if it is a really long proof, the probability that cosmic rays have caused lots of bit-flips to mess up the logic may become significant). If the robot knows it cannot ever get the answer with sufficient confidence within the given time constraints, it must choose an action which accounts for this. And the logic it uses should be just the same as how it knows when to stop rolling dice.
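Here is a toy version of that trade-off in Python, just to show the shape of the decision; the bet, the payoffs, and the proof costs are all invented for illustration.

```python
# Toy sketch, assuming an invented decision problem: the robot is offered a bet paying
# +1 if B and -1 otherwise, currently assigns P(A->B) = 0.5, and a completed proof
# would settle A->B. With P(B | A, not A->B) = 0 and A known, its current P(B) just
# equals its belief in the implication.
def expected_gain(p_B, accept):
    return (p_B * 1.0 + (1 - p_B) * (-1.0)) if accept else 0.0

p_impl = 0.5          # current degree of belief that A->B
p_B_now = p_impl      # current P(B), per the assumption above

# best it can do acting on current beliefs
act_now = max(expected_gain(p_B_now, True), expected_gain(p_B_now, False))

# if it computes the proof first, it acts optimally in whichever branch obtains
after_proof = (p_impl * max(expected_gain(1.0, True), expected_gain(1.0, False))
               + (1 - p_impl) * max(expected_gain(0.0, True), expected_gain(0.0, False)))

for cost in [0.1, 0.4, 0.7]:
    worth_it = (after_proof - cost) > act_now
    print(f"proof costs {cost}: compute it first? {worth_it}")
```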
edit2: I realised I was a little sloppy above; let me make it clearer here:
The robot knows P(B|A,A->B)=1 a priori. But it does not know "A->B" is true a priori. It therefore calculates
P(B|A) = P(B|A,A->B) P(A->B|A) + P(B|A,not A->B) P(not A->B|A) = P(A->B|A)
After it obtains proof that "A->B", call this p, we have P(A->B|A,p) = 1, so
P(B|A,p) = P(B|A,A->B,p) P(A->B|A,p) + P(B|A,not A->B,p) P(not A->B|A,p)
collapses to
P(B|A,p) = P(B|A,A->B,p) = P(B|A,A->B) = 1
But I don't think it is reasonable to skip straight to this final statement, unless the cost of obtaining p is negligible.
edit3: If this somehow violates Savage's or Cox's theorems, I'd like to know why :).
Perhaps, though, you could argue it differently. I have been trying to understand so-called "operational" subjective statistical methods recently (as advocated by Frank Lad and his friends), and he insists on only calling a thing a [meaningful, I guess] "quantity" when there is some well-defined operational procedure for measuring what it is. For him, "measuring" does not rely on a model; he is referring to reading numbers off some device or other, I think. I don't quite understand him yet, since it seems to me that the numbers reported by devices all rely on some model or other to define them, but maybe one can argue their way out of this...
Thanks, this seems interesting. It is pretty radical; he is very insistent on the idea that for all 'quantities' about which we want to reason there must be some operational procedure we can follow in order to find out what they are. I don't know what this means for the ontological status of physical principles, models, etc., but I can at least see the naive appeal... it makes it hard to understand why a model could ever have the power to predict new things we have never seen before though, like Higgs bosons...
An example of a "true number" is mass. We can measure the mass of a person or a car, and we use these values in engineering all the time. An example of a "fake number" is utility. I've never seen a concrete utility value used anywhere, though I always hear about nice mathematical laws that it must obey.
It is interesting that you choose mass as your prototypical "true" number. You say we can "measure" the mass of a person or car. This is true in the sense that we have a complex physical model of reality, and at one of the most superficial levels of this model (Newtonian mechanics) there exist some abstract numbers which characterise the motions of "objects" in response to "forces". So "measuring" mass seems to only mean that we collect some data, fit this Newtonian model to that data, and extract relatively precise values for this parameter we call "mass".
Most of your examples of "fake" numbers seem to me to be definable in exactly analogous terms. Your main gripe seems to be that different people try to use the same word to describe parameters in different models, or perhaps that there do not even exist mathematical models for some of them; do you agree? To use a fun phrase I saw recently, the problem is that we are wasting time with "linguistic warfare" when we should be busy building better models?
Sure, I don't want to suggest we only use the word 'probability' for epistemic probabilities (although the world might be a better place if we did...), only that if we use the word to mean different sorts of probabilities in the same sentence, or even whole body of text, without explicit clarification, then it is just asking for confusion.
Hmm, do you know of any good material to learn more about this? I am actually extremely sympathetic to any attempt to rid model parameters of physical meaning; I mean in an abstract sense I am happy to have degrees of belief about them, but in a prior-elucidation sense I find it extremely difficult to argue about what it is sensible to believe a-priori about parameters, particularly given parameterisation dependence problems.
I am a particle physicist, and a particular problem I have is that parameters in particle physics are not constant; they vary with renormalisation scale (roughly, energy of the scattering process), so that if I want to argue about what it is a-priori reasonable to believe about (say) the mass of the Higgs boson, it matters a very great deal what energy scale I choose to define my prior for the parameters at. If I choose (naively) a flat prior over low-energy values for the Higgs mass, it implies I believe some really special and weird things about the high-scale Higgs mass parameter values (they have to be fine-tuned to the bejesus); while if I believe something more "flat" about the high scale parameters, it in turn implies something extremely informative about the low-scale values, namely that the Higgs mass should be really heavy (in the Standard Model - this is essentially the Hierarchy problem, translated into Bayesian words).
Anyway, if I can more directly reason about the physically observable things and detach from the abstract parameters, it might help clarify how one should think about this mess...
Hmm, interesting. I will go and learn more deeply what de Finetti was getting at. It is a little confusing... in this simple case ok fine p can be defined in a straightforward way in terms of the predictive distribution, but in more complicated cases this quickly becomes extremely difficult or impossible. For one thing, a single model with a single set of parameters may describe outcomes of vastly different experiments. E.g. consider Newtonian gravity. Ok fine strictly the Newtonian gravity part of the model has to be coupled to various other models to describe specific details of the setup, but in all cases there is a parameter G for the universal gravitation constant. G impacts on the predictive distributions for all such experiments, so it is pretty hard to see how it could be defined in terms of them, at least in a concrete sense.
Are you referring to De Finetti's theorem? I can't say I understand your point. Does it relate to the edit I made shortly before your post? I.e., given a stochastic model with some parameters, you then have degrees of belief about certain outcomes, some of which may seem almost the same thing as the parameters themselves? I still maintain that the two are quite different: parameters characterise probability distributions, and just in certain cases happen to coincide with conditional degrees of belief. In this 'beliefs about beliefs' context, though, it is the parameters we have degrees of belief about; we do not have degrees of belief about the conditional degrees of belief with which said parameters may happen to coincide.
"Jonah was looking at probability distributions over estimates of an unknown probability (such as the probability of a coin coming up heads)"
It sounds like you are just confusing epistemic probabilities with propensities, or frequencies. I.e., due to physics, the shape of the coin, and your style of flipping, a particular set of coin flips will have certain frequency properties that you can characterise by a bias parameter p, which you call "the probability of landing on heads". This is just a parameter of a stochastic model, not a degree of belief.
However, you can have a degree of belief about what p is no problem. So you are talking about your degree of belief that a set of coin flips has certain frequentist properties, i.e. your degree of belief in a particular model for the coin flips.
edit: I could add that GIVEN a stochastic model you then have degrees of belief about whether a given coin flip will result in heads. But this is a conditional probability: see my other comment in reply to Vanvier. This is not, however, "beliefs about beliefs". It is just standard Bayesian modelling.
"I view these sorts of distributions over distributions as that- there's some continuous parameter potentially in the world (the proportion of white and black balls in the urn), and that continuous parameter may determine my subjective probability about binary events (whether ball #1001 is white or black)."
To me this just sounds like standard conditional probability. E.g. let p(x|I) be your subjective probability distribution over the parameter x (fraction of white balls in urn), given prior information I. Then
p("ball 1001 is white"|I) = integral_x { p("ball 1001 is white"|x,I)*p(x|I) } dx
So your belief in "ball 1001 is white" gets modulated by your belief distribution over x, sure. But I wouldn't call this a "distribution over a distribution". Yes, there is a set of likelihoods p("ball 1001 is white"|x,I) which specify your subjective degree of belief in "ball 1001 is white" GIVEN various x, but in the end you want your degree of belief in "ball 1001 is white" considering ALL values that x might have and their relative plausibilities, i.e. you want the marginal likelihood to make your predictions.
(my marginalisation here ignores hypotheses outside the domain implied by there being a fraction of balls in the urn...)
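To make that marginalisation concrete, here is a small numerical version in Python; the Beta(8, 4) prior over x is an arbitrary stand-in for whatever p(x|I) actually is.

```python
# Numerical version of the marginal likelihood above. The Beta(8, 4) prior over x
# (the fraction of white balls) is an assumption made purely for illustration.
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

x = np.linspace(0.0, 1.0, 2001)
p_x_given_I = stats.beta.pdf(x, 8, 4)      # p(x | I), assumed shape
p_white_given_x = x                        # p("ball 1001 is white" | x, I)

p_white_given_I = trapezoid(p_white_given_x * p_x_given_I, x)
print(p_white_given_I)   # ~0.667, the prior mean of x under this assumed prior
```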
Lol ok, so long as I get my answer eventually :p.
Was the "Putting in the Numbers" post the one you were referring to? You didn't post that on Saturday, but now it is Monday and there doesn't seem be a third post. Anyway I did not see this question answered anywhere in "Putting in the Numbers"...
Yeah, I think -integral( p*log(p) ) is it. The simplest problem is that if I have some parameter x to which I want to assign a prior (perhaps not over the whole real set, so it can be proper as you say -- the boundaries can be part of the maxent condition set), then via the maxent method I will get a different prior depending on whether I happen to assign the distribution over x, or x^2, or log(x), etc. That is, the prior pdf obtained for one parameterisation is not related to the one obtained for a different parameterisation by the correct transformation rule for probability density functions; that is, they contain logically different information. This is upsetting if you have no reason to prefer one parameterisation over another.
In the simplest case where you have no constraints except the boundaries, and maybe expect to get a flat prior (I don't remember if you do when there are boundaries... I think you do in 1D at least) then it is most obvious that a prior flat in x contains very different information to one flat in x^2 or log(x).
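A quick way to see the parameterisation dependence numerically (the interval and the event here are arbitrary choices):

```python
# A prior flat in x on [1, 10] and a prior flat in log(x) over the same range assign
# different probabilities to the same event, here "x < 5", so they encode logically
# different information about x.
import numpy as np

lo, hi, cut = 1.0, 10.0, 5.0

p_flat_in_x = (cut - lo) / (hi - lo)
p_flat_in_logx = (np.log(cut) - np.log(lo)) / (np.log(hi) - np.log(lo))

print(p_flat_in_x)      # ~0.444
print(p_flat_in_logx)   # ~0.699
```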
Referring to this:
"Simply knowing the fact that the entropy is concave down tells us that to maximize entropy we should split it up as evenly as possible - each side has a 1/4 chance of showing."
Ok, that's fine for discrete events, but what about continuous ones? That is, how do I choose a prior for real-valued parameters that I want to know about? As far as I am aware, MAXENT doesn't help me at all here, particularly as soon as I have several parameters, and no preferred parameterisation of the problem. I know Jaynes goes on about how continuous distributions make no sense unless you know the sequence whose limit you took to get there, in which case problem solved, but I have found this most unhelpful in solving real problems where I have no preference for any particular sequence, such as in models of fundamental physics.
It would have been kind of impossible to work on AI in 1850, before even modern set theory was developed. Unless by work on AI, you mean work on mathematical logic in general.
Ok, but do you really mean that sentence how it is written? To me it means the same thing as saying that assigning probability to anything is logically equivalent to assigning probability to 0=1 (which I am perfectly happy to do, so if that is the point then fine, but that doesn't seem to be your implication).
"But to assign some probability to the wrong answer is logically equivalent to assigning probability to 0=1."
Only if you know it is the wrong answer. You say the robot doesn't know, so what's the problem? We assign probabilities to propositions which are wrong all the time, before we know if they are wrong or not.
The statistics also remains important at the frontier of high energy physics. Trying to reason about what models are likely to replace the Standard Model is plagued by every issue in the philosophy of statistics that you can imagine. And the arguments about this affect where billions of dollars' worth of research funding end up (build bigger colliders? more dark matter detectors? satellites?).
I can't disagree with that :p. I will concede that the survey question needs some refinement.
Hmm, I couldn't agree with that latter definition. Physics is just the "map" after all, and we are always improving it. Mathematics (or some future "completed" mathematics) seems to me to be the space of things that are possible. I am not certain, but this might be along the lines of what Wittgenstein means when he says things like
"In logic nothing is accidental: if a thing can occur in an atomic fact the possibility of that atomic fact must already be prejudged in the thing.
If things can occur in atomic facts, this possibility must already lie in them.
(A logical entity cannot be merely possible. Logic treats of every possibility, and all possibilities are its facts.)" (from the Tractatus - possibly he undoes all this in his later work, which I have yet to read...)
This is a tricky nest of definitions to unravel, of course. I prefer to not call anything supernatural unless it lies outside the "true" order of reality, not just if it isn't on our map yet. I am a physicist though, and it is hard for me to see the logical possibility of anything outside the "true" order of the universe. Nevertheless, it seems to me like this is what people intend when they talk about God. But then they also try to prove that He must exist from logical arguments. These goals seem contradictory to me, but I guess that's why I'm an atheist :p.
I don't know where less "transcendent" supernatural entities fit into this scheme, of course. Magic powers and vampires etc. need not necessarily defy logical description; they just don't seem to exist.
I agree that in the end, banishing the word supernatural is probably the easiest way to go :p.
But don't you think there is an important distinction between events that defy logical description of any kind, and those that merely require an outlandish multi-layered reality to explain? I admit I can't think of anything that could occur in our world that cannot be explained by the simulation hypothesis, but assuming that some world DOES exist outside the layers of nested simulation, I can (loosely speaking) imagine that some things really are logically impossible there. And that if the inhabitants of that world observe such impossible events, well, they will wrongly conclude that they are in a simulation, but actually there will be truly supernatural happenings afoot.
I mention this somewhat pointless story just because religious philosophers would generally not accept that God is merely supernatural in your sense; I think they would insist on something closer to my sense, nonsense though it may be.
I'm no theologian, but it seems to me that this view of the supernatural does not conform to the usual picture of God philosophers put forward, in terms of being the "prime mover" and so on. They are usually trying to solve the "first cause" problem, among other things, which doesn't really mesh with God as the super-scientist, since one is still left wondering about where the world external to the simulation comes from.
I agree that my definition of the supernatural is not very useful in practice, but I think it is necessary if one is talking about God at all :p. What other word should we use? I quite like your suggested "extra-natural" for things not of this world, which leaves supernatural for things that indeed transcend the constraints of logic.
To me, the simulation hypothesis definitely does not imply a supernatural creator. 'Supernatural' implies 'unconstrained by natural laws', at least to me, and I see no reason to expect that the simulation creators are free from such constraints. Sure, it means that supernatural-seeming events can in principle occur inside the simulation, and the creators need not be constrained by the laws of the simulation since they are outside of it, but I fully expect that some laws or other would govern their behaviour.
You don't think people here have a term for their survey-completing comrades in their cost function? Since I probably won't win either way this term dominated my own cost function, so I cooperated. An isolated defection can help only me, whereas an isolated cooperation helps everyone else and so gets a large numerical boost for that reason.
Lol, I cooperated because $60 was not a large enough sum of money for me to really care about trying to win it, and in the calibration I assumed most people would feel similarly. Reading your reasoning here, however, it is possible I should have accounted more strongly for people who like to win just for the sake of winning, a group that may be larger here than in the general population :p.
Edit: actually that's not really what I mean. I mean people who want to make a rational choice to maximise the probability of winning for its own sake, even if they don't actually care about the prize. I prefer that someone gets $60 and is pleasantly surprised to have won, than that I get $1. I predict that overall happiness is increased more this way, at negligible cost to myself. Even if the person who wins defected.
It defined "God" as supernatural didn't it? In what sense is someone running a simulation supernatural? Unless you think for some reason that the real external world is not constrained by natural laws?
In this case, Feynman is worth listening to slowly. There is something about the way he explains this that the transcript does not do justice to.
When you prove something in mathematics, at very least you implicitly assume you have made no mistakes anywhere, are not hallucinating, etc. Your "real" subjective degree of belief in some mathematical proposition, on the other hand, must take all these things into account.
For practical purposes the probability of hallucinations etc. may be very small and so you can usually ignore them. But the OP is right to demonstrate that in some cases this is a bad approximation to make.
Deductive logic is just the special limiting case of probability theory where you allow yourself the luxury of an idealised box of thought isolated from "real world" small probabilities.
edit: Perhaps I could say it a different way. It may be reasonable for certain conditional probabilities to be zero or one, so long as they are conditioned on enough assumptions, e.g. P("53 is a prime" given "I did my math correctly, I am not hallucinating, the external world is real, etc...")=1 might be achievable. But if you try to remove the conditional on all that other stuff you cannot keep this certainty.
It is not very useful to discriminate between "seeing with your eyes" and "seeing with the aid of scientific instruments". Vast amounts of information processing occurs between light landing on your retina and an image forming in your brain, so if you are happy to call looking through glasses, or a microscope, or a telescope, "seeing with your eyes" then I see no reason to make a distinction when the information-carrying particle switches from photons to electrons. Especially since we mostly use digital microscopes etc. these days.
Bayes' theorem only works with as much information as you put into it. Humans can only ever be approximate Bayesian agents. If you learn about some proposition you never thought of before, it is not a failing of Bayesian reasoning; it is just that you learn you have been doing it wrong up until that point and have to recompute everything.
I'd just like to point out that even #1 of the OP's "lessons" is far more problematic than they make it seem. Consider the statement:
"The fact that there are myths about Zeus is evidence that Zeus exists. Zeus's existing would make it more likely for myths about him to arise, so the arising of myths about him must make it more likely that he exists." (supposedly an argument of the form P(E | H) > P(E)).
So first, "Zeus's existing would make it more likely for myths about him to arise" - more likely than what? Than "a priori"? This is essentially impossible to know, since to compute P(E) you must do P(E) = sum(i) { P(E|H[i])*P(H[i]) }, i.e. marginalise over a mutually exclusive set of hypotheses (and no "Zeus" and "not Zeus" does not help, because "not Zeus" is a compound hypothesis which you also need to marginalise over).
I will grant you that it may seem plausible to guess that the average P(E|H[i]) over all possible explanations for E is lower than P(E|Zeus) (since most of them are bad explanations), but the average is weighted by the various priors P(H[i]), so if our background knowledge causes some high-likelihood explanation for E (high P(E|H[i])) to dominate the average, then P(E) may not be less than P(E|Zeus) even if P(E|Zeus) is relatively high! In which case E actually counts against the Zeus hypothesis, since P(H|E)<P(H) if P(E|H)<P(E).
Whether this is the case or not in the example is tough to say (and of course it is relative to the agent's background knowledge), but I think it is worth emphasising that it is not as easy as it seems.
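To illustrate with completely made-up numbers (every prior and likelihood below is invented, purely to show that the direction of the update depends on what dominates the prior-weighted average):

```python
# If a high-likelihood rival explanation ("humans invent such myths anyway") carries
# most of the prior, P(E) can exceed P(E|Zeus) even though P(E|Zeus) is high, and E
# then lowers the probability of Zeus. All numbers here are invented.
hypotheses = {
    "Zeus exists":           {"prior": 0.001, "likelihood": 0.90},
    "humans invent myths":   {"prior": 0.950, "likelihood": 0.99},
    "other explanations":    {"prior": 0.049, "likelihood": 0.10},
}

P_E = sum(h["prior"] * h["likelihood"] for h in hypotheses.values())
for name, h in hypotheses.items():
    posterior = h["likelihood"] * h["prior"] / P_E
    print(f"{name}: prior {h['prior']:.3f} -> posterior {posterior:.5f}")

print(f"P(E) = {P_E:.3f}, P(E|Zeus) = 0.90, so here E slightly lowers P(Zeus)")
```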
If you are introduced to 5 blue-haired Xians but no black-haired Xians, you might infer that all or most Xians have blue hair. That is a pretty obvious case of sampling bias.
If a-priori you had no reason to expect that the population was dominantly blue-haired then you should begin to suspect some alternative hypothesis, like your sampling is biased for some reason, rather than believe everyone is blue haired.
Of course, acting on beliefs is a decision-theory matter. You don't have terribly much to lose by buying a losing lottery ticket, but you have a very large amount to gain if it wins, so yes, a 1/132 chance of winning sounds well worth $20 or so.