## Posts

## Comments

**samshap** on Radical Probabilism · 2020-09-06T06:35:51.319Z · LW · GW

Hmmmm. Unfortunately I'm not sure what to say to this one except that in logical induction, there's not generally a pre-existing z we can update on like that.

So that's my real crux, and any examples with telephone calls and earthquakes etc are merely illustrative for me. (Like I said, I don't know how to actually *motivate* any of this stuff except with actual logical uncertainty, and I'm surprised that any philosophers would have become convinced just from other sorts of examples.)

I agree that the logical induction case *is* different, since it's hard to conceive of likelihoods to begin with. Basically, logical induction doesn't even include what I would call virtual evidence. But many of the examples you gave do have such a z. I think I agree with your crux, and my main critique here is just of the examples of an overly dogmatic Bayesian who refuses to acknowledge the difference between a and z. I won't belabor the point further.

I've thought of another motivating example, BTW. In wartime, your enemy deliberately sends you some verifiably true information about their force dispositions. How should you update on that? You can't use a Bayesian update, since you don't actually have a likelihood model available. You can't even attempt to learn a model from the information, since you can't be sure it's representative.

I don't get this at all! What do you mean?

By model M, I mean an algorithm that generates likelihood functions, so M(H,Z) = P(Z|H).

So any time we talk about a likelihood P(Z|H), it should really read P(Z|H,M). We'll posit that P(H,M) = P(H)P(M) (i.e. that the model says nothing about our priors), but this isn't strictly necessary.

E(P(Z|H,M)) will be higher for a well calibrated model than a poorly calibrated model, which means that we expect P(H,M|Z) to also be higher. When we then marginalize over the models to get a final posterior on the hypothesis P(H|Z), it will be dominated by the well-calibrated models: P(H|Z) = SUM_i P(H|M_i,Z)P(M_i|Z).
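A small numerical sketch of that marginalization, in Python. The two models, their likelihood tables, and the priors are made up for illustration, and the function names are hypothetical; the only thing taken from the text is the independence assumption P(H,M) = P(H)P(M) and the sum over models.

```python
def joint_update(prior_h, prior_m, lik):
    """Posterior over hypotheses and models after observing Z.

    prior_h: {h: P(h)}, prior_m: {m: P(m)}, lik: {m: {h: P(Z | h, m)}}.
    Assumes P(H, M) = P(H) P(M), as posited above.
    """
    joint = {(h, m): prior_h[h] * prior_m[m] * lik[m][h]
             for h in prior_h for m in prior_m}
    evidence = sum(joint.values())  # P(Z)
    post_h = {h: sum(joint[h, m] for m in prior_m) / evidence for h in prior_h}
    post_m = {m: sum(joint[h, m] for h in prior_h) / evidence for m in prior_m}
    return post_h, post_m


def mixture_form(prior_h, prior_m, lik):
    """Same P(H|Z), computed as SUM_i P(H | M_i, Z) P(M_i | Z)."""
    _, post_m = joint_update(prior_h, prior_m, lik)
    result = {h: 0.0 for h in prior_h}
    for m in prior_m:
        z_given_m = sum(prior_h[h] * lik[m][h] for h in prior_h)  # P(Z | M_i)
        for h in prior_h:
            result[h] += prior_h[h] * lik[m][h] / z_given_m * post_m[m]
    return result


# Made-up numbers: one model discriminates between hypotheses, one does not.
prior_h = {"H": 0.8, "not_H": 0.2}
prior_m = {"sharp": 0.5, "vague": 0.5}
lik = {"sharp": {"H": 0.9, "not_H": 0.1}, "vague": {"H": 0.5, "not_H": 0.5}}

post_h, post_m = joint_update(prior_h, prior_m, lik)
mixed = mixture_form(prior_h, prior_m, lik)
print(post_h, post_m, mixed)
```

With these numbers the model that assigned the higher probability to Z gains posterior weight, and the two ways of computing P(H|Z) agree up to floating-point error.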

BTW, I had a chance to read part of the ILA paper. It barely broke my brain at all! I wonder if the trick of enumerating traders and incorporating them over time could be repurposed to a more Bayesianish context, by instead enumerating models M. Like the trading firm in ILA, a meta-Bayesian algorithm could keep introducing new models M_k over time, with some intuition that the calibration of the best model in the set would improve over time, perhaps giving it all those nice anti-Dutch-book properties. Basically this is a computable Solomonoff induction that slowly approaches completeness in the limit. (I'm pretty sure this is not an original idea. I wouldn't be surprised if something like this contributed to the ILA itself.)

Of course, it's pretty unclear how this would work in the logical induction case. This might all be better explained in its own post.

**samshap** on Radical Probabilism · 2020-08-25T03:51:32.599Z · LW · GW

You're right, you *could* have an event in the event space which is just "the virtual-evidence update [such-and-such]". I'm actually going to pull out this trick in a future follow-up post.

I note that that's not how Pearl or Jeffrey understand these updates. And it's a peculiar thing to do -- something happens to make you update a particular amount, but you're just representing the event by the amount you update. Virtual evidence as-usually-understood at least coins a new symbol to represent the hard-to-articulate thing you're updating on.

That's not quite what I had in mind, but I can see how my 'continuously valued' comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:

**b** - my house was burgled

**a** - my alarm went off

**z** - my neighbor calls to tell me the alarm went off

Pearl's method is to take what would be uncertain information about **a** (via my model of my neighbor and the fact she called me) and transform it into virtual evidence (which includes the likelihood ratio). What I'm saying is that you can just treat **z** as being an event itself, and do a Bayesian update from the likelihood P(**z**|**b**) = P(z|a)P(a|b) + P(z|~a)P(~a|b), etc. This will give you the exact same posterior as Pearl. Really, the only difference in these formulations is that Pearl only needs to know the ratio P(z|a):P(z|~a), whereas a traditional Bayesian update requires actual values. Of course, any set of values consistent with the ratio will produce the right answer.
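To make the last point concrete, here is a minimal Python sketch of the update on **z**. All the numbers are invented for illustration (they are not taken from the paper), the function name is hypothetical, and the code assumes that **z** depends on **b** only through **a**.

```python
def posterior_b(prior_b, p_a_given_b, p_a_given_not_b, p_z_given_a, p_z_given_not_a):
    """P(b | z), treating the phone call z as an ordinary event.

    Assumes the chain b -> a -> z, so z is independent of b given a.
    """
    # P(z|b) = P(z|a)P(a|b) + P(z|~a)P(~a|b), and likewise for ~b.
    p_z_b = p_z_given_a * p_a_given_b + p_z_given_not_a * (1 - p_a_given_b)
    p_z_not_b = (p_z_given_a * p_a_given_not_b
                 + p_z_given_not_a * (1 - p_a_given_not_b))
    numerator = p_z_b * prior_b
    return numerator / (numerator + p_z_not_b * (1 - prior_b))


# Illustrative values only: P(b), P(a|b), P(a|~b).
PRIOR_B, P_A_B, P_A_NOT_B = 0.0001, 0.95, 0.01

# Two different likelihood pairs that share the same 4:1 ratio P(z|a):P(z|~a).
post_1 = posterior_b(PRIOR_B, P_A_B, P_A_NOT_B, 0.8, 0.2)
post_2 = posterior_b(PRIOR_B, P_A_B, P_A_NOT_B, 0.4, 0.1)

# A different ratio (2:1) for comparison.
post_3 = posterior_b(PRIOR_B, P_A_B, P_A_NOT_B, 0.8, 0.4)

print(post_1, post_2, post_3)
```

The first two posteriors agree up to floating-point error because the common scale factor cancels between numerator and denominator; only changing the ratio, as in the third call, moves the answer.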

The slightly more complex case (and why I mentioned continuous values) is in section 5, where the message includes probability data, such as a likelihood ratio. Note that the continuous value *is not the amount you update* (at least not generally), because it's not generated from your own models, but rather by the messenger. Consider event **z99**, where my neighbor calls to say she's 99% sure the alarm went off. This doesn't mean I have to treat P(z99|b):P(z99|~b) as 99:1 - I might model my neighbor as being poorly calibrated (or as not being independent of other information I already have), and use some other ratio.

In what sense? What technical claim about Bayesian updates are you trying to refer to?

Definitely the second one, as optimal update policy. Responding to your specific objections:

This is only true if the only information we have coming in is a sequence of propositions which we are updating 100% on.

As you'll hopefully agree with at this point, we can always manufacture the 100% condition by turning it into virtual evidence.

This optimality property *only* makes sense if we believe something like grain-of-truth.

I believe I previously conceded this point - the true hypothesis (or at least a 'good enough' one) must have a nonzero probability, which we can't guarantee.

But properties such as calibration and convergence also have intuitive appeal

Re: calibration - I still believe that this can be included if you are jointly estimating your model and your hypothesis.

Re: convergence - how real of a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001?

(By the way, I really appreciate your in-depth engagement with my position.)

Likewise! This has certainly been educational, especially in light of this:

Sadly, the actual machinery of logical induction was beyond the scope of this post, but there are answers. I just don't yet know a good way to present it all as a nice, practical, intuitively appealing package.

The solution is too large to fit in the margins, eh? j/k, I know there's a real paper. Should I go break my brain trying to read it, or wait for your explanation?

**samshap** on Radical Probabilism · 2020-08-25T01:50:30.038Z · LW · GW

Phew! Thanks for de-gaslighting me.

**samshap** on Radical Probabilism · 2020-08-22T05:58:26.619Z · LW · GW

I definitely missed a few things on the first read through - thanks for repeating the ratio argument in your response.

I'm still confused about this statement:

Virtual evidence requires probability functions to take arguments which aren't part of the event space.

Why can't virtual evidence messages be part of the event space? Is it because they are continuously valued?

As to *why* one would want to have Bayesian updates be normative: one answer is that they maximize our predictive power, given sufficient compute. Given the name of this website, that seems a sufficient reason.

A second answer you hint at here:

The second seems more practical for the working Bayesian.

As a working Bayesian myself, having a practical update rule is quite useful! As far as I can tell, I don't see a good alternative in what you have provided.

Then we have to ask *why not* (steelmanned) classical Bayesianism? I think you have two arguments, one of which I buy, the other I don't.

The practical problem with this, in contrast to a more radical-probabilism approach, is that the probability distribution then has to explicitly model all of that stuff.

This is the weak argument. Computing P(A*|X) "the likelihood I recall seeing A given X" is not a fundamentally different thing than modeling P(A|X) "the likelihood signal A happened given X". You have to model an extra channel effect or two, but that's just a difference of degree.

Immediately after, though, you have the better argument:

As Scott and I discussed in Embedded World-Models, classical Bayesian models require the world to be in the hypothesis space (AKA realizability AKA grain of truth) in order to have good learning guarantees; so, in a sense, they require that the world is smaller than the probability distribution. Radical probabilism does not rest on this assumption for good learning properties.

If I were to paraphrase: classical Bayesianism can fail entirely when the world state does not fit into one of its nonzero-probability hypotheses, which must of necessity be limited in any realizable implementation.

I find this pretty convincing. In my experience this is a problem that crops up quite frequently, and requires meta-Bayesian methods you mentioned like calibration (to notice you are confused) and generation of novel hypotheses.

(Although Bayesianism is not completely dead here. If you reformulate your estimation problem to be over the hypothesis space and model space jointly, then Bayesian updates can get you the sort of probability shifts discussed in Pascal's Muggle. Of course, you still run into the 'limited compute' problem, but in many cases it might be easier than attempting to cover the entire hypothesis space. Probably worth a whole other post by itself.)

**samshap** on Radical Probabilism · 2020-08-21T04:18:31.796Z · LW · GW

Why is a dogmatic Bayesian not allowed to update on virtual evidence? It seems like you (and Jeffrey?) have overly constrained the types of observations that a classical Bayesian is allowed to use, to essentially sensory stimuli. It seems like you are attacking a strawman, given that by your definition, Pearl isn't a classical Bayesian.

I also want to push back on this particular bit:

Richard Jeffrey (RJ): Tell me one piece of information you're absolutely certain of in such a situation.

DP: I'm certain I had that experience, of looking at the cloth.

RJ: Surely you aren't 100% sure you were looking at cloth. It's merely very probable.

DP: Fine then. The experience of looking at ... what I was looking at.

I'm pretty sure we can do better. How about:

**DP:** *Fine then. I'm certain I remember believing that I had seen that cloth.*

For an artificial dogmatic probabilist, the equivalent might be:

**ADP:** *Fine then. I'm certain of evidence A\*: that my probability inference algorithm received a message with information about an observation A.*

Essentially, we update on A* instead of A. When we compute the likelihood P(A*|X), we can attempt to account for all the problems with our senses, neurons, memory, etc. that result in P(A*|~A) > 0.

RJ still has a counterpoint here:

RJ: Again I doubt it. You're engaging in inner-outer hocus pocus.* There is no clean dividing line before which a signal is external, and after which that signal has been "observed". The optic nerve is a noisy channel, warping the signal. And the output of the optic nerve itself gets processed at V1, so the rest of your visual processing doesn't get direct access to it, but rather a processed version of the information. And all this processing is noisy. Nowhere is anything certain. Everything is a guess. If, anywhere in the brain, there were a sharp 100% observation, then the nerves carrying that signal to other parts of the brain would rapidly turn it into a 99% observation, or a 90% observation...

But I don't find this compelling. At some point there is a boundary to the machinery that's performing the Bayesian update itself. If the message is being degraded *after* this point, then that means we're no longer talking about a Bayesian updater.

**samshap** on The Ghost of Joseph Weber · 2020-07-14T01:28:45.785Z · LW · GW

Thanks for presenting your thesis. However, one of your figures doesn't support your argument on closer inspection. The figure that you point to as being the 'unfiltered' data is measuring cross-correlation between the Hanford and Livingston datasets, so we should expect it to look completely different from the datasets themselves.

I also want to push back on a particular point - there's nothing wrong in principle with using a black-hole shaped filter to find black holes. You just have to adjust the prior based on the complexity of your filter.

**samshap** on The Hammer and the Dance · 2020-03-21T05:51:24.044Z · LW · GW

I've been lurking on lesswrong for years, and this is the article that actually got me to create an account. I am promoting this to everyone I can who has a scrap of political influence - my bosses (I work at a major university), my local newspaper, my rabbis, my local politicians. Every state in the country should be enacting the same measures as New York and Texas.

I would urge the lesswrong community to

a: constructively critique the article as Chris recommends (use argument to make it stronger)

b: shut up and do the impossible - if your state governor hasn't already shut down restaurants, public gatherings, and restricted all non-essential travel, get them to do it ASAP. If we figured out how to get a handler to unbox a superhuman intelligence, and how to defeat Voldemort, we at least owe this an attempt.