# Frequentist Magic vs. Bayesian Magic

post by Wei_Dai · 2010-04-08T20:34:44.866Z · score: 43 (43 votes) · LW · GW · Legacy · 79 comments

[I posted this to open thread a few days ago for review. I've only made some minor editorial changes since then, so no need to read it again if you've already read the draft.]

This is a belated reply to cousin_it's 2009 post Bayesian Flame, which claimed that frequentists can give calibrated estimates for unknown parameters without using priors:

And here's an ultra-short example of what frequentists can do: estimate 100 independent unknown parameters from 100 different sample data sets and have 90 of the estimates turn out to be true to fact afterward. Like, fo'real. Always 90% in the long run, truly, irrevocably and forever.

And indeed they can. Here's the simplest example that I can think of that illustrates the spirit of frequentism:

Suppose there is a machine that produces biased coins. You don't know how the machine works, except that each coin it produces is either biased towards heads (in which case each toss of the coin will land heads with probability .9 and tails with probability .1) or towards tails (in which case each toss of the coin will land tails with probability .9 and heads with probability .1). For each coin, you get to observe one toss, and then have to state whether you think it's biased towards heads or tails, and what is the probability that's the right answer.

Let's say that you decide to follow this rule: after observing heads, always answer "the coin is biased towards heads with probability .9" and after observing tails, always answer "the coin is biased towards tails with probability .9". Do this for a while, and it will turn out that 90% of the time you are right about which way the coin is biased, no matter how the machine actually works. The machine might always produce coins biased towards heads, or always towards tails, or decide based on the digits of pi, and it wouldn't matter—you'll still be right 90% of the time. (To verify this, notice that in the long run you will answer "heads" for 90% of the coins actually biased towards heads, and "tails" for 90% of the coins actually biased towards tails.) No priors needed! Magic!
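The calibration claim is easy to check by simulation. Here's a quick sketch (my own illustrative code, not anything from the original discussion) of the rule above, run against three very different machines:

```python
import random

def frequentist_calibration(machine, n_coins=100_000, seed=0):
    """Follow the rule: after one observed toss, guess the coin is biased
    towards whichever side came up. `machine(i)` returns True if coin i is
    head-biased. Returns the fraction of coins whose bias we guess right."""
    rng = random.Random(seed)
    correct = 0
    for i in range(n_coins):
        head_biased = machine(i)
        toss_is_heads = rng.random() < (0.9 if head_biased else 0.1)
        correct += (toss_is_heads == head_biased)  # the rule's guess
    return correct / n_coins

pi_digits = "1415926535"  # "decide based on the digits of pi"
print(frequentist_calibration(lambda i: True))        # always head-biased
print(frequentist_calibration(lambda i: i % 2 == 0))  # alternating
print(frequentist_calibration(lambda i: int(pi_digits[i % 10]) % 2 == 0))
```

Whatever machine you plug in, the long-run frequency of correct answers comes out at about 0.9, which is exactly the quoted "magic".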

What is going on here? There are a couple of things we could say. One was mentioned by Eliezer in a comment:

It's not perfectly reliable. They assume they have perfect information about experimental setups and likelihood ratios. (Where does this perfect knowledge come from? Can Bayesians get their priors from the same source?)

In this example, the "perfect information about experimental setups and likelihood ratios" is the information that a biased coin will land the way it's biased with probability .9. I think this is a valid criticism, but it's not complete. There are perhaps many situations where we have much better information about experimental setups and likelihood ratios than about the mechanism that determines the unknown parameter we're trying to estimate. This criticism leaves open the question of whether it would make sense to give up Bayesianism for frequentism in those situations.

The other thing we could say is that while the frequentist in this example appears to be perfectly calibrated, he or she is liable to pay a heavy cost for this in accuracy. For example, suppose the machine is *actually* set up to always produce head-biased coins. After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly. But the frequentist would not (or isn't supposed to) notice this. He or she would answer "the coin is head-biased with probability .9" 90% of the time, and "the coin is tail-biased with probability .9" 10% of the time, and keep doing this, irrevocably and forever.

The frequentist magic turns out to be weaker than it first appeared. What about the Bayesian solution to this problem? Well, we know that it must involve a prior, so the only question is which one. The maximum entropy prior that is consistent with the information given in the problem statement is to assign each coin an independent probability of .5 of being head-biased, and .5 of being tail-biased. It turns out that a Bayesian using this prior will give the exact same answers as the frequentist, so this is also an example of a "matching prior". (To verify: P(biased heads | observed heads) = P(OH|BH)*P(BH)/P(OH) = .9*.5/.5 = .9)
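The verification in the parenthesis can be spelled out numerically:

```python
# Maximum-entropy prior: P(BH) = P(BT) = 0.5; likelihoods from the problem.
p_bh = 0.5
p_oh_given_bh, p_oh_given_bt = 0.9, 0.1

# P(OH) by the law of total probability, then Bayes' rule.
p_oh = p_oh_given_bh * p_bh + p_oh_given_bt * (1 - p_bh)  # = 0.5
p_bh_given_oh = p_oh_given_bh * p_bh / p_oh
print(p_bh_given_oh)  # ~0.9, matching the frequentist's stated confidence
```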

But a Bayesian can do much better. A Bayesian can use a universal prior. (With a universal prior based on a universal Turing machine, the prior probability that the first 4 coins will be biased "heads, heads, tails, tails" is the probability that the UTM will produce 1100 as the first 4 bits of its output, when given a uniformly random input tape.) Using such a prior guarantees that no matter how the coin-producing machine works, as long as it doesn't involve some kind of uncomputable physics, in the long run your expected total Bayes score will be no worse than someone who knows exactly how the machine works, except by a constant (that's determined by the algorithmic complexity of the machine). And unless the machine actually settles into deciding the bias of each coin independently with 50/50 probabilities, your expected Bayes score will also be better than the frequentist (or a Bayesian using the matching prior) by an unbounded margin as time goes to infinity.
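The true universal prior is uncomputable, but even a crude computable stand-in shows the effect. Here's my own simplification (a grid posterior over a single hypothetical parameter r, the rate at which the machine produces head-biased coins) pitted against the matching prior, on a machine that always produces head-biased coins:

```python
import math
import random

# Data: a machine that ALWAYS produces head-biased coins; one toss per coin.
rng = random.Random(0)
tosses = [rng.random() < 0.9 for _ in range(2000)]  # True = heads

# Matching prior: each coin independently head-biased w.p. 0.5, so every
# observed toss has P(heads) = 0.5*0.9 + 0.5*0.1 = 0.5, forever.
score_fixed = len(tosses) * math.log(0.5)

# Richer model: coins are head-biased with unknown rate r, uniform grid
# prior over r. P(heads | r) = 0.9*r + 0.1*(1 - r) = 0.1 + 0.8*r.
grid = [i / 100 for i in range(101)]
weights = [1.0 / len(grid)] * len(grid)
score_learn = 0.0
for t in tosses:
    p_heads = sum(w * (0.1 + 0.8 * r) for w, r in zip(weights, grid))
    score_learn += math.log(p_heads if t else 1.0 - p_heads)
    # Bayes update on the observed toss, then renormalize.
    weights = [w * ((0.1 + 0.8 * r) if t else (0.9 - 0.8 * r))
               for w, r in zip(weights, grid)]
    total = sum(weights)
    weights = [w / total for w in weights]

print(score_learn > score_fixed)  # True: the learning model's Bayes score wins
```

The learning model starts out predicting P(heads) = 0.5 just like the matching prior, but its posterior over r concentrates near 1, so its per-toss Bayes score approaches that of someone who knows how the machine works.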

I consider this magic also, because I don't *really* understand why it works. Is our prior actually a universal prior, or is the universal prior just a handy approximation that we can substitute in place of the real prior? Why *does* the universe that we live in look like a giant computer? What about uncomputable physics? Just what are priors, anyway? These are some of the questions that I'm still confused about.

But as long as we're choosing between different magics, why not pick the stronger one?

## 79 comments


We can't use the universal prior in practice unless physics contains harnessable non-recursive processes. However, this is exactly the situation in which the universal prior doesn't always work. Thus, one source of the 'magic' is through allowing us to have access to higher levels of computation than the phenomena we are predicting (and to be certain of this).

Also, the constants involved could be *terrible* and there are no guarantees about this (not even probabilistic ones). It is nice to reach some ratio in the limit, but if your first Graham's number of guesses are bad, then that is very bad for (almost) all purposes.

We can't use the universal prior in practice unless physics contains harnessable non-recursive processes. However, this is exactly the situation in which the universal prior doesn't always work. Thus, one source of the 'magic' is through allowing us to have access to higher levels of computation than the phenomena we are predicting (and to be certain of this).

My position is that the uncomputability of the universal prior shouldn't count against it. I think the fact that it works so well shows that our actual prior is likely also uncomputable, and that means we have to handle uncomputable priors in our decision theory, for example by specifying that we choose the option that we can prove (or just heuristically believe) has the highest expected payoff, instead of actually computing the expected payoffs.

A worse problem is that there seems to be reason to think that our actual prior is not just uncomputable, but unformalizable. See my earlier posts on this.

A worse problem is that there seems to be reason to think that our actual prior is not just uncomputable, but unformalizable. See my earlier posts on this.

If you make ontological claims, you are bound to get in trouble. Decision theory should speak of what the agent's algorithm should do, in terms of its behavior, not what it means for the agent to do that in terms of consequences in the real world. What the agent's algorithm does is always formalizable (as that algorithm!).

(For people unfamiliar with the discussion -- see "ontology problem" in this sequence.)

If you make ontological claims, you are bound to get in trouble.

It seems that we do tend to get into trouble when we make ontological claims, but why "bound to"? Your proposed FAI, after it has extracted human values, will still have to solve the ontological problem, right? If it can, then why can't we?

You advocate "being lazy" as FAI programmers and handing off as many problems as we can to the FAI, but I'm still skeptical that *any* FAI approach will succeed in the near future, and in the mean time, I'd like to try to better understand what my own values are, and how *I* should make decisions.

Your proposed FAI, after it has extracted human values, will still have to solve the ontological problem, right? If it can, then why can't we?

I don't believe even superintelligence can solve the ontology problem completely.

You advocate "being lazy" as FAI programmers and handing off as many problems as we can to the FAI, but I'm still skeptical that any FAI approach will succeed in the near future, and in the mean time, I'd like to try to better understand what my own values are, and how I should make decisions.

A fine goal, but I doubt it can contribute to FAI design (which, even if it'll take more than a century to finish, still has to be tackled to make that possible). Am I right in thinking that you agree with that?

I don't believe even superintelligence can solve the ontology problem completely.

Why?

A fine goal, but I doubt it can contribute to FAI design (which, even if it'll take more than a century to finish, still has to be tackled to make that possible).

I'm not sure what you're referring to by "that" here. Do you mean "preserving our preferences"? Assuming you do...

Am I right in thinking that you agree with that?

No, I think we have at least two disagreements here:

- If I can figure out what my own values are, there seem to be several ways that could contribute to FAI design. The simplest way is that I program those values into the FAI manually.
- I don't think FAI is necessarily the best way to preserve our values. I can, for example, upload myself, and then carefully increase my intelligence. As you've mentioned in previous comments, this is bound to cause value drift, but such drift may be acceptable, compared to the risks involved in implementing de novo FAI.

My guess is that the root cause of these disagreements is my distrust of human math and software engineering abilities, stemming from my experiences in the crypto field. I think there is a good chance that we (unenhanced biological humans) will never find the correct FAI theory, and that in the event we think we've found the right FAI theory, it will turn out that we're mistaken. And even if we manage to get FAI theory right, it's almost certain that the actual AI code will be riddled with bugs. You seem to be less concerned with these risks.

Could you suggest a source for further reading on this?

The Bayesian using the universal prior is cheating. Kolmogorov complexity is incomputable. So you have to use a realistic compressor. Then the optimality results go away. But, in practice if your compressor is good, it's almost always going to outdo the frequentist.

Yep, the universal prior should be filed under "science fiction". And you know why it's uncomputable? Because for any computable prior I can create a coin-producing machine that will anticipate and thwart the Bayesian's expectations :-)

The frequentist approach is analogous to a minimax strategy in a game: no matter how malicious the universe is, you still get your 90%. Other, non-minimax strategies inevitably take risks to try and exploit stupid opponents. This connection was formalized by Abraham Wald, who was mentioned by Cyan in the two posts that started this whole affair.

Replying to Wei Dai's post, I still don't know which magic is the stronger one or what "stronger" actually means here. In actual statistical practice, choosing good priors clearly requires skills and techniques that aren't part of the naive Bayesian canon. This creates a doubt in my mind that just won't go away.

The frequentist approach is analogous to a minimax strategy in a game: no matter how malicious the universe is, you still get your 90%. Other, non-minimax strategies inevitably take risks to try and exploit stupid opponents.

- If 'the universe is malicious' is the right prior to use then a Bayesian will use it.
- Using a universal prior on a malicious universe can only give a score that differs by a constant from the frequentist over infinite time.
- Any sane computable approximation of the universal prior would necessarily include a weight on the hypothesis 'the universe is more evil/complex than can be represented with algorithms of the length that this approximation handles'.
- If the universe is malicious, *you probably shouldn't be doing statistics on it*.

In actual statistical practice, choosing good priors clearly requires skills and techniques that aren't part of the naive Bayesian canon.

This is true. That is when it is time to use frequentist approximations.

In actual statistical practice, choosing good priors clearly requires skills and techniques that aren't part of the naive Bayesian canon.

This is true. That is when it is time to use frequentist approximations.

Frequentist techniques can be useful in certain situations, but only because of the great difficulty in assigning accurate priors and the fact that we often have such overwhelmingly large amounts of evidence that any hypothesis with a substantial prior will probably have a posterior either near-zero or near-one if proper Bayesian reasoning is used. In those situations, judging the probability of a hypothesis by its p-values saves a lot of complicated work and is almost as good. However, when there is not overwhelming evidence or when the evidence points to a hypothesis with a negligible prior, frequentist statistics ceases to provide an adequate approximation. Always remember to think like a Bayesian, even when using frequentist methods.

Also, this was a terrible example of frequentist statistics avoiding the use of priors, because the prior was that the probability that a coin would land heads anything other than 90% or 10% of the time is zero. This is a tremendous assertion, and refusing to specify the probabilities of 90% heads vs. 10% heads just makes the prior incomplete. Saying that there is no prior is like saying that I did not make this post just because the last sentence lacks a period

because the prior was that the probability that a coin would land heads anything other than 90% or 10% of the time is zero. This is a tremendous assertion, and refusing to specify the probabilities of 90% heads vs. 10% heads just makes the prior incomplete

It has wings that are way too big for a mammal, and it lays eggs! Which just makes it an incomplete mammal.

It's really more than that. Consider the great anti-Bayesian Cosma Shalizi. He's shown that the use of a prior is really equivalent to a method of smoothing, of regularization on your hypothesis space, trading off (frequentist) bias and variance. And everyone has a regularization scheme, even if they claim it is "don't: 'let the data speak for themselves'".

Your assertion that the machine is exactly 90% biased is exactly equivalent to an evenly-split point-mass prior at 10% and 90%. A Bayesian with that prior would exactly reproduce your reasoning. You're absolutely right that it is in no way an incomplete prior -- it exactly specifies everything. One could consider it an overconfident prior, but if you do in fact know that it's either 10% or 90% and have no idea which, it's perfectly appropriate. The choice of which Frequentist statistical techniques to use is pretty much isomorphic to the choice of Bayesian priors.

EDIT: On the bias/variance trade-off, I just realized that the Frequentist prescription for binary frequency estimation s/(s+f) is, rather than a maximum entropy prior, a maximum variance prior. (To be specific it is an improper prior, the limiting case of the beta distribution, with both parameters going to zero. Although this has support everywhere in [0,1], if you try to normalize it, the integral going to infinity kills everything except delta spikes at 0 and 1. But normalizing after conditioning on data gives a reasonable prior, with the maximum likelihood estimate being the standard estimate.)
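The limiting claim in the EDIT can be checked directly: under a Beta(a, b) prior with s successes and f failures, the posterior is Beta(a + s, b + f), whose mean tends to s/(s+f) as both parameters go to zero. A small illustration (my own helper, just the conjugate-update arithmetic):

```python
# Posterior mean under a Beta(a, b) prior after s successes and f failures:
# the posterior Beta(a + s, b + f) has mean (a + s) / (a + b + s + f).
def posterior_mean(s, f, a, b):
    return (a + s) / (a + b + s + f)

s, f = 7, 3
for eps in (1.0, 0.1, 0.001, 1e-9):
    print(posterior_mean(s, f, eps, eps))  # tends to s/(s+f) = 0.7 as eps -> 0
```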

**[deleted]**· 2015-08-19T12:10:32.922Z · score: 1 (1 votes) · LW(p) · GW(p)

Cosma's writings for those interested

Consider the great anti-Bayesian Cosma Shalizi. He's shown that the use of a prior is really equivalent to a method of smoothing, of regularization on your hypothesis space, trading off (frequentist) bias and variance.

It seems odd to interpret this point as anti-Bayesian. To me it seems pro-Bayesian: it means that whenever you use a regularizer you're actually doing Bayesian inference. Any method that depends on a regularizer is open to the same critique of subjectivity to which Bayesian methods are vulnerable. Two frequentists using different regularizers will come to different conclusions based on the same evidence, and the choice of a regularizer is hardly inevitable or dictated by the problem.

If you have a link to a paper that contains anti-Bayesian arguments by Shalizi, I would be interested in reading it.

Well, it seems odd to me too. He has another rant up comparing Bayesian updating to evolution saying "okay, that's why Bayesian updating seems to actually work OK in many cases", whereas I see that as explaining why evolution works...

http://cscs.umich.edu/~crshalizi/weblog/cat_bayes.html

Is a good start. He also has a paper on the arXiv that is flat-out wrong, so ignore "The Backwards Arrow of Time of the Coherently Bayesian Statistical Mechanic", though showing how it goes wrong takes a fair bit of explaining of fairly subtle points.

He also has a paper on the arXiv that is flat-out wrong, so ignore "The Backwards Arrow of Time of the Coherently Bayesian Statistical Mechanic", though showing how it goes wrong takes a fair bit of explaining of fairly subtle points.

I've tried reading it before - for me to understand just the paper itself would also take a fair bit of explaining of fairly subtle points! I understand Shalizi's sketch of his argument in words:

Observe your system at time 0, and invoke your favorite way of going from an observation to a distribution over the system's states --- say the maximum entropy principle. This distribution will have some Shannon entropy, which by hypothesis is also the system's thermodynamic entropy. Assume the system's dynamics are invertible, so that the state at time t determines the states at times t+1 and t-1. This will be the case if the system obeys the usual laws of classical mechanics, for example. Now let your system evolve forward in time for one time-step. It's a basic fact about invertible dynamics that they leave Shannon entropy invariant, so it's still got whatever entropy it had when you started. Now make a new observation. If you update your probability distribution using Bayes's rule, a basic result in information theory shows that the Shannon entropy of the posterior distribution is, on average, no more than that of the prior distribution. There's no way an observation can make you more uncertain about the state on average, though particular observations may be very ambiguous. (Noise-free measurements would let us drop the "on average" qualifier.) Repeating this, we see that entropy decreases over time (on average).
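The one step in this sketch that is just standard information theory, that Bayes updating cannot increase expected entropy, does check out numerically. A toy example with numbers of my own choosing:

```python
import math

def H(p):  # Shannon entropy in bits
    return -sum(x * math.log2(x) for x in p if x > 0)

# Uniform prior over 4 hidden states and a noisy binary observation.
prior = [0.25] * 4
lik1 = [0.9, 0.6, 0.4, 0.1]  # P(obs = 1 | state)

p1 = sum(l * p for l, p in zip(lik1, prior))       # P(obs = 1)
post1 = [l * p / p1 for l, p in zip(lik1, prior)]  # Bayes' rule
p0 = 1.0 - p1
post0 = [(1 - l) * p / p0 for l, p in zip(lik1, prior)]

avg_post_H = p1 * H(post1) + p0 * H(post0)
print(H(prior), avg_post_H)  # average posterior entropy <= prior entropy
```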

My problem is that I know a plausible-looking argument expressed in words can still quite easily be utterly wrong in some subtle way, so I don't know how much credence to give Shalizi's argument.

The problem with the quoted argument from Shalizi is that it is describing a decrease in entropy over time of an *open* system. To track a closed system, you have to include the brain that is making observations and updating its beliefs. Making the observations requires thermodynamic work that can transfer entropy.

The frequentist approach is analogous to a minimax strategy in a game: no matter how malicious the universe is, you still get your 90%.

This is because you constrained the universe to only being able to present you with sequences of one of two possible values, both of which reveal themselves on first inspection 90% of the time.

Let the universe throw at you a machine that, for all you know, can produce any distribution or pattern of coins with any bias. Try to get your guaranteed 90% making predictions on the bias of those coins from a single flip.

For example, suppose the machine is actually set up to always produce head-biased coins. After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly. But the frequentist would not (or isn't supposed to) notice this.

They wouldn't (they aren't)?

He or she would answer "the coin is head-biased with probability .9" 90% of the time, and "the coin is tail-biased with probability .9" 10% of the time, and keep doing this, irrevocably and forever.

Only if they committed to this particular rule at the start, I would have thought...? I'm unsure what compels a frequentist to use that particular rule.

I don't understand why your universal prior passes the tape through the Turing machine. If you have a source of genuinely random tape inputs in the first place, why not just use the probability that a tape has the first four bits '1100'? And in any case, what's the advantage of invoking Turing machinery in place of the good old fair coins?

You get completely different results by doing that. The universal prior gives higher probability to simpler hypotheses -- 0000000000000000 is more likely than 0010111001001101, or at least analogous things are true once the strings get long enough. The idea is that if there's any computable regularity in what you're trying to model, as you get more data you'll recognize that regularity (i.e., your posterior will give much higher probability to futures in which the regularity continues than to ones in which it doesn't).

Ah so, I understand. But in that case, doesn't evaluating your prior require you to integrate over the space of all possible Turing machines? There is no conceptual difficulty with this, but I must say that it strikes me as computationally rather formidable for a poxy little throw-four-possibly-biased-coins problem. :)

It's not just computationally formidable, it's computationally impossible. :-)

My understanding was that it's perfectly possible to solve the Halting Problem for particular TMs.

For a TM of length *n*, you begin running it and either it terminates at some point, or it runs for BusyBeaver(*n*)+1 steps, in which case you know it doesn't ever halt.

So as long as you have a fixed length *n* that all possible TMs fit into and know the Busy Beaver for *n*, you can in fact use the universal prior. (Busy Beavers are uncomputable in general, though we know a fair number, but this is an easy condition to weaken down to something practical like high probability of being the Busy Beaver.)
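As a sketch of the proposed procedure (the `step` functions below are toy stand-ins of my own, not encoded Turing machines):

```python
def halts(step, state, bound):
    """Decide halting, given a known upper bound (e.g. BB(n) + 1) on the
    halting time of any halting machine of this size. `step` returns the
    next state, or None when the machine halts."""
    for _ in range(bound):
        state = step(state)
        if state is None:
            return True
    return False  # survived past the bound, hence never halts

countdown = lambda s: None if s == 0 else s - 1  # halts after s + 1 steps
looper = lambda s: s                             # never halts
print(halts(countdown, 5, bound=100))  # True
print(halts(looper, 0, bound=100))     # False
```

The whole burden is in knowing `bound`, which is where the Busy Beaver numbers come in.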

The fact that BB(n) is uncomputable means that your procedure doesn't, in fact, work. No?

You don't, in any case, have a fixed n. (Unless perhaps you are only interested in modelling Turing machines that fit in the universe -- but in that case, wouldn't you also want your own computations to fit in the universe?)

- Finding BB(n) for all n is uncomputable. Finding BB(n) where n=10, say, is not uncomputable.
- Yeah, this is an interesting point. My thinking is that yes, we do only care about modeling Turing machines that fit in the universe, since by definition those are the only Turing machines whose output we will ever observe. We might also have to bite the bullet and say we only care about predicting Turing machines which are small enough that there is room left in the universe for us; we don't want to perfectly model the entire universe, just small parts.

If it's uncomputable in general then it's uncomputable for some particular n. I expect that in fact it's uncomputable for any n above some rather small value. (In particular, once n is large enough for there to be n-state Turing machines that unprovably don't always halt.)

Well, if you only care about predicting smallish Turing machines, then for the same reason prediction is only useful if it can be done *with* smallish machines, in reasonable time. And even for very small n this isn't true for calculating BB(n).

Just out of my own curiosity, I checked Wikipedia on Busy Beaver numbers: BB(10) > 3^^^3.

While the article admits that every *finite* sequence (such as *n* = 1...10) is computable, I think it is safe to call it *intractable* right now.

Hey, don't laugh. That's useful to know - now you know any Turing machine with fewer than 10 states is bluffing when it threatens 3^^^^3 specks of sand in eyes.

now you know any Turing machine with fewer than 10 states is bluffing when it threatens 3^^^^3 specks of sand in eyes.

How did you conclude that from a lower bound?

I don't think so, provided you limit yourself to Turing machines constructed of at most a universe's worth of energy and have unlimited time for the computation. However, that said, invoking such a heavy-duty prior seems to me a bit of a cop-out; you might almost as well invoke "My prior is what Omega told me". For choosing between Bayesian and frequentist approaches regarding actual problems in the real brain space available to humans, I prefer priors that can be computed in something like a human lifetime. To show that a Bayesian with a prior that would take multiple billion years to compute can do better than a frequentist is not really a very strong result.

The trick, and the reason for using Turing machines, is that '0000' and '1111' ought to be considered more likely than '0101' and '1010', because they're simpler; and the way this model represent simplicity, a sequence is simple if it's the output of short programs (where 'program' in this case means 'prefix to a Turing machine tape'). If you used the tape directly, then '0000000000' and '0100011011' would be equally likely, which, generally speaking, they're not.

What I keep coming back to here is: doesn't the entire point of this post come down to situations where the parameters in question, the biases of the coins, are not independent? And doesn't this contradict the following?

estimate 100 independent unknown parameters

Which leads me to read the latter half of this post as: we can (in principle, perhaps not computably) estimate 1 complex parameter from 100 data sets better than we can estimate 100 independent unknown parameters from individual data sets. This shouldn't be surprising. I certainly don't find it so.

The first half just points out that in the independent case of this particular example, Bayesian and Frequentist perform equivalently for relatively similar assumptions. But cousin_it made a general claim about the Frequentist approach, so this isn't worth much weight on its own.

How hard is it to "crack" a pseudorandom number generator? Say, one used to generate cards in an online poker game?

In the *very* early days of online poker (late 90's), a group of people did exactly that.

A number of factors combined to make it possible:

- The site published the code for its shuffler, supposedly to demonstrate its competence.
- The RNG was seeded by the number of milliseconds since midnight, a range laughably smaller than the range of possible poker hands.
- Online poker was played for sufficiently low stakes that nobody thought either of the above would be a real problem.
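The second point is the killer: a seed space of 86,400,000 values is trivially searchable. A toy sketch with a hypothetical seeded shuffler (my own stand-in, not the site's actual code, and a smaller seed space to keep it quick):

```python
import random

def deal(seed, hand=5):
    """Hypothetical shuffler seeded the way the old site's was."""
    rng = random.Random(seed)
    deck = list(range(52))
    rng.shuffle(deck)
    return deck[:hand]

SEED_SPACE = 100_000          # stand-in; ms-since-midnight gives 86,400,000
secret_seed = 54_321          # pretend: the server's seed at deal time
observed = deal(secret_seed)  # the attacker sees only their own hand

# Exhaustively search the seed space for seeds reproducing the hand; any
# match reveals the full shuffled deck, i.e. everyone else's cards too.
matches = [s for s in range(SEED_SPACE) if deal(s) == observed]
print(secret_seed in matches)  # True
```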

Nowadays, sites keep their algorithms secret, and the even distribution of hands is constantly being verified by third parties with large data sets. They also harvest large sources of entropy, from user mouse gestures and physical RNGs. See PokerStars' write-up on security.

They also harvest large sources of entropy, from user mouse gestures

So, at least in theory, one could manipulate their RNG a bit by sending back crafted "mouse movement" signals.

One would also need to know the other inputs and their hash function in order to control what the manipulation did. Otherwise, it would be "random".

It needn't be a perfect manipulation to yield something useful.

If the hash function is strong, it doesn't help - though as far as I know there's no good formal definition of the properties the hash function has to have.

One of the properties a cryptographic hash function should have is indeed that, if you add known plaintext to an unknown plaintext in any way, you still can't predict any property of the output that you couldn't without this - you get no additional information.

Of course, we still don't have a clue how to prove this for any interesting hash function. They frequently get broken, because there's basically no mathematical framework behind them.

there's basically no mathematical framework behind them.

This is a long, long, long way from the truth.

Anyone care to post evidence either way?

No need to worry, cipher is right. I'm being horribly inaccurate, as usual. Voted him up for correctness. :-)

What I should have said is that there are no correctness proofs, of the style you'd get for RSA. No guarantees, in other words. There is plenty of math, just no math that lets you prove they have the properties you want them to have; this is doubly annoying because, historically, they will almost certainly turn out later to not have them.

There's plenty of math that guarantees they lack specific vulnerabilities, and far more that attempts to exploit vulnerabilities, just nothing that says there are *no* vulnerabilities; at least, nothing I've heard of. Different hash functions have different levels of security; for example, most block ciphers can technically be used for hashing, hopefully with less chance of trouble than the faster MD5 or SHA hashes.

The fact of the matter is still that, outside of the annoyingly slow RSA family of ciphers, nothing I've heard of in cryptography that's practically usable also has a strong mathematical proof; a problem that is borne out by a long string of theoretical and sometimes practical attacks on the algorithms. It's easy to get disillusioned, but at least we get amusing attack names like "miss-in-the-middle", and SHA-2 is still mostly intact.

And lest someone calls me on it, I should probably point out that RSA does not, strictly speaking, have a security proof either. Its security hinges on a single open question, namely whether prime factorization is slow or not; that's a lot better than the attack surface of other ciphers, but not quite no surface at all, and that's even ignoring the existence of Shor's algorithm. If you have a quantum computer, RSA is broken.

Well, there are still elliptic curves.

of the style you'd get for RSA.

There are no such proofs for RSA either. The best you can hope for with public key crypto is a reduction to a problem believed hard; in the case of RSA, this is the "RSA problem" of finding roots modulo semiprimes of unknown factorization (which is *not* the integer factorization problem). I think in practice we have more faith in the security of our stronger symmetric primitives than we do that the hard problems in public key crypto are hard; this has things backwards.

Incidentally, as far as I can tell there's no good reason ever to use RSA; the Rabin family reduce to a strictly harder problem (factorization) and are faster too.

MD5 and SHA are built around custom block ciphers.

Against quantum computers lots of elliptic curve crypto is also broken. Look up "post-quantum cryptography" for a sampling of the algorithms that are not known to fall; none of the well-known ones are among them.

Another word for a strong pseudorandom number generator is a "stream cipher".

(This isn't *strictly* true, but close enough)
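To illustrate the equivalence: a keystream generator is just a deterministic PRNG seeded by the key, and XORing its output with the plaintext gives you a stream cipher. A toy Python sketch (counter-mode SHA-256 as the generator; the function names are mine, and this is an illustration, not vetted crypto):

```python
import hashlib

def keystream(key: bytes, nbytes: int) -> bytes:
    """Toy keystream: concatenate SHA-256(key || counter) blocks.
    Deterministic in the key, so it doubles as a seeded PRNG."""
    out = b""
    counter = 0
    while len(out) < nbytes:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:nbytes]

def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XOR with the keystream (XOR is self-inverse)."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

ciphertext = xor_cipher(b"secret key", b"attack at dawn")
assert xor_cipher(b"secret key", ciphertext) == b"attack at dawn"
```

Run in "generate only" mode it's a PRNG; run against a plaintext it's a (toy) stream cipher, which is the sense in which the two terms coincide.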

Depends on the generator. Obviously hardware ones cannot be cracked, and even some software ones are thought to be cryptographically secure.

Thank you for introducing me to the "universal prior".

The model I'd have used in explaining a sequence of such coin flips would indeed assume that the machine generates each coin independently as heads-biased with fixed probability H. But rather than fixing H=1/2, I'd have a prior distribution for H (probably uniform).

How superior would the universal prior be to this? I confess I can't do the math.
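For concreteness, the uniform-prior model itself is easy to compute with: a uniform prior on H is Beta(1,1), the posterior after h heads and t tails is Beta(1+h, 1+t), and the predictive probability of heads is (h+1)/(h+t+2), i.e. Laplace's rule of succession. A quick sketch (the helper name is mine):

```python
from fractions import Fraction

def predictive_heads(h: int, t: int) -> Fraction:
    """P(next toss = heads) under a uniform Beta(1,1) prior on the bias H."""
    return Fraction(h + 1, h + t + 2)

print(predictive_heads(0, 0))  # 1/2 before seeing any tosses
print(predictive_heads(9, 1))  # 5/6 after nine heads in ten tosses
```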

Eliezer also wrote a post about the universal prior, although he didn't use that name. See Occam's Razor.

Also thanks for introducing me indirectly to AIXI, the universal artificial intelligence. Apparently, for every pair of time and space bounds (t,l), Hutter has defined an algorithm AIXI(t,l) that performs optimally assuming "the environment is computable". Sounds like an awesome model to compare our own rationality against:

http://www.hutter1.net/ai/paixi.htm

This blog post has some interesting quotes, including one by Eliezer about the fearful unfriendliness of AIXI:

http://www.acceleratingfuture.com/michael/blog/2009/12/computable-aixi-should-we-be-afraid/

Apparently, for every pair of time and space bounds (t,l), Hutter has defined an algorithm AIXI(t,l) that performs optimally assuming "the environment is computable". Sounds like an awesome model to compare our own rationality against:

Meh. I was excited about it at first, but here's the problem: that claim, at least as you've phrased it, is wrong. It would only be correct if you mean *average* performance over *all* computable environments, having an Occamian prior as their only knowledge.

But if you consider agents that operate in *this* computable environment, there are numerous examples that perform better than AIXI-tl. Specifically, all animals do. The reason is that they don't, like AIXI/AIXI-tl, start from tabula rasa, but are born with implicit knowledge about their environments (that is closer to the truth than a mere Occamian prior) and implicit heuristics for exploiting them.

That's why Hutter isn't just running AIXI-tl and outfoxing them all. (pardon the pun)

More generally, there's the important issue that we shouldn't *want* an agent that performs optimally, on average, over all computable environments, if that means sacrificing performance in *this* environment, which it almost certainly does.

More generally, there's the important issue that we shouldn't want an agent that performs optimally, on average, over all computable environments, if that means sacrificing performance in this environment, which it almost certainly does.

Couldn't we patch this easily? Supply the agent with all prior observations. If it's optimal, it'll make better use of the observations than we or animals would.

Or to put it another way: wouldn't AIXI-tl perform optimally on average over all computable environments consistent with our history? That seems desirable.

(Interesting thought: evolution can only discover a few bits every generation, and likely our genomes and cellular environments embody many fewer than that upper limit of a few gigabytes; how many observations would we have to supply a new AIXI-tl before it caught up to us?)
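As a cartoon of "optimal on average over all computable environments consistent with our history": imagine the only hypotheses are repeating bit-patterns, each weighted 2^-length like a crude Occamian prior, and we keep just the ones consistent with what we've seen. (Everything here is my own toy simplification; real AIXI-tl mixes over all programs, not just patterns.)

```python
from itertools import product

def predict_next(history: str, max_len: int = 8) -> float:
    """Toy 'universal' predictor: each hypothesis is a bit pattern repeated
    forever, with prior weight 2**-len(pattern). Keep hypotheses whose stream
    starts with the observed history; return their weighted vote for '1'."""
    num = den = 0.0
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            pattern = "".join(bits)
            stream = pattern * (len(history) // n + 2)
            if stream.startswith(history):
                w = 2.0 ** -n
                den += w
                if stream[len(history)] == "1":
                    num += w
    return num / den

print(predict_next("10101010"))  # 1.0: every consistent pattern continues with 1
```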

Short answer: we can patch it, but not easily.

Longer answer: If we could somehow *represent* the knowledge we have (both explicit and implicit) in a format that integrates nicely with the way the AIXI-approximating program stores its own knowledge, then we could "bring it up to speed" with where we are, and let it learn from there. The knowledge we give it would make it throw out the spurious solutions it would otherwise have to consider.

But representation of our knowledge, especially the important implicit knowledge we never realize we're using, is very hard -- probably AI-complete. So it doesn't reduce the problem. And any knowledge you'd be feeding it would have been found in some *human* way, when finding this knowledge is the very thing you're supposed to be automating! But yes, that would work.

Or to put it another way: wouldn't AIXI-tl perform optimally on average over all computable environments consistent with our history?

I'm not sure, but I don't think so. The requirement that it perform optimally on average (probability weighted) over all computable environments consistent with its history means that it has to pass some of its performance into those other environments, which can still be drastically different.

One thing to keep in mind: Though we like to neatly model beliefs, values, and observations as cleanly separated, Nature is under no obligation to respect the difference -- in nature, these are all blended together, such that the three can only be inferred from the overall dynamics. This is why I want to set up some kind of model of an environment that is constrained by (some analog of) the laws of thermodynamics and chemistry so that we can see the development of (phenomena we could describe as) complexity, intelligence, and life, but without explicitly programming any of those into the model.

If we could somehow represent the knowledge we have (both explicit and implicit) in a format that integrates nicely with the way the AIXI-approximating program stores its own knowledge, then we could "bring it up to speed" with where we are, and let it learn from there.

Too many restrictions there, I think. The format doesn't have to be nice - any format it doesn't know it will know after a fixed-length penalty. We could just dump in the Internet Archive raw.

The requirement that it perform optimally on average (probability weighted) over all computable environments consistent with its history means that it has to pass some of its performance into those other environments, which can still be drastically different.

But those other environments where it performs poorly are the ones that ought to perform poorly: its best performance is reserved for the most likely future histories, just like we would aspire to.

We would perform poorly in an anti-Occamian universe just like it would, but we're far from optimal and so would perform worse in other scenarios, I would think. I suppose we could be so biased and incorrect that we luck out and our biases and errors are just right, but is it plausible that we could luck out enough to overcome the general performance difference?

Too many restrictions there, I think. The format doesn't have to be nice - any format it doesn't know it will know after a fixed-length penalty. We could just dump in the Internet Archive raw.

Alright, the thing I meant by "nice" needs some elaboration. I'll put it this way: For an Ultimate AI, **all** of its knowledge -- with no exceptions -- is in terms of constraints on what it expects to observe. (And yes, this is what rationalists should strive for too.) So there is no, "light is waves", no "that's the Doppler effect". There are only mappings from inputs to probability distributions on future inputs. (Confusingly, this would also mean an expectation for [phenomena explainable as] humans [generating sound waves explainable by them] saying [something we would recognize as the statement] "light is waves". Phew!)

Any large human-generated knowledge base initially appears to AIXI as a long string of characters and/or some input/output black box. What in its inputspace do the characters refer to? What is the appropriate way to group them? Most importantly, after being told that "this stuff is true" or "this is a lot of what goes on in the environment that computes your inputs", how does it know how it maps to the rest of the environment's generating function? (Which I guess is ultimately the same as the first question.)

That problem is nearly as intractable as starting from just an Occamian prior. It's only resolved by symbol-grounding, which means representing knowledge in the form of a probability distribution on observations, in a way your program can understand. Which I think brings you back into the AI-complete realm.

But those other environments where it performs poorly are the ones that ought to perform poorly: its best performance is reserved for the most likely future histories, just like we would aspire to.

Okay, you're right, I wasn't keeping the comparison baselines straight. If you could give the program enough knowledge, in a form it understands or could quickly learn, to distinguish this computable environment from substantively different environments (including its location within it, the relevant history, etc.), then yes, it would make better inferences than humans.

But the point stands that dumping any kind of knowledge base on a computable version of AIXI won't help you a bit until you've done a lot more of the cognitive labor.

That's a good point; in-born priors are much better (as priors) for typical human environments, and plenty of environments we'd like to design robots for.

But consider things like black holes, the relativity of pretty much everything kinematic *except* light speed, and almost all of quantum mechanics. These theories stopped seeming like outright paradoxes to me precisely when I eased up and started assuming only that the universe was probably just computational in some sense, and not necessarily "3D-physical" or "spatial" or "temporal" or whatever. So naturally, I'm pleased to find a well-developing theory of "AIXI", an idealized implementation of this very starting point :)

One way that the universal prior is superior to the one you propose is that it can notice that the machine alternates between producing heads-biased and tails-biased coins.

In this comment, I described another prior (still much easier to use than the universal prior) that could detect this pattern, and also noted another behavior the machine can have that my prior can't notice.

The advantage of the universal prior is that it assigns non-zero probability to all computable behaviors the machine might have, so that it can concentrate probability into that behavior if the machine exhibits it.
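Concretely, here is the kind of comparison involved, idealizing away the noisy tosses and pretending we observe the bias labels directly (a toy sketch of my own, not the universal prior itself):

```python
def likelihood_iid(labels, p_hb=0.5):
    """P(observed bias labels) if the machine picks each coin's bias i.i.d."""
    prob = 1.0
    for lab in labels:
        prob *= p_hb if lab == "HB" else 1 - p_hb
    return prob

def likelihood_alternating(labels):
    """P(observed bias labels) if the machine strictly alternates,
    with either starting phase equally likely."""
    phase1 = all(lab == ("HB" if i % 2 == 0 else "TB") for i, lab in enumerate(labels))
    phase2 = all(lab == ("TB" if i % 2 == 0 else "HB") for i, lab in enumerate(labels))
    return 0.5 * phase1 + 0.5 * phase2

labels = ["HB", "TB"] * 5
print(likelihood_iid(labels))          # 0.5**10, about 0.001
print(likelihood_alternating(labels))  # 0.5: the alternating hypothesis wins
```

A prior that assigns any weight at all to the alternating hypothesis will see its posterior swamp the i.i.d. one after a few coins; the universal prior assigns such weight to every computable hypothesis.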

How superior would the universal prior be to this? I confess I can't do the math.

Far, far superior - Wei Dai did not say that the coins are independently distributed.

How superior would the universal prior be to this? I confess I can't do the math.

A lot, but it is not (particularly) quantifiable. Allowing H to be an unknown constant rather than 0.5 allows for many more possible algorithms, but it is only an iterative step towards 'allow any algorithm'. That, in turn, is not as good as 'allow any algorithm but with a prior distribution based on complexity'.

After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly.

A Bayesian could do better if you allowed it to remember things.

coin lands on heads:
P(headsbiased|heads) = .9*.5 / (.9*.5 + .1*.5) = .9
P(tailsbiased|heads) = .1*.5 / (.9*.5 + .1*.5) = .1
another heads:
P(headsbiased|heads,heads) = .9*.9 / (.9*.9 + .1*.1) ≈ .988
P(tailsbiased|heads,heads) = .1*.1 / (.9*.9 + .1*.1) ≈ .012
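The same update, iterated over a string of tosses (a quick sketch; the function name is mine):

```python
def update(p_hb: float, toss: str) -> float:
    """One Bayesian update of P(heads-biased) on a single observed toss."""
    like_hb = 0.9 if toss == "H" else 0.1  # P(toss | heads-biased)
    like_tb = 0.1 if toss == "H" else 0.9  # P(toss | tails-biased)
    num = like_hb * p_hb
    return num / (num + like_tb * (1 - p_hb))

p = 0.5  # prior
for toss in "HH":
    p = update(p, toss)
print(round(p, 4))  # 0.9878 after two heads
```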

Oops. I meant that for all possible biases towards heads, not always or never. Except that only has 3/4 success.

I'm just saying comparing it to a human is unfair. Humans use feeble subsets of the universal prior that are easy to imitate.

**[deleted]**· 2015-09-17T13:19:43.620Z · score: -3 (3 votes) · LW(p) · GW(p)

I wish people like yourselves would spend your time trying to work on cures for diseases instead of thinking up different ways to explain mundane events

This is a belated reply to

cousin_it's 2009 post Bayesian Flame

Not to be a grammar Nazi, but I believe it should be **cousin_its**....

Orthography is not intuitive. To test my native speaker instinct, I'll pick a case that is. Imagine a user whose name was "Praise_Him". To me, it would be more natural to say "Praise_Him's post" than "Praise_His post"; the former might give me a second's pause, but the latter would make me reread the sentence. Thus, at least the way I use the language, a proper name which incorporates a pronoun is possessivized as a whole, and cousin_it's is correct. But "Its" and "It's" are homophonous, so it wouldn't matter to me much.

The underscore _ signifies italics in whatever-markup-system-this-is: `_emph_`

=> *emph*. You can escape the underscore with a backslash before it: `\_emph\_`

=> _emph_.

Fixed. Thanks.

**[deleted]**· 2013-03-11T13:05:53.934Z · score: 0 (0 votes) · LW(p) · GW(p)

A similar issue is whether to say (*n* + 1)-st or (*n* + 1)-th.

Now I'm going to be arguing about that with myself all day. How do you possessivize a proper noun that ends in the *word* "it"?

I vote for "cousin_it's".

(Yes, that is a period outside of the quotation mark. I am unrepentant.)

I agree with your vote and your punctuation.

"Cousin_it's" is good. It's like eir name is Schmit or something.

I too am behind your punctuation choice.

The ones in the right have nothing to repent for!

(Yes, that is a period outside of the quotation mark. I am unrepentant.)

For similar reasons I would not personally discourage people from (incorrectly) referring to (for example) Clippy's paper clips as "it's". The language is clearly flawed.