# Significance of Compression Rate Method

post by Daniel_Burfoot · 2010-05-30T03:50:25.401Z · score: 5 (30 votes) · LW · GW · Legacy · 60 comments

Summary: The significance of the Compression Rate Method (CRM) is that it justifies a form of empirical inquiry into aspects of reality that have previously resisted systematic interrogation. Some examples of potential investigations are described. A key hypothesis is discussed, and the link between empirical science and lossless data compression is emphasized.

In my previous post, the protagonist Sophie developed a modified version of the scientific method. It consists of the following steps:

1. Obtain a large database T related to a phenomenon of interest.
2. Develop a theory of the phenomenon, and instantiate the theory as a compression program.
3. Test the theory by invoking the compressor on T and measuring the net codelength achieved (encoded data plus length of compressor).
4. Given two rival theories of the phenomenon, prefer the one that achieves a shorter net codelength.

This modified version preserves two of the essential attributes of the traditional method. First, it employs theoretical speculation, but guides and constrains that speculation using empirical observations. Second, it permits Strong Inference by allowing the field to make decisive comparisons between rival theories.

The key difference between the CRM and the traditional method is that the former does not depend on the use of controlled experiments. For that reason, it justifies inquiries into aspects of empirical reality that have never before been systematically interrogated. The kind of scientific theories that are tested by the CRM depend on the type of measurements in the database target T. If T contains measurements related to physical experiments, the theories of physics will be necessary to compress it. Other types of data lead to other types of science. Consider the following examples:

1. Set up a camera next to a highway, and record the stream of passing cars. To compress the resulting data, you will need to develop a computational understanding of the visual appearance of automobiles. You will need theories of hubcaps, windshields, license plates, car categories, and so on.
2. Position some microphones in the tops of trees and start recording. A major source of variation in the resulting data will be bird vocalization. To compress the data, you will need to find ways to differentiate between bird songs and bird calls, tools to identify species-characteristic vocalizations, and maps showing the typical ranges of various species. In other words, this type of inquiry will be a computational version of the traditional study of bird vocalization carried out by ornithologists.
3. Construct a database by using large quantities of English text. To compress this database you will need an advanced computational understanding of English. You will need dictionaries, rules of grammar, word-sense disambiguation tools, and, more generally, theories of linguistics.
4. Convince Mark Zuckerberg to give you the Facebook image database. One obvious property of this dataset is that it contains an enormous number of faces. To compress it, you will need theories of the appearance of faces. These theories will be highly related to work on face modeling in graphics - see here for example.
5. Generate a huge database of economic data such as home prices, interest and exchange rate fluctuations, business inventories and sales, unemployment and welfare applications, and so on. To compress this database, you will need theories of economics.

It should be emphasized that when in the above list it says "You will need theories of X", this simultaneously means that "You can test and refine theories of X", and "You can prove the superiority of your pet theory of X" by demonstrating the codelengths it achieves on an appropriate dataset. So if you are a linguist and you want to demonstrate the validity of X-bar theory, you build an X-bar compressor and test it on the large text database. If you are an economist and you want to prove the truth of Austrian Business Cycle Theory, you build ABCT into a compressor and invoke it on the economics database. If a theory can't be packaged into a compressor for some real world dataset, then it's probably not scientific anyway (more later on the problem of demarcation).

(It's also worth noting the dedication to truth, and the simultaneous contempt for petty academic affiliation games, indicated by a rigorous adherence to the compression principle. If you develop a new theory of linguistics and use it to set a new record on the benchmark text database, I will hail you as a great linguist. I will publish your papers in my journal, nominate you for awards, and approve your grant applications. It does not matter if you are a teenage college dropout living with your parents.)

The inquiries described in the above list make an important implicit assumption, which can be called the Reusability Hypothesis:

The abstractions useful for practical applications are also useful for compression.

Thus one very practical application in relation to the Facebook database is the detection and recognition of faces. This application depends on the existence of a "face" abstraction. So the hypothesis implies that this face abstraction will be useful for compression as well. Similarly, with regards to the ornithology example, one can imagine that the ability to recognize bird song would be very useful to bird-watchers and environmentalists, who might want to monitor the activity, population fluctuations, and migration patterns of certain species. Here the Reusability Hypothesis implies that the ability to recognize bird-song will also be useful to compress the treetop sound database.

The linguistics example is worth examining because of its connection to a point made in the preface about overmathematization and the distinction of complex deduction vs. complex induction. The field of computational linguistics is highly mathematized, and one can imagine that in principle some complex mathematics might be useful to achieve text compression. But by far the simplest, and probably the most powerful, tool for text compression is just a dictionary. Consider the following sentence:

John went to the liquor store and bought a bottle of _______ .

Now, if a compressor knows nothing about English text, it will have to encode the new word letter-by-letter, for a cost of Nlog(26), where N is the length of the word (I assume N is encoded separately, at a basically fixed cost). But a compressor equipped with a dictionary will have to pay only log(Wn), where Wn is the number of words of length N in the dictionary. This is a substantial savings, since Wn is much smaller than 26^N. Of course, more advanced techniques, such as methods that take into account part of speech information, will lead to further improvement.

The point of the above example is that the dictionary is highly useful, but it does not involve any kind of complex mathematics. Compare a dictionary to the theory of general relativity. Both can be used to make predictions, and so both should be viewed (under my definition) as legitimate scientific theories. And they are both complex, but complex in opposite ways. GR is deductively complex, since it requires sophisticated mathematics to use correctly, but inductively simple, because it requires only a few parameters to specify. In contrast the dictionary is deductively simple, since it can be used by anyone who can read, but inductively complex, since it requires many bits to specify.

Another point made in the preface was that this approach involves empirical science as a core component, as opposed to being primarily about mathematics and algorithm-design (as I consider modern AI to be). Some people may be confused by this, since most people consider data compression to be a relatively minor subfield of computer science. The key realization is that lossless data compression can only be achieved through empirical science. This is because data compression is impossible for arbitrary inputs: no compressor can ever achieve compression rates of less than M bits when averaged over all M-bit strings. Lossless compressors work because they contain an implicit assertion about the type of data on which they will be invoked, and they will fail to achieve compression if that assertion turns out to be false. In the case of image compressors like PNG, the assertion is that, in natural images, the values of adjacent pixels are highly correlated. PNG can exploit this structure to achieve compression, and conversely, the fact that PNG achieves compression for a given image means that it has the assumed structure. In other words, PNG contains an empirical hypothesis about the structure of visual reality, and the fact that it works is empirical evidence in favor of the hypothesis. Now, this pixel-correlation structure of natural images is completely basic and obvious. The proposal, then, is to go further: to develop increasingly sophisticated theories of visual reality and test those theories using the compression principle.

Still, it may not be obvious why this kind of research would be any different from other work on data compression - people are, after all, constantly publishing new types of compression algorithms. The key difference is the emphasis on large-scale compression; this completely changes the character of the problem. To see why, consider the problem of building a car. If you only want to build a single car, then you just hack it together by hand. You build all the parts using machine tools and then fit them together. The challenge is to minimize the time- and dollar-cost of the manual labor. This is an interesting challenge, but the challenge of building ten million cars is entirely different. If you're going to build ten million cars, then it makes sense to start by building a factory. This will be a big up-front cost, but it will pay for itself by reducing the marginal cost of each additional car. Analogously, when attempting to compress huge databases, it becomes worthwhile to build sophisticated computational tools into the compressor. And ultimately the development of these advanced tools is the real goal.

comment by PhilGoetz · 2010-06-01T17:04:23.014Z · score: 8 (8 votes) · LW · GW

I have some gripes with these two posts as being not nearly compressed enough; but I'm shocked by the number of comments disagreeing with the posts, or speculating that compression is not relevant to AI. This is not stuff you can rationally disagree with. Any model useful for AI does compression, period.

A valid criticism would be that we have a scientific methodology built on statistical tests for significance; and to ask for a way to translate compression results into significance tests. The easiest way would be to compress a large number of examples, and perform tests on the mean and standard deviation of the resulting compression ratios.

comment by RichardKennaway · 2010-06-01T22:13:04.661Z · score: 1 (1 votes) · LW · GW

Any model useful for AI does compression, period.

Any model does compression, period. What is the particular relevance to AI? And if that was your downvote on my other comment, how does thinking of AI in terms of compression help to develop AI?

comment by PhilGoetz · 2010-06-02T19:29:07.459Z · score: 4 (4 votes) · LW · GW

I'm afraid it was my downvote. One example of using compression is Hinton's deep autoencoder networks, which work (although he doesn't say this, the math does) by fine-tuning each layer so as to minimize the entropy of the node activations when presented with the items to be learned. In other words: Instead of trying to figure out what features to detect, develop features that compress the original information well. Magically, these features are very good for performing categorization with.

AI was seldom thought of as compression until about 1986. Also, AI wasn't very good until 1986. Pre-1986, learning was ignored, and logic was king. All the pure-logic approaches suffer from combinatorial explosion, because they don't use entropy to enumerate possibilities in order of usefulness. The hard problems of compression were hidden by supplying AI programs with knowledge already compressed into symbols in the appropriate way; but they still didn't work, unless the number of possible actions/inferences was also restricted artificially.

There are people, like Rodney Brooks, who say logic isn't necessary at all. I wouldn't go that far. So, I overstated: There is work to be done in AI that isn't about compression, except in a very abstract way. Lots of work has been done without thinking of it as being compression. But I would say that the hard stuff that gives us problems (categorization, similarity; recognizing, recalling, and managing state-space trajectories) is closely tied to questions of compression.

comment by marks · 2010-06-06T03:41:47.278Z · score: 0 (0 votes) · LW · GW

I have a minor disagreement, which I think supports your general point. There is definitely a type of compression going on in the algorithm, it's just that the key insight in the compression is not to just "minimize entropy" but rather make the outputs of the encoder behave in a similar manner as the observed data. Indeed, that was one of the major insights in information theory is that one wants the encoding scheme to capture the properties of the distribution over the messages (and hence over alphabets).

Namely, in Hinton's algorithm the outputs of the encoder are fed through a logistic function and then the cross-entropy is minimized (essentially the KL divergence). It seems that he's more providing something like a reparameterization of a probability mass function for pixel intensities which is a logistic distribution when conditioned on the "deeper" nodes. Minimizing that KL divergence means that the distribution is made to be statistically indistinguishable from the distribution over the data intensities (since the KL-divergence minimizes expected log likelihood ratio-which means minimizing the power over the uniformly most powerful test).

Minimizing entropy blindly would mean the neural network nodes would give constant output: which is very compressive but utterly useless.

comment by marks · 2010-06-06T04:13:29.746Z · score: 1 (1 votes) · LW · GW

(A text with some decent discussion on the topic)[http://www.inference.phy.cam.ac.uk/mackay/itila/book.html]. At least one group that has a shot at winning a major speech recognition benchmark competition uses information-theoretic ideas for the development of their speech recognizer. Another development has been the use of error-correcting codes to assist in multi-class classification problems (google "error correcting codes machine learning")[http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=error+correcting+codes+machine+learning] (arguably this has been the clearest example of a paradigm shift that comes from thinking about compression which had a big impact in machine learning). I don't know how many people think about these problems in terms of information theory questions (since I don't have much access to their thoughts): but I do know at least two very competent researchers who, although they never bring it outright into their papers, they have an information-theory and compression-oriented way of posing and thinking about problems.

I often try to think of how humans process speech in terms of information theory (which is inspired by a couple of great thinkers in the area), and thus I think that it is useful for understanding and probing the questions of sensory perception.

There's also a whole literature on "sparse coding" (another compression-oriented idea originally developed by biologist but since ported over by computer vision and a few speech researchers) whose promise in machine learning may not have been realized yet, but I have seen at least a couple somewhat impressive applications of related techniques appearing.

comment by RichardKennaway · 2010-06-06T09:13:43.620Z · score: 0 (0 votes) · LW · GW

Thanks to you and PhilGoetz for those references. I have updated my estimate of the subject.

comment by orthonormal · 2010-05-30T04:15:27.912Z · score: 8 (18 votes) · LW · GW

Downvoted for sheer repetitive drudgery, without genuine motivation for the reader. How many chapters before you show us a real-world instance of CRM achieving progress in research?

When Eliezer had a sequence to write, each post was (1) interesting to read on its own, (2) connected to some real event/interesting story/experimental data, and (3) clearly adding something new to the conversation. You would do well to work on those aspects of your writing.

comment by Daniel_Burfoot · 2010-05-30T13:35:14.619Z · score: 1 (3 votes) · LW · GW

For my own clarification, did you think the idea of using CRM to evaluate theories of linguistics was uninteresting, or did you just not believe me that it would work?

comment by orthonormal · 2010-05-30T18:08:19.895Z · score: 2 (4 votes) · LW · GW

The latter. Until I see a real-world case where CRM has been very effective compared to other methods, I'm not going to give much credit to claims that it will achieve greatness in this, that and the other field.

And in particular, I find it extremely unlikely that current major theories of linguistics could in practice be coded into compressors, in a way that satisfies their proponents.

comment by marks · 2010-06-06T04:28:37.982Z · score: 1 (1 votes) · LW · GW

This isn't precisely what Daniel_Burfoot was talking about but its a related idea based on "sparse coding" and it has recently obtained good results in classification:

http://www.di.ens.fr/~fbach/icml2010a.pdf

Here the "theories" are hierarchical dictionaries (so a discrete hierarchy index set plus a set of vectors) which perform a compression (by creating reconstructions of the data). Although they weren't developed with this in mind, support vector machines also do this as well, since one finds a small number of "support vectors" that essentially allow you to compress the information about decision boundaries in classification problems (support vector machines are one of the very few things from machine learning that have had significant and successful impacts elsewhere since neural networks).

The hierarchical dictionaries learned do contain a "theory" of the visual world in a sense, although an important idea is that they do so in a way that is sensitive to the application at hand. There is much left out by Daniel_Burfoot about how people actually go about implementing this line of thought.

comment by SilasBarta · 2010-05-30T15:42:14.751Z · score: 2 (2 votes) · LW · GW

For me, along with everything orthonormal and I said before, the problem here is that your application of CRM to linguistics is just:

"CRM is a good epistemology (which the entire reader base already agrees with). Here's the obvious things you would do if you applied it to linguistics. Don't worry, something new is coming, just wait a few more articles."

comment by Daniel_Burfoot · 2010-05-30T20:32:15.247Z · score: 0 (0 votes) · LW · GW

Don't worry, something new is coming,

It's a new idea to use the CRM to approach computer vision. Nobody in computer vision thinks about large scale lossless compression of natural images. All work in CV is predicated on the idea that good methods can be obtained through pure theorizing and algorithm design. Consider for example this famous CV paper. Note the complex mathematics and algorithm design, and then note that the empirical verification consists of the application of the method to about four images (the reader is supposed to verify that the segmentation result agrees with intuition). Or check out this rival technique which actually uses MDL but doesn't report the actual compression rates, only the segmentation results.

comment by SilasBarta · 2010-05-30T20:53:15.242Z · score: 3 (3 votes) · LW · GW

If that was your point, you could have resolved this in one article whose thesis is just, "Hey, like you already know, MML is a superior method of rationality compared to traditional science. [Here's a review of MML.] Behold, the best work in computer vision ignores it! Look how much better you'd do, just by reading this site!"

You didn't need to go into all the fluff about AI's advances and failures, what's wrong with science, blah blah blah.

(Not to say this doesn't benefit you; you seem to get net positive karma, with the x10 modifier, each time you draw this out to yet another post, despite not presenting anything new.)

Of course, I thought your original motivation for this series was to explain what super-powerful epistemology it is babies must be using to be able to go from a near-blank slate[1] to solving AI-complete problems on limited empirical data, and how we can replicate that. And you haven't even touched that one.

[1] which I, and the best scientific research, dispute is an accurate characterization of babies

comment by Thomas · 2010-05-30T07:13:33.395Z · score: 0 (2 votes) · LW · GW

As almost everything else, much less interesting, wouldn't be said 100 times?

I like it. Probably be cause I mostly agree.

comment by Jonathan_Graehl · 2010-05-30T08:36:28.152Z · score: 3 (3 votes) · LW · GW

Evaluation of AI research is important.

However, I doubt you can even figure out how to use a human-level explanation of the contents of an image to improve lossless image compression rate, so I'd suggest a different route.

There are already many small competitions in AI research - someone creates a dataset/task with a method of evaluating accuracy (usually agreement with human annotation), and others try to do better.

I'd like to see the use of open/common training and test sets eventually lead to a requirement that systems also be published and reproducible by others, or else the research papers about them are rejected. At that point, you can try whatever compression rate method you like on everybody's systems. I'd work toward that, rather than selling CRM.

CRM as you describe it here, if used to decide funding for AI research competitively, would encourage a lot of work in general data compression and not very much AI, at least for a few years.

comment by PhilGoetz · 2010-06-01T16:56:35.038Z · score: 0 (2 votes) · LW · GW

However, I doubt you can even figure out how to use a human-level explanation of the contents of an image to improve lossless image compression rate, so I'd suggest a different route.

Who's interested in lossless compression? Not people interested in AI.

comment by Jonathan_Graehl · 2010-06-01T18:47:12.563Z · score: 1 (1 votes) · LW · GW

Daniel's CRM is about lossless compression, thus my response. I can't tell what you agree or disagree with.

Minimum description length and information theory are both part of the standard AI curriculum. I'd agree that most AI work suggests (and uses in its implementation, e.g. Bloom filters) lossy compression.

comment by SilasBarta · 2010-06-01T19:07:37.678Z · score: 0 (0 votes) · LW · GW

Daniel's CRM is about lossless compression, thus my response.

I don't think the distinction between lossy and lossless is as important in AI as you imply here:

I doubt you can even figure out how to use a human-level explanation of the contents of an image to improve lossless image compression rate, so I'd suggest a different route.

As you will note from my comment here, any lossy method can be trivially converted to a corresponding lossless one by adding on a list of "exceptions" to the "theory". So long as the exceptions are few, the compressed string can still be shorter than the source.

So if you notice a major regularity in the data, using it can still result in a shortening of the message length, even if not all of the data adheres to it. Thus, the constraint that the method be lossless is not significant in this context.

comment by Douglas_Knight · 2010-06-02T00:45:09.176Z · score: 0 (0 votes) · LW · GW

I don't disagree with your breaking a lossless compression into those three parts, but converting a lossy compression scheme to a lossless one seem a poor way to judge it: you condemn jpeg as doing no compression. Yes, you probably could use the same ideas to improve lossless compression, but the 90% compression rate of jpeg means something.

Lossless compression has the advantage of objective judgement, but for any particular purposes we want a lossy compression that throws out the data irrelevant to that purpose. Of course, that depends on the purpose; if you're debugging a camera, you don't want jpeg.

In other words, I endorse lossy compression over lossless compression as a measure of theory. (I think Jonathan Graehl and Phil Goetz agree, I'm not sure.)

comment by SilasBarta · 2010-06-02T01:02:04.209Z · score: 1 (1 votes) · LW · GW

But jpeg is being judged by a slightly different figure of merit, which changes the relevant considerations.

Like other lossy image compression schemes, jpeg is judged based on how much of the source picture it is that a human notices as being lost. So, for that problem, the relevant "source" to be encoded is really the human perception of the image, not the bitstream of the image file.

And really, since all phenomena pass through someone's senses before they're observed, this is the relevant metric for any MML formalism: whether you can come up with a way to briefly describe what you have experienced. It's just that in typical lossless tests, they've eliminated the impact of human cognition by making it possible to compare directly to a specific string and determine if that string can be completely reproduced.

So I don't condemn jpeg or regard it as meaningless; I'm saying it can be directly compared to lossless methods if you either a) penalize it by the amount of data needed to go from the jpeg to the source image, or b) penalize it by the amount of data needed, on average, to restore the image to something a human would not remember differently (which in many cases is very little if any data).

comment by Daniel_Burfoot · 2010-05-30T13:45:53.572Z · score: 0 (0 votes) · LW · GW

However, I doubt you can even figure out how to use a human-level explanation of the contents of an image to improve lossless image compression rate

This is the key assumption. I don't believe it will work, at first, on arbitrary images. But I think it will work on e.g. the roadside video data, where the variation is due primarily to cars.

comment by RobinZ · 2010-05-30T15:01:01.190Z · score: 1 (1 votes) · LW · GW

Not read the article yet, but:

Summary: The significance of the Compression Rate Method (CRM) is that it justifies a form of hard empirical science that does not depend on controlled experimentation.

Like astrophysics, cosmology, evolutionary biology, and geology?

I'm not sure that hard empirical science sans controlled experimentation needs justification.

Edit: I have nothing to say about the article.

comment by JoshuaZ · 2010-05-30T21:07:16.936Z · score: 1 (1 votes) · LW · GW

Evolutionary biology is subject to a lot of controlled experiments. See my reply to Marks below.

comment by Daniel_Burfoot · 2010-05-30T15:19:55.240Z · score: 0 (0 votes) · LW · GW

I see your point and changed the sentence in the summary.

comment by timtyler · 2010-05-30T15:04:02.988Z · score: 0 (0 votes) · LW · GW

Surely those have "controlled experimentation".

comment by marks · 2010-05-30T15:13:16.591Z · score: 2 (2 votes) · LW · GW

Those sciences are based on observations. Controlled experimentation requires that you have some set of experimental units to which you randomly assign treatments. With geology, for instance, you are trying to figure out the structure of the Earth's crust (mostly). There are no real treatments that you apply, instead you observe the "treatments" that have been applied by the earth to the earth. I.e. you can't decide which area will have a volcano, or an earthquake: you can't choose to change the direction of a plate or change the configuration of the plates: you can't change the chemical composition of the rock under large scale, etc.

All one can do is carefully collect measurements, build models of them, and attempt to create a cohesive picture that explains the phenomena. Control implies that you can do more than just collect measurements.

comment by JoshuaZ · 2010-05-30T21:05:56.422Z · score: 3 (3 votes) · LW · GW

Of the examples given, some of them certainly involve controled experiments in the classical sense. Evolutionary biology for example involves tests of genetic drift and speciation in the lab environment. For example, one matter that has been extensively tested in labs is different speciation mechanisms. The founder-effect mechanism is one that is particularly easy to test in a lab. For one major paper on the subject see this paper. A much older example is speciation by hybridization which has been tested in controlled lab environments for about a century now. The oldest I'm aware of in that regard is a 1912 paper by Digby (I haven't read it, and I'd have to go look up the citation but it shouldn't be hard to find ), and there have been many papers since then on the same topic.

Edit:Citation for Digby according to TOA is Digby, L. 1912. The cytology of Primula kewensis and of other related Primula hybrids. Ann. Bot. 26:357-388.

comment by marks · 2010-05-30T22:21:00.476Z · score: 1 (1 votes) · LW · GW

All the sciences mentioned above definitely do rely on controlled experimentation. But their central empirical questions are not amenable to being directly studied by controlled experimentation. We don't have multiple earths or natural histories upon which we can draw inference about the origins of species.

There is a world of difference between saying "I have observed speciation under these laboratory conditions" and "speciation explains observed biodiversity". These are distinct types of inferences. This of course does not mean that people who perform inference on natural history don't use controlled experiments: indeed they should draw on as much knowledge as possible about the mechanisms of the world in order to construct plausible theories of the past: but they can't run the world multiple times under different conditions to test their theories of the past in the way that we can test speciation.

comment by timtyler · 2010-05-30T15:20:42.798Z · score: 2 (2 votes) · LW · GW

I think that is a non-standard interpretation of the terminology:

"A controlled experiment generally compares the results obtained from an experimental sample against a control sample, which is practically identical to the experimental sample except for the one aspect whose effect is being tested (the independent variable)."

http://en.wikipedia.org/wiki/Experiment#Controlled_experiments

It never says the control sample has been influenced by the experimenter. It could instead be chosen by the experimenter - from the available spectrum of naturally-occurring phenomena.

comment by marks · 2010-05-30T20:21:57.938Z · score: 0 (0 votes) · LW · GW

I think it's standard in the literature: "The word experiment is used in a quite precise sense to mean an investigation where the system under study is under the control of the investigator. This means that the individuals or material investigated, the nature of the treatments or manipulations under study and the measurement procedures used are all settled, in their important features at least, by the investigator." The theory of the design of experiments

To be sure there are geological experiments where one, say, takes rock samples and subjects various samples to a variety of treatments, in order to simulate potential natural processes. But there is another chunk of the science which is meant to describe the Earth's geological history and for a controlled experiment on that you would need to control the natural forces of the Earth and to have multiple Earths.

The reason why one needs to control an experiment (this is a point elaborated on at length in Cox and Reid) is in order to prevent bias. Take the hypothesis of continental drift. We have loads of "suspicious coincidences" that suggest continental drift (such as similar fossils on different landmasses, certain kinds of variations in the magnetic properties of the seafloor, the fact that the seafloor rocks are much younger than land rocks, earthquake patterns/fault-lines). Critically, however, we don't have an example of an earth that doesn't have continental drift. It is probably the case that some piece of "evidence" currently used to support the theory of continental drift will turn out to be a spurious correlation. Its very difficult to test for these because of the lack of control. The fact that we are almost certainly on a continental-drifting world biases us towards think that some geological phenomenon is caused by drift even when they not.

comment by timtyler · 2010-05-30T20:54:23.468Z · score: 0 (0 votes) · LW · GW

Natural experiments are experiments too. See:

http://en.wikipedia.org/wiki/Natural_experiment

http://en.wikipedia.org/wiki/Experiment

http://dictionary.reference.com/browse/experiment

I think the usage in the cited book is bad and unorthodox. E.g. one can still study storms experimentally - though nobody can completely control a storm.

comment by marks · 2010-05-30T22:09:34.881Z · score: 2 (2 votes) · LW · GW

I think we are talking past each other. I agree that those are experiments in a broad and colloquial use of the term. They aren't "controlled" experiments: which is a term that I was wanting to clarify (since I know a little bit about it). This means that they do not allow you to randomly assign treatments to experimental units which generally means that the risk of bias is greater (hence the statistical analysis must be done with care and the conclusions drawn should face greater scrutiny).

Pick up any textbook on statistical design or statistical analysis of experiments and the framework I gave will be what's in there for "controlled experimentation". There are other types of experiments. But these suffer from the problem that it can be difficult to sort out hidden causes. Suppose we want to know if the presence of A causes C (say eating meat causes heart disease). In an observational study we find units having trait A and those not (so find meat-eaters and vegetarians) and we then wait to observe response C. If we observe a response C in experimental units possessing trait A, its hard to know if A causes C or if there is some third trait B (present in some of the units) which causes both A and C.

In the case of a controlled experiment, A is now a treatment and not a trait of the units (so in this case you would randomly assign a carnivorous or vegetarian diet to people), thus we can randomly assign A to the units (and assume the randomization means that not every unit having hidden trait B will be given treatment A). In this case we might observe that A and C have no relation, whereas in the observational study we might. (For instance people who choose to be vegetarian may be more focused on health)

An example of how econometricians have dealt with "selection bias" or the fact that observation studies fail to have certain nice properties of controlled experiments is here

comment by RobinZ · 2010-05-30T15:08:19.096Z · score: 0 (0 votes) · LW · GW

I believe they borrow results from fields that do have controlled experimentation, but they don't have a great deal on their own - they concern the analysis of historical data, I believe. Unless "search this region of sky to see whether Planet X is there" counts as a controlled experiment.

I'm not an authority on any of these four fields, though - am I mistaken?

comment by timtyler · 2010-05-30T15:33:49.484Z · score: 0 (0 votes) · LW · GW

It's a "controlled experiment" if there's a sample which is used as a control. Whether that sample is under the control of the experimenter (or arises naturally) is a totally different issue:

http://en.wikipedia.org/wiki/Natural_experiment

http://en.wikipedia.org/wiki/Experiment#Controlled_experiments

comment by RobinZ · 2010-05-30T16:35:57.024Z · score: 0 (0 votes) · LW · GW

Fair enough - the author has changed the wording of the main article in any case.

comment by RichardKennaway · 2010-06-01T09:27:06.268Z · score: 0 (2 votes) · LW · GW

when attempting to compress huge databases, it becomes worthwhile to build sophisticated computational tools into the compressor. And ultimately the development of these advanced tools is the real goal.

I don't see why thinking of the problem as a compression task will cast any light on the problem of building these advanced tools. Especially when, as you already pointed out in your examples, the concepts needed to build these tools are not discoverable from the data -- or at least, you are envisaging them as not being discovered from the data.

comment by [deleted] · 2010-05-31T07:10:00.452Z · score: 0 (0 votes) · LW · GW

Let's take the stock market as an example. The stock market prices are in principle predictable, only not from the data itself but from additional data taken from the newspapers or other sources. How does the CRM apply if the data does not in itself contain the neccessary information?

Let's say I have a theory that cutting production costs will increase stock prices in relation to the amount of cost cut and the prominence of the company and the level of fear of a crash on the stock market and the level of a "bad news indicator" that is a weighted sum of bad press for the company in the past. How would I test my theory with CRM?

comment by Jonathan_Lee · 2010-05-31T07:48:26.228Z · score: 0 (0 votes) · LW · GW

In the wider sense, MML still works on the dataset {stock prices, newspapers, market fear}. Regardless of what work has presently been done to compress newspapers and market fear, if your hypothesis is efficient then you can produce the stock price data for a very low marginal message length cost.

You'd write up the hypothesis as a compressor-of-data; the simplest way being to produce a distribution over stock prices and apply arithmetic coding, though in practice you'd tweak whatever state of the art compressors for stock prices exist.

Of course the side effect of this is that your code references more data, and will likely need longer internal identifiers on it, so if you just split the cost of code across the datasets being compressed, you'd punish the compressors of newspapers and market fear. I would suggest that the solution is to deploy shapely value, with the value being the number of bits saved overall by a single compressor working on all the data sets in a given pool of cooperation.

comment by ocr-fork · 2010-05-31T07:36:05.558Z · score: 0 (2 votes) · LW · GW

First, infer the existence of people, emotions, stock traders, the press, factories, production costs, and companies. When that's done your theory should follow trivially from the source code of your compression algorithm. Just make sure your computer doesn't decay into dust before it gets that far.

comment by Houshalter · 2010-05-30T15:11:41.692Z · score: 0 (2 votes) · LW · GW

1.Set up a camera next to a highway, and record the stream of passing cars. To compress the resulting data, you will need to develop a computational understanding of the visual appearance of automobiles. You will need theories of hubcaps, windshields, license plates, car categories, and so on.

How does that work? The data would be rediculously complex. If you want to sacrifice quality, you could just convert everything to 3d models and compress that, but that would have alot of error compared to an original video. If you were to compress just the video, knowledge about cars wouldn't do any good because there are far to many outliers, the pixel values. Now there might be an complex way to do it, assume the general shapes of objects and limit the possible variation of pixels in those areas, but even then there are to many exceptions to be useful. And how do you compress faces? You might be able to create a method that can identify a face or a specific face from something else, but then its not compression, its object recognition. The same with your bird example which might even be more complex then the car example. And if you decide to not overfit, this sentence could be compressed by "its all made of letters and the most common letter is e, so therefore it is all E's."

comment by Morendil · 2010-05-30T15:19:56.087Z · score: 2 (2 votes) · LW · GW

As the OP says "the key difference is the emphasis on large-scale compression".

Something with 3D models eventually compresses a lot better than recording video, over a long period of time, for the same reason video games are made with 3D models rather than by recording every possible game scenario as a precomputed animation.

comment by Houshalter · 2010-05-30T15:38:12.058Z · score: 1 (3 votes) · LW · GW

But there is always going to be error. Games don't look realistic because of this method. On the video, you could have a spot on the window where it instantly turns blue for no reason. What I'm trying to say is theres to much static, to many outliers. If you were to take a thousand pictures of someone, no two pictures would be the same, even though they appear to be.

comment by SilasBarta · 2010-05-30T19:25:33.323Z · score: 3 (3 votes) · LW · GW

Along with what rwallace said, it might be a good idea to point out that the total message length actually breaks down into three parts:

1) Specification of the encoding: in info-theoretical terms, this maps more common, long strings (observations) to shorter symbols. It plays the role of the theory, in that it implicitly says what observations are most likely. This part tells you how to "decompress" parts 2 and 3 into their original form.

2) The compressed string: this lists all the observed data, using the shortcuts given in 1). Common occurrences are easily described by short strings.

3) The "exceptions" to the theory -- the observations that aren't easily compressed into 2). The longer strings are reserved for these, since they occur the least.

So yes, even if you can't reach perfection, any strong regularity found in the data will allow you to compress it, even if it's a rule with part-3 exceptions.

Daniel's Reusability hypothesis is correct: any insight you gain into a particular domain can be equivalently expressed as a way to compress descriptions of that domain.

However, let me again register my annoyance with this series, not on grounds that it's wrong (so far), but that it's just an exposition of the well-accepted minimum message length formalism. It may not be accepted by scientists, but it's accepted here, and sciences do implicitly use it when they can't do controlled experiments (which are not strictly necessary for science anyway), such as in astronomy and geology.

comment by Houshalter · 2010-05-30T21:15:39.280Z · score: 0 (0 votes) · LW · GW

First of all, compression doesn't have to be limited to just commonly occuring sets. For example, if I create a fractal, you don't have to find out commonly used colors or pixels that occur near each other, but just create a program which generates the exact same fractal. When you do this you add another element to the system - time. How much time do you spend trying to compress something, or vice versa, decompressing it? And no compression system can garuntee that every observed instance can be compressed to less then its original size. Infact, there are always going to be cases where compressing it actually increases the size of the file, possibly significantly, but at least just a few bits.

comment by SilasBarta · 2010-05-30T21:29:55.457Z · score: 0 (0 votes) · LW · GW

First of all, compression doesn't have to be limited to just commonly occuring sets. For example, if I create a fractal, you don't have to find out commonly used colors or pixels that occur near each other, but just create a program which generates the exact same fractal.

That kind of regularity can still be explained in the language of commonly occurring sets: For a fractal, you reserve a symbol for the overall structure, which is common (in that something like it occurs frequently), and you have a symbol to instantiate it, given a particular position, to what scale. Time is handled similarly.

Compression amounts to saying: "This is the structure of the data. Now, here's the remaining data to full in any uncertainty not resolved by its structure." The structure tells you what long symbols you can substitute with short symbols.

And no compression system can garuntee that every observed instance can be compressed to less then its original size.

True, compression only attempts to guarantee that, for the function generating the data, you save, on average, by using a certain compression scheme. If the generating function has no regularity, then you won't be able to compress, even on average.

comment by Houshalter · 2010-05-30T22:20:31.492Z · score: 0 (0 votes) · LW · GW

That kind of regularity can still be explained in the language of commonly occurring sets: For a fractal, you reserve a symbol for the overall structure, which is common (in that something like it occurs frequently), and you have a symbol to instantiate it, given a particular position, to what scale. Time is handled similarly.

Yes but if you have a symbol for every possible fractal, it defeats the purpose. The point was to use time as a memory resource. This also works in reverse with caching and other methods, you can use memory as if it were time. Both, however, only work up to a point.

comment by rwallace · 2010-05-30T18:28:56.646Z · score: 3 (3 votes) · LW · GW

Yes, but if your compressed data contains both a 3-D model and a stream of outlier pixels (to be applied as delta to the result of rendering the model), this is still going to be smaller than if you didn't have a 3-D model and just supplied the video as a pixel stream.

Mind you, if what you want is machine vision, I think in practice you will do better to work on machine vision directly than on video compression. But the compression approach is valid as a matter of mathematical principle.

comment by Daniel_Burfoot · 2010-05-30T18:53:41.117Z · score: 3 (3 votes) · LW · GW

Right - the delta correction has lower entropy than the raw pixel stream. Of course, you have to use a good 3-D model!

Mind you, if what you want is machine vision, I think in practice you will do better to work on machine vision directly than on video compression.

Are you familiar with the field of machine vision? If not, two quick points: 1) the evaluation metrics it employs are truly terrible. 2) Almost all computer vision tasks can be reformulated as specialized image compression tricks (e.g. the stereo correspondence problem can be reformulated as a way of compressing the second image in a stereo pair by predicting the pixels using the first image and a disparity function).

comment by rwallace · 2010-05-30T23:56:47.921Z · score: 0 (0 votes) · LW · GW

I'll admit I'm not particularly familiar with machine vision. I was extrapolating from other areas such as natural language processing, where trying to formulate problems in terms of compression isn't particularly helpful, but if an expert on machine vision says it is helpful in that field, fair enough.

comment by PhilGoetz · 2010-06-01T16:54:44.674Z · score: 2 (2 votes) · LW · GW

natural language processing, where trying to formulate problems in terms of compression isn't particularly helpful

Why do you think that? Statistical natural language processing has been very successful, both for recognition and translation of text, and for speech recognition.

comment by rwallace · 2010-06-01T19:34:42.427Z · score: 0 (0 votes) · LW · GW

Statistical natural language processing yes, but that's not the same thing as text compression (again, there is a mathematical sense in which one could consider them related in principle, but doing text compression and doing natural language processing were still separate activities last I heard).

comment by timtyler · 2010-05-30T08:02:15.100Z · score: -1 (5 votes) · LW · GW

Re: "The key realization is that lossless data compression can only be achieved through empirical science. This is because data compression is impossible for arbitrary inputs: no compressor can ever achieve compression rates of less than M bits when averaged over all M-bit strings."

Well, you only need one empirical observation to build a general purpose compressor - that inputs are more likely than not to be the result of executing short programs. Aftter noticing that you can obtain impressive compression ratios while ignoring the real world completely.

comment by Daniel_Burfoot · 2010-05-30T18:59:05.438Z · score: 0 (0 votes) · LW · GW

Aftter noticing that you can obtain impressive compression ratios while ignoring the real world completely.

Want to bet? :-)

An implicit point of my whole story is the rejection of the idea that good compression rates can be achieved using theory alone. I suppose you can disagree with this, but the history of science seems to be on my side - science never really started making progress until people started systematically checking their theories against empirical data.

comment by timtyler · 2010-05-30T19:48:00.594Z · score: -1 (1 votes) · LW · GW

Why bet? This is a maths debate - and not a very complex one.

comment by Thomas · 2010-05-30T14:27:19.499Z · score: -1 (1 votes) · LW · GW

Aftter noticing that you can obtain impressive compression ratios while ignoring the real world completely.

You don't ignore the real world, if you compress the real world data well. At least you were lucky and guessed its physics.

If you haven't, you can't compress.

comment by timtyler · 2010-05-30T15:00:53.629Z · score: 0 (0 votes) · LW · GW

You don't need to use real-word data in order to build your compressor. You can generate test data yourself using short programs.

comment by Thomas · 2010-05-30T15:06:32.990Z · score: -2 (2 votes) · LW · GW

In that case, you have compressed your "short programs" outputs. They are NOT unrelated to the real world as well, since you wrote them and you belong to the real world.

comment by timtyler · 2010-05-30T15:14:46.721Z · score: 1 (1 votes) · LW · GW

I didn't say they were "related to the real world" in the first place, though.

What I mean is that you can build your Occam-based compressor without using any sensors that sample the world - outside of a closed computer programming domain which includes you and a computer.

comment by Thomas · 2010-05-30T15:39:59.794Z · score: -2 (2 votes) · LW · GW

outside of a closed computer programming domain which includes you and a computer

It is NOT a closed domain. It is very much opened toward the real world. Since we belong there, we are a part of it.

comment by timtyler · 2010-05-30T15:50:30.878Z · score: 1 (1 votes) · LW · GW

Thomas, you do not seem to be doing "sympathetic reading". As a result, you are ignoring what I am trying to say, and are instead focussing on irrelevant trivia. The result seems to be of pretty minimal interest in this case.