Is simplicity truth indicative?
post by 27chaos · 2015-08-04T17:47:14.869Z · LW · GW · Legacy · 50 comments
This essay claims to refute a popularized understanding of Occam's Razor that I myself adhere to. It is confusing me, since I hold this belief at such a deep level that it's difficult for me to examine. Does anyone see any problems in its argument, or does it seem compelling? I specifically feel as though it might be summarizing the relevant machine learning research badly, but I'm not very familiar with the field. It also might be failing to give any credit to simplicity as a general heuristic when simplicity succeeds in a specific field, and it's unclear whether such credit would be justified. Finally, my intuition is that situations in nature where there is a steady bias towards growing complexity are more common than the author claims, and that such tendencies stay stronger for longer. However, for all of this, I have no clear evidence to back up the ideas in my head, just vague notions that are difficult to examine. I'd appreciate someone else's perspective on this, as mine seems to be distorted.
Essay: http://bruce.edmonds.name/sinti/
Comments sorted by top scores.
comment by Manfred · 2015-08-04T20:54:06.518Z · LW(p) · GW(p)
It turns out there's an extremely straightforward mathematical reason why simplicity is to some extent an indicator of high probability.
Consider the list of all possible hypotheses with finite length. We might imagine there being a labeling of this list, starting with hypothesis 1, then hypothesis 2, and continuing on for an infinite number of hypotheses. This list contains the hypotheses capable of being distinguished by a human brain, input into a computer, having their predictions checked against the others, and other nice properties like that. In order to make predictions about which hypothesis is true, all we have to do is assign a probability to each one.
The obvious answer is just to give every hypothesis equal probability. But since there's an infinite number of these hypotheses, that can't work, because we'd end up giving every hypothesis probability zero! So (and here's where it starts getting Occamian) it turns out that any valid probability assignment has to get smaller and smaller as we go to very high numbers in the list (so that the probabilities can all add up to 1). At low numbers in the list the probability is, in general, allowed to go up and down, but hypotheses with very high numbers always have to have low probability.
There's a caveat, though - the position in the list can be arbitrary, and doesn't have to be based on simplicity. But it turns out that it is impossible to make any ordering of hypotheses at all, without having more complicated hypotheses have higher numbers than simpler hypotheses on average.
There's a general argument for this (there's a more specific argument based on universal Turing machines that you can find in a good textbook) that's basically a reflection of the fact that there's a simplest hypothesis, but no "most complex" hypothesis, just like how there's no biggest positive integer. Even if you tried to shuffle up the hypotheses really well, you have to have each simple hypothesis end up at some finite place in the list (otherwise it ends up at no place in the list and it's not a valid shuffling). And if the simple hypotheses are all at finite places in the list, that means there's still an infinite number of complex hypotheses with higher numbers, so at large enough places in the list the hypotheses are complex and must still get low probability.
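To make the normalization point concrete, here is a minimal sketch of one such prior, assuming hypotheses are written as finite binary strings and taking 2^-(2n+1) as an arbitrary illustrative weight for a string of length n; the weights sum to 1, and long (complex) hypotheses necessarily receive tiny probability.

```python
# Minimal sketch: a normalized prior over an infinite list of binary-string
# hypotheses. The weight 2**-(2n+1) is just one arbitrary choice that sums to 1.
from itertools import product

def prior(hypothesis: str) -> float:
    """Probability assigned to a binary-string hypothesis of length n.

    There are 2**n strings of length n, and each gets 2**-(2n+1), so every
    length class contributes 2**-(n+1) and the grand total is 1.
    """
    n = len(hypothesis)
    return 2.0 ** -(2 * n + 1)

# The partial sums approach 1; no assignment of equal probabilities could do this.
total = 0.0
for n in range(15):
    for bits in product("01", repeat=n):
        total += prior("".join(bits))
print(total)  # ~0.99997, and longer hypotheses are stuck with ever-smaller mass
```

The particular weighting is made up for illustration; the only thing the argument needs is that some such decreasing assignment exists and that a constant assignment cannot.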
Replies from: Slider, 27chaos, cousin_it, Lumifer
↑ comment by Slider · 2015-08-04T22:27:19.220Z · LW(p) · GW(p)
Why would the mapping between the hypotheses and the language they are framed in have an impact on which statements are most likely to be true? The article mentions that in domains where the correct hypotheses are complex in the proof language the principle tends to be anti-productive. There is no guarantee that the language is well suited to describe the target phenomenon if we are allowed to freely pick the phenomenon to track!
Wouldn't any finite complexity class also contain only finitely many hypotheses, and wouldn't those also be at finitely numbered indexes in it? The problem only arises for hypotheses of infinite complexity. And it could be argued that if the index is a hyperinteger it can still be a valid placement.
With surreal probabilities it would be no problem to give an equal infinitesimal probability to each hypothesis in an infinite list.
Replies from: Manfred, 27chaos, 27chaos, 27chaos
↑ comment by Manfred · 2015-08-05T01:01:31.748Z · LW(p) · GW(p)
Wouldn't any finite complexity class also contain only finitely many hypotheses
Think of it like the set of all positive integers of finite size. As it turns out, every single integer has finite size! You show me an integer, and I'll show you its size :P But even though each individual element is less than infinity, the size of the set is infinite.
Why would the mapping between the hypotheses and the language they are framed in have an impact on which statements are most likely to be true?
Choosing which language to use is ultimately arbitrary. But because there's no way to assign the same probability to infinitely many discrete things and have the probabilities still add up to one, we're forced into a choice of some "natural ordering of hypotheses" in which the probability is monotonically decreasing. This does not happen because of any specific fact about the external world - this is a property of what it looks like to have hypotheses about something that might be arbitrarily complicated.
The article mentions that in domains where the correct hypotheses are complex in the proof language the principle tends to be anti-productive.
Well... it's anti-productive until you eliminate the simple-but-wrong alternatives, and then suddenly it's the only thing allowing you to choose the right hypothesis out of the list that contains many more complex-and-still-accurate hypotheses.
If you want a much better explanation of these topics than I can give, and you like math, I recommend the textbook by Li and Vitányi, An Introduction to Kolmogorov Complexity and Its Applications.
Replies from: Slider
↑ comment by Slider · 2015-08-06T22:27:18.044Z · LW(p) · GW(p)
9 has 4 digits as "1001" in binary and 1 digit in decimal, so there is no function from integers to their size. There is no such thing as the size of an integer independent of the digit system used (well, you could refer to some set constructions, but then the size would be the integer itself).
As surreals we could have ω pieces of equal probability ɛ that sum to 1 exactly (although ordinal numbers are only applicable to orders, which can be different from cardinal numbers. While for finites there is no big distinction between ordinal and cardinal, "infinitely many discrete things" might refer to a cardinal concept. However, for hypotheses that are listable (such as those formed as arbitrary-length strings of letters from a (finite) alphabet) the ωth index should be well founded).
It is not about arbitrary complexity but probability over infinite options. We could for example order the hypotheses by the amount of negation used first and the number of symbols used second. This would not be any less natural and would result in a different probability distribution. Or arguing that the complexity ordering is the one that produces the "true" probabilities is a reframing of the question whether the simplicity formulation is truth-indicative.
If I use a complexity-ambivalent method I might need to do fewer eliminations before encountering a working one. There is no need to choose from accurate hypotheses if we know that any of them are true. If I encounter a working hypothesis there is no need to search for a simpler form of it. Or if I encounter a theory of gravitation using ellipses, should I continue the search to find one that uses simpler concepts like circles only?
Replies from: 27chaos, 27chaos
↑ comment by 27chaos · 2015-08-11T23:14:28.758Z · LW(p) · GW(p)
I think this is relevant: https://en.wikipedia.org/wiki/Bertrand_paradox_(probability)
The approach of the final authors mentioned on the page seems especially interesting to me. I am also interested to note that their result agrees with Jaynes'. Universalizability seems to be important to all the most productive approaches there.
Or arguing that the complexity ordering is the one that produces the "true" probabilities is a reframing of the question whether the simplicity formulation is truth-indicative.
If the approach that says simplicity is truth-indicative is self-consistent, that's at least something. I'm reminded of the LW sequence that talks about toxic vs healthy epistemic loops.
If I encounter a working hypothesis there is no need to search for a simpler form of it.
This seems likely to encourage overfitted hypotheses. I guess the alternative would be wasting effort on searching for simplicity that doesn't exist, though. Now I am confused again, although in a healthier and more abstract way than originally. I'm looking for where the problem in anti-simplicity arguments lies rather than taking them seriously, which is easier to live with.
Honestly, I'm starting to feel as though perhaps the easiest approach to disproving the author's argument would be to deny his assertion that processes in Nature which are simple are relatively uncommon. Off the top of my head, argument one is replicators, argument two is that simpler processes are smaller and thus more of them fit into the universe than complex ones would, argument three is that the universe seems to run on math (might be begging the question a bit, although I don't think so, since it's kinda amazing that anything more meta than perfect atomist replication can lead to valid inference - again the connection to universalizability surfaces), argument four is an attempt to undeniably avoid begging the question, inspired by Descartes: if nothing else, we have access to at least one form of Nature unfiltered by our perceptions of simplicity: the perceptions themselves, which via anthropic-type induction arguments we should assume-more-than-not to be of more or less average representativeness. (Current epistemic status: playing with ideas very nonrigorously, wild and free.)
↑ comment by 27chaos · 2015-08-11T23:12:05.442Z · LW(p) · GW(p)
I think this is relevant: https://en.wikipedia.org/wiki/Bertrand_paradox_(probability)
The approach of the final authors mentioned on the page seems especially interesting to me. I am also interested to note that their result agrees with Jaynes'. Universalizability seems to be important to all the most productive approaches there.
If I encounter a working hypothesis there is no need to search for a simpler form of it.
This seems likely to encourage overfitted hypotheses.
↑ comment by 27chaos · 2015-08-11T23:02:09.128Z · LW(p) · GW(p)
I think this is relevant: https://en.wikipedia.org/wiki/Bertrand_paradox_(probability)
The approach of the final authors mentioned on the page seems especially interesting to me. I am also interested to note that their result agrees with Jaynes'. Universalizability seems to be important to all the most productive approaches there.
↑ comment by 27chaos · 2015-08-11T23:00:14.104Z · LW(p) · GW(p)
I think this is relevant: https://en.wikipedia.org/wiki/Bertrand_paradox_(probability)#Jaynes.27_solution_using_the_.22maximum_ignorance.22_principle
↑ comment by 27chaos · 2015-08-06T16:35:16.108Z · LW(p) · GW(p)
Thanks for this! Apparently, among many economists Occam's Razor is viewed as just a modelling trick, judging from the conversations I've had on Reddit recently. I'd felt that perspective was incorrect for a while, but after encountering it so many times, and then later being directed to this paper, I'd begun to fear my epistemology was built on shaky foundations. It's a relief to see that's not the case.
It turns out there's an extremely straightforward mathematical reason why simplicity is to some extent an indicator of high probability.
Is there anything ruling out a bias towards simplicity that is extremely small, or are there good reasons to think the bias would be rather large? Figuring out how much predictive accuracy to exchange for theory conciseness seems like a tough problem, possibly requiring some arbitrariness.
↑ comment by cousin_it · 2015-08-05T11:51:12.025Z · LW(p) · GW(p)
That only works if you have a countable set of mutually exclusive hypotheses, and exactly one of them is true. Not all worlds are like that. For example, if the "world" is a single real number picked uniformly from [0,1], then it's hard to say what the hypotheses should be.
If hypotheses aren't restricted to being mutually exclusive, the approach doesn't work. For example, if you randomly generate sentences about the integers in some formal theory, then short sentences aren't more likely to be true than long ones. That leads to a problem if you want to apply Occam's razor to choosing physical theories, which aren't mutually exclusive.
Another reason to prefer the simplest theories that fit observations well is that they make life easier for engineers. Kevin Kelly's Occam efficiency theorem is related, but the idea is really simpler than that.
↑ comment by Lumifer · 2015-08-04T21:06:31.312Z · LW(p) · GW(p)
It turns out there's an extremely straightforward mathematical reason why simplicity is to some extent an indicator of high probability.
And what exactly does that bit of mathematical wankery with infinite lists have to do with trying to figure out which maps are better in our reality? Does it have any practical application?
comment by Slider · 2015-08-04T18:19:04.455Z · LW(p) · GW(p)
If you have information that simplicity works well in the field of application then the success should be attributed to this information rather than simplicity per se. There are no free lunches, and the prior information about the fitness of simple theories is your toll for being able to Occam your way forward.
The point is that two theories on incomparable theory branches can't be ordered with Occam, but one being an elaboration of another (i.e. up or down the same branch) can be. Other uses are better understood as unrelated to this idea but still falsely attributed to it.
Replies from: 27chaos
↑ comment by 27chaos · 2015-08-06T16:42:30.330Z · LW(p) · GW(p)
If you have information that simplicity works well in the field of application then the success should be attributed to this information rather than simplicity per se.
Why? Is this really just an attempt to emphasize that in some domains insisting on simplicity may be counterproductive? While that's true theoretically, I feel like such domains are highly rare in practice, and most people are not overdemanding of simplicity. Thus such an argument feels more like an attempt to carve out in theoretical space a highly applicable get-out-of-jail-free card than an attempt to guide arguments closer to truth.
Replies from: Slider
↑ comment by Slider · 2015-08-06T21:53:48.276Z · LW(p) · GW(p)
It comes from being able to tell which part of the success is due to your method and which part is due to the data that you fed to your method. There was a big listing of domain knowledge assumptions, and this seems like one of the first things to assume about a domain. When one knows the difference between knowledge and assumptions it isn't that hard to take simplicity preference as an assumption instead of a fact, i.e. the proper attribution doesn't really increase one's cognitive workload.
It can be okay to guess that simplicity works well (especially when one knows that the odds are good), but then you are guessing, not knowing.
comment by SilentCal · 2015-08-05T15:44:09.808Z · LW(p) · GW(p)
My reading of the main argument here is that human theories are more likely when simple, not because of any fact about theory-space but because of the nature of humans' theory-generation process. In particular, theories that aren't correct acquire elaborations to make them give the right predictions, a la epicycles. This requires that theories are more likely if they can make correct predictions without lots of elaborations (or else theories with epicycles would be correct as often as those without). But in order for this rule to differ from Occam's Razor, we need to be able to decide what's the core of a theory and what's 'elaboration', so we can penalize only for the latter. I can't see any way to do this, and the author doesn't offer one either. And when you think in a Turing-Machine type formalism where hypotheses are bit strings representing programs, separating the 'core' of the bit-string-program from the 'elaborations' of the bit-string-program doesn't seem likely to succeed.
Replies from: 27chaos
↑ comment by 27chaos · 2015-08-06T01:27:22.412Z · LW(p) · GW(p)
But in order for this rule to differ from Occam's Razor, we need to be able to decide what's the core of a theory and what's 'elaboration', so we can penalize only for the latter. I can't see any way to do this, and the author doesn't offer one either.
I think this is the flaw, thank you. I was very confused for a while.
comment by Viliam · 2015-08-05T10:10:41.618Z · LW(p) · GW(p)
To me it seems there are two different arguments for Occam's Razor.
1) Sometimes relatively short explanations can explain things about our universe. This seems to be a fact about our universe. There could hypothetically exist universes with extremely complicated fundamental laws, but our universe doesn't seem to be one of them.
If the theory of relativity or quantum physics seems complicated to you, imagine universes where the analogous equations would contain thousands or millions of symbols, and couldn't be further simplified; that's what I mean by "extremely complicated".
2) For every explanation "X", you can make any number of explanations of the form "X, plus some additional detail that is difficult to verify quickly". It is better to just remember "X", until one of those additional details is supported by evidence. This does not mean that the additional details must be wrong, it's just... there are millions of possible details, and you wouldn't know which one of them is the right one anyway. Using the simplest option that corresponds to the known facts is more economical.
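One compact way to state point 2 in probability notation (a standard identity, not a claim from the comment): an "X plus extra detail D" variant can never be more probable than plain X.

```latex
P(X \wedge D) \;=\; P(X)\, P(D \mid X) \;\le\; P(X)
```

So the elaborated versions only split X's probability mass among themselves, and remembering plain X costs nothing until some particular detail D earns its keep through evidence.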
comment by hosford42 · 2015-08-12T22:40:02.459Z · LW(p) · GW(p)
You should read up on regularization and the no free lunch theorem, if you aren't already familiar with them.
A theory is a model for a class of observable phenomena. A model is constructed from smaller primitive (atomic) elements connected together according to certain rules. (Ideally, the model's behavior or structure is isomorphic to that of the class of phenomena it is intended to represent.) We can take this collection of primitive elements, plus the rules for how they can be connected, as a modeling language. Now, depending on which primitives and rules we have selected, it may become more or less difficult to express a model with behavior isomorphic to the original, requiring more or fewer primitive elements. This means that Occam's razor will suggest different models as the simplest alternatives depending on which modeling language we have selected. Minimizing complexity in each modeling language lends a different bias toward certain models and against other models, but those biases can be varied or even reversed by changing the language that was selected. There is consequently nothing mathematically special about simplicity that lends an increased probability of correctness to simpler models.
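Here is a toy illustration of that language-dependence; the two "languages" below are invented for the example and are not meant to stand for anything in the essay.

```python
# Toy example: the same phenomenon has very different description lengths
# depending on which primitives the modeling language offers.

target = "ab" * 16  # the observed pattern: "ababab..." (32 characters)

# Language L1: primitives are the single characters 'a' and 'b'.
# The shortest description is the literal string itself.
len_in_L1 = len(target)

# Language L2: adds a "repeat block k times" primitive, so the description
# is the 3-symbol tuple ("repeat", "ab", 16).
description_in_L2 = ("repeat", "ab", 16)
len_in_L2 = len(description_in_L2)

print(len_in_L1, len_in_L2)  # 32 vs 3: which model looks "simplest" depends on the language
```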
That said, there are valid reasons to use Occam's razor nonetheless, and not just the reasons the author of this essay lists, such as resource constraint optimization. In fact, it is reasonable to expect that using Occam's razor does increase the probability of correctness, but not for the reasons that simplicity alone is good. Consider the fact that human beings evolved in this environment, and that our minds are therefore tailored by evolution to be good at identifying patterns that are common within it. In other words, the modeling language used for human cognition has been optimized to some degree to easily express patterns that are observable in our environment. Thus, for the specific pairing of the human environment with the modeling language used by human minds, a bias towards simpler models probably is indicative of an increased likelihood of that model being appropriate to the observed class of phenomena, despite simplicity being irrelevant in the general case of any arbitrary pairing of environment and modeling language.
Replies from: 27chaos
↑ comment by 27chaos · 2015-08-13T00:15:44.205Z · LW(p) · GW(p)
You're speaking as though complexity is measuring the relationship between a language and the phenomena, or the map and a territory. But I'm pretty sure complexity is actually an objective and language-independent idea, represented in its pure form in Solomonoff induction. Complexity is a property that's observed in the world via senses or data input mechanisms, not just something within the mind. The ease of expressing a certain statement might change depending on the language you're using, but the statement's absolute complexity remains the same no matter what. You don't have to measure everything within the terms of one particular language, you can go outside the particulars and generalize.
Replies from: hosford42
↑ comment by hosford42 · 2015-08-13T18:48:37.167Z · LW(p) · GW(p)
you can go outside the particulars and generalize.
You can't get to the outside. No matter what perspective you are indirectly looking from, you are still ultimately looking from your own perspective. (True objectivity is an illusion - it amounts to you imagining you have stepped outside of yourself.) This means that, for any given phenomenon you observe, you are going to have to encode that phenomenon into your own internal modeling language first to understand it, and you will therefore perceive some lower bound on complexity for the expression of that phenomenon. But that complexity, while it seems intrinsic to the phenomenon, is in fact intrinsic to your relationship to the phenomenon, and your ability to encode it into your own internal modeling language. It's a magic trick played on us by our own cognitive limitations.
Complexity is a property that's observed in the world via senses or data input mechanisms, not just something within the mind.
Senses and data input mechanisms are relationships. The observer and the object are related by the act of observation. You are looking at two systems, the observer and the object, and claiming that the observer's difficulty in building a map of the object is a consequence of something intrinsic to the object, but you forget that you are part of this system, too, and your own relationship to the object requires you, too, to build a map of it. You therefore can't use this as an argument to prove that this difficulty of mapping that object is intrinsic to the object, rather than to the relationship of observation.
For any given phenomenon A, I can make up a language L1 where A corresponds to a primitive element in that language. Therefore, the minimum description length for A is 1 in L1. Now imagine another language, L2, for which A has a long description length in L2. The invariance theorem for Kolmogorov complexity, which I believe is what you are basing your intuition on, can be misinterpreted as saying that there is some minimal encoding length for a given phenomenon regardless of language. This is not what that theorem is actually saying, though. What it does say is that the difficulty of encoding phenomenon A in L2 is at most equal to the difficulty of encoding A in L1 and then encoding L1 in L2. In other words, given that A has a minimum description length of 1 in L1, but a very long description length in L2, we can be certain that L1 also has a long description length in L2. In terms of conceptual distance, all the invariance theorem says is that if L1 is close to A, then it must be far from L2, because L2 is far from A. It's just the triangle inequality, in another guise. (Admittedly, conceptual distance does not have an important property we typically expect of distance measures, that the distance from A to B is the same as the distance from B to A, but that is irrelevant here.)
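For reference, the invariance theorem paraphrased above is usually written like this, with K_L(A) standing for the shortest description length of A in language L:

```latex
K_{L_2}(A) \;\le\; K_{L_1}(A) + c_{L_1 \to L_2}
```

Here the constant c_{L1→L2} is the length of an interpreter for L1 written in L2 and does not depend on A, which is exactly the triangle-inequality reading given above.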
Replies from: 27chaos
↑ comment by 27chaos · 2015-08-13T23:36:31.122Z · LW(p) · GW(p)
You can't get to the outside. No matter what perspective you are indirectly looking from, you are still ultimately looking from your own perspective. (True objectivity is an illusion - it amounts to you imagining you have stepped outside of yourself.) This means that, for any given phenomenon you observe, you are going to have to encode that phenomenon into your own internal modeling language first to understand it, and you will therefore perceive some lower bound on complexity for the expression of that phenomenon. But that complexity, while it seems intrinsic to the phenomenon, is in fact intrinsic to your relationship to the phenomenon, and your ability to encode it into your own internal modeling language. It's a magic trick played on us by our own cognitive limitations.
I think my objection stands regardless of whether there is one subjective reality or one objective reality. The important aspect of my objection is the "oneness", not the objectivity, I believe. Earlier, you said:
depending on which primitives and rules we have selected... Occam's razor will suggest different models... Minimizing complexity in each modeling language lends a different bias toward certain models and against other models, but those biases can be varied or even reversed by changing the language that was selected. There is consequently nothing mathematically special about simplicity that lends an increased probability of correctness to simpler models.
But since we are already, inevitably, embedded within a certain subjective modelling language, we are already committed to the strengths and weaknesses of that language. The further away from our primitives we get, the worse a compromise we end up making, since some of the ways in which we diverge from our primitives will be "wrong", making sacrifices that do not pay off. The best we can do is break even, therefore the walk we take away from our primitives is a biased random one, and will drift towards worse results.
There might also be a sense in which the worst we can do is break even, but I'm pretty sure that way madness lies. Defining yourself to be correct doesn't count for correctness, in my book of arbitrary values. A less subjective argument for this view of values: insofar as primitives are difficult to change, when you think you've changed a primitive it's somewhat likely that what you've actually done is increased your internal inconsistency (and coincidentally, thus violated the axioms of NFL).
Whether you call the primitives "objective" or "subjective" is beside the point. What's important is that they're there at all.
comment by MockTurtle · 2015-08-06T09:38:25.411Z · LW(p) · GW(p)
Looking at the machine learning section of the essay, and the paper it mentions, I believe the author to be making a bit too strong a claim based on the data. When he says:
"In some cases the simpler hypotheses were not the best predictors of the out-of-sample data. This is evidence that on real world data series and formal models simplicity is not necessarily truth-indicative."
... he fails to take into account that many more of the complex hypotheses get high error rates than the simpler hypotheses (despite a few of the more complex hypotheses getting the smallest error rates in some cases), which still says that when you have a whole range of hypotheses, you're more likely to get higher error rates when choosing a single complex one than a single simple one. It sounds like he says Occam's Razor is not useful just because the simplest hypothesis isn't ALWAYS the most likely to be true.
Similarly, when he says:
"In a following study on artificial data generated by an ideal fixed 'answer', (Murphy 1995), it was found that a simplicity bias was useful, but only when the 'answer' was also simple. If the answer was complex a bias towards complexity aided the search."
This is not actually relevant to the discussion of whether simple answers are more likely to be fact than complex answers, for a given phenomenon. If you say "It turns out that you're more likely to be wrong with a simple hypothesis when the true answer is complex", this does not affect one way or the other the claim that simple answers may be more common than complex answers, and thus that simple hypotheses may be, all else being equal, more likely to be true than complex hypotheses when both match the observations.
That being said, I am sympathetic to the author's general argument. While complexity (elaboration), when humans are devising theories, tends to just mean more things which can be wrong when further observations are made, this does not necessarily tell us whether natural phenomena are generally 'simple' or not. If you observe only a small (not perfectly representative) fraction of the phenomenon, then a simple hypothesis produced at this time is likely to be proven wrong in the end. I'm not sure if this is really an interesting thing to say, however - when talking about the actual phenomena, they are neither really simple nor complex. They have a single true explanation. It's only when humans are trying to establish the explanation based on limited observation that simplicity and complexity come into it.
Replies from: 27chaos
↑ comment by 27chaos · 2015-08-06T16:25:39.650Z · LW(p) · GW(p)
Did you look up the papers he referenced, then? Or are you speaking just based on your impression of his summaries? I too thought that his summaries were potentially misleading, but I failed to track down the papers he mentioned to verify that for certain.
I'm not sure if this is really an interesting thing to say, however - when talking about the actual phenomena, they are neither really simple nor complex. They have a single true explanation. It's only when humans are trying to establish the explanation based on limited observation that simplicity and complexity come into it.
This perspective is new to me. What are your thoughts on things like Solomonoff induction? It seems to me like that's sufficiently abstract that it requires simplicity to be a meaningful idea even outside the human psyche. I cannot really imagine any thinking-like process that doesn't involve notions of simplicity.
Replies from: MockTurtle
↑ comment by MockTurtle · 2015-08-11T15:07:43.150Z · LW(p) · GW(p)
The first paper he mentions in the machine learning section can be found here, if you'd like to take a look: Murphy and Pazzani 1994. I had more trouble finding the others which he briefly mentions, and so relied on his summary for those.
As for the 'complexity of phenomena rather than theories' bit I was talking about, your reminder of Solomonoff induction has made me change my mind, and perhaps we can talk about 'complexity' when it comes to the phenomena themselves after all.
My initial mindset (reworded with Solomonoff induction in mind) was this: Given an algorithm (phenomenon) and the data it generates (observations), we are trying to come up with algorithms (theories) that create the same set of data. In that situation, Occam's Razor is saying "the shorter the algorithm you create which generates the data, the more likely it is to be the same as the original data-generating algorithm". So, as I said before, the theories are judged on their complexity. But the essay is saying, "Given a set of observations, there are many algorithms that could have originally generated it. Some algorithms are simpler than others, but nature does not necessarily choose the simplest algorithm that could generate those observations."
So then it would follow that when searching for a theory, the simplest ones will not always be the correct ones, since the observation-generating phenomenon was not chosen by nature to necessarily be the simplest phenomenon that could generate those observations. I think that may be what the essay is really getting at.
Someone please correct me if I'm wrong, but isn't the above only kinda valid when our observations are incomplete? Intuitively, it would seem to me that given the FULL set of possible observations from a phenomenon, if you believe any theory but the simplest one that generates all of them, surely you're making irrefutably unnecessary assumptions? The only reason you'd ever doubt the simplest theory is if you think there are extra observations you could make which would warrant extra assumptions and a more complex theory...
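A toy version of the point in the last two paragraphs, with two made-up generating rules rather than anything from the cited papers: both rules reproduce every observation we happen to have, so nothing in the data itself forces nature to have used the shorter one.

```python
# Two candidate "phenomena" that agree on the observed range but diverge later.

def simple_rule(n: int) -> int:
    return n ** 2

def complex_rule(n: int) -> int:
    # The extra term is zero for n = 0..4, so it matches n**2 on everything observed so far.
    return n ** 2 + n * (n - 1) * (n - 2) * (n - 3) * (n - 4)

observed = [simple_rule(n) for n in range(5)]
assert observed == [complex_rule(n) for n in range(5)]  # indistinguishable so far

print(simple_rule(6), complex_rule(6))  # 36 vs 756: they disagree out of sample
```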
Replies from: 27chaos
↑ comment by 27chaos · 2015-08-11T22:50:18.095Z · LW(p) · GW(p)
So then it would follow that when searching for a theory, the simplest ones will not always be the correct ones, since the observation-generating phenomenon was not chosen by nature to necessarily be the simplest phenomenon that could generate those observations. I think that may be what the essay is really getting at.
It might be a difference of starting points, then. We can either start with a universal approach, a broad prior, and use general heuristics like Occam's Razor, then move towards the specifics of a situation, or we can start with a narrow prior and a view informed by local context, to see how Nature typically operates in such domains according to the evidence of our intuitions, then try to zoom out. Of course both approaches have advantages in some cases, so what's actually being debated is their relative frequency.
I'm not sure of any good way to survey the problem space in an unbiased way to assess whether or not this assertion is typically true (maybe Monte Carlo simulations over random algorithms or something ridiculous like that?), but the point that adding unnecessary additional assumptions to a theory is flawed practice seems like a good heuristic argument suggesting we should generally assume simplicity. Does the fact that naive neural nets almost always fail when applied to out-of-sample data constitute a strong general argument against the anti-universalizing approach? Or am I just mixing metaphors recklessly here, with this whole "localism" thing? Simplicity and generalizability are more or less the same thing, right? Or is that question assuming the conclusion once again?
Replies from: MockTurtle
↑ comment by MockTurtle · 2015-08-13T12:03:37.048Z · LW(p) · GW(p)
Does the fact that naive neural nets almost always fail when applied to out-of-sample data constitute a strong general argument against the anti-universalizing approach?
I think this demonstrates the problem rather well. In the end, the phenomenon you are trying to model has a level of complexity N. You want your model (neural network or theory or whatever) to have the same level of complexity - no more, no less. So the fact that naive neural nets fail on out-of-sample data for a given problem shows that the neural network did not reach sufficient complexity. That most naive neural networks fail shows that most problems have at least a bit more complexity than that embodied in the simplest neural networks.
As for how to approach the problem in view of all this... Consider this: for any particular problem of complexity N, there are N - 1 levels of complexity below it, which may fail to make accurate predictions due to oversimplification. And then there's an infinity of complexity levels above N, which may fail to make accurate predictions due to overfitting. So it makes sense to start with simple theories, and keep adding complexity as new observations arrive, and gradually improve the predictions we make, until we have the simplest theory we can which still produces low errors when predicting new observations.
I say low errors because to truly match all observations would certainly be overfitting! So there at the end we have the same problem again, where we trade off accuracy on current data against overfitting errors on future data... Simple (higher errors) versus complex (higher overfitting)... At the end of the process, only empiricism can help us find the theory that produces the lowest error on future data!
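The trade-off in that last paragraph is the usual bias-variance picture; a small sketch of it, assuming numpy is available and using polynomial degree as a stand-in for theory complexity:

```python
# Fit polynomials of increasing degree to a few noisy samples of a simple curve.
# Training error keeps shrinking with degree, but error on held-out points does not.
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)                        # the "phenomenon"
x_train = np.linspace(0, 1, 12)
y_train = true_f(x_train) + rng.normal(0, 0.2, x_train.size)    # noisy observations
x_test = np.linspace(0, 1, 100)                                 # future observations

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - true_f(x_test)) ** 2)
    print(degree, round(train_err, 4), round(test_err, 4))
# Typically the degree-9 fit has near-zero training error but the worst test error.
```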
Replies from: Lumifer
↑ comment by Lumifer · 2015-08-13T15:08:03.482Z · LW(p) · GW(p)
So the fact that naive neural nets fail on out-of-sample data for a given problem shows that the neural network did not reach sufficient complexity.
This is one possibility. Another, MUCH more common in practice, is that your NN overfitted the in-sample data and so trivially failed at out-of-sample forecasting.
To figure out the complexity of the process you're trying to model, you first need to be able to separate features of that process from noise and this is far from a trivial exercise.
Replies from: 27chaos
↑ comment by 27chaos · 2015-08-13T15:12:49.251Z · LW(p) · GW(p)
This is more along the lines of what I was thinking. Most instances of complexity that seem like they're good are in practice going to be versions of overfitting to noise. Or, perhaps stated more concisely and powerfully, noise and simplicity are opposites (information entropy), thus if we dislike noise we should like simplicity. Does this seem like a reasonable perspective?
Replies from: Lumifer
↑ comment by Lumifer · 2015-08-13T15:39:04.189Z · LW(p) · GW(p)
noise and simplicity are opposites (information entropy), thus if we dislike noise we should like simplicity. Does this seem like a reasonable perspective?
Not quite. Noise and simplicity are not opposites. I would say that the amount of noise in the data (along with the amount of data) imposes a limit, an upper bound, on the complexity that you can credibly detect.
Basically, if your data is noisy you are forced to consider only low-complexity models.
Replies from: 27chaos
↑ comment by 27chaos · 2015-08-13T23:22:01.561Z · LW(p) · GW(p)
Can you elaborate on why you think it's a boundary, not an opposite? I still feel like it's an opposite. My impression, from self-study, is that randomness in information means the best way to describe, e.g., a sequence of coin flips is to copy the sequence exactly; there is no algorithm or heuristic that allows you to describe the random information more efficiently, like "all heads" or "heads, tails, heads, tails, etc." That sort of efficient description of information seems identical to simplicity to me. If randomness is defined as the absence of simplicity...
I guess maybe all of this is compatible with an upper bound understanding, though. What is there that distinguishes the upper bound understanding from my "opposites" understanding, that goes your way?
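One concrete way to see the incompressibility intuition, using an off-the-shelf compressor as a rough stand-in for "shortest description" (a crude proxy, not Kolmogorov complexity itself):

```python
# A structured sequence compresses to almost nothing; a (pseudo)random one does not.
import random
import zlib

structured = b"H" * 1000                                    # "all heads"
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(1000))   # stand-in for fair coin flips

print(len(zlib.compress(structured)))  # roughly a dozen bytes
print(len(zlib.compress(noisy)))       # roughly 1000 bytes: no shorter description found
```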
Replies from: Lumifer
↑ comment by Lumifer · 2015-08-14T00:47:50.168Z · LW(p) · GW(p)
Noise is not randomness. What is "noise" depends on the context, but generally it means the part of the signal that we are not interested in and do not care about other than that we'd like to get rid of it.
But we may be talking in different frameworks. If you define simplicity as the opposite (or inverse) of Kolmogorov complexity and if you define noise as something that increases the Kolmogorov complexity then yes, they are kinda opposite by definition.
Replies from: 27chaos↑ comment by 27chaos · 2015-08-14T01:27:25.832Z · LW(p) · GW(p)
I don't think we're talking in different frameworks really, I think my choice of words was just dumb/misinformed/sloppy/incorrect. If I had originally stated "randomness and simplicity are opposites" and then pointed out that randomness is a type of noise, (I think it is perhaps even the average of all possible noisy biases, because all biases should cancel?) would that have been a reasonable argument, judged in your paradigm?
Replies from: Lumifer
↑ comment by Lumifer · 2015-08-14T01:39:44.475Z · LW(p) · GW(p)
We still need to figure out the framework.
In a modeling framework (and we started in the context of neural nets which are models) "noise" is generally interpreted as model residuals -- the part of data that you are unwilling or unable to model. In the same context "simplicity" usually means that the model has few parameters and an uncomplicated structure. As you can see, they are not opposites at all.
In the information/entropy framework simplicity usually means low Kolmogorov complexity and I am not sure what would "noise" mean.
When you say "randomness is a type of noise", can you define the terms you are using?
Replies from: 27chaos
↑ comment by 27chaos · 2015-08-24T18:52:38.949Z · LW(p) · GW(p)
Let me start over.
Randomness is maximally complex, in the sense that a true random output cannot easily be predicted or efficiently described. Simplicity is minimally complex, in that a simple process is easy to describe and its output easy to predict. Sometimes, part of the complexity of a complex explanation will be the result of "exploited" randomness. Randomness cannot be exploited for long, however. After all, it's not randomness if it is predictable. Thus a neural net might overfit its data only to fail at out-of-sample predictions, or a human brain might see faces in the clouds. If we want to avoid this, we should favor simple explanations over complex explanations, all else being equal. Simplicity's advantage is that it minimizes our vulnerability to random noise.
The reason that complexity is more vulnerable to random noise is that complexity involves more pieces of explanation and consequently is more flexible and sensitive to random changes in input, while simplicity uses large important concepts. In this, we can see that the fact that complex explanations are easier to use than simple explanations when rationalizing failed theories is not a mere accident of human psychology; it emerges naturally from the general superiority of simple explanations.
Replies from: Lumifer
↑ comment by Lumifer · 2015-08-24T19:18:40.432Z · LW(p) · GW(p)
Randomness is maximally complex.
I am not sure this is a useful way to look at things. Randomness can be very different. All random variables are random in some way, but calling all of them "maximally complex" isn't going to get you anywhere.
Outside of quantum physics, I don't know what is "a true random output". Let's take a common example: stock prices. Are they truly random? According to which definition of true randomness? Are they random to a superhuman AI?
it's not randomness if it is predictable
Let's take a random variable ~N(0,1), that is, normally distributed with a mean of zero and a standard deviation of 1. Is it predictable? Sure. Predictability is not binary, anyway.
we should favor simple explanations over complex explanations
That's just Occam's Razor, isn't it?
Simplicity's advantage is that it minimizes our vulnerability to random noise.
How do you know what is random before trying to model it? Usually simplicity doesn't minimize your vulnerability, it just accepts it. It is quite possible for the explanation to be too simple, in which case you treat as noise (and so are vulnerable to it) things which you could have modeled by adding some complexity.
complexity ... is more flexible and sensitive to random changes in input
I don't know about that. This is more a function of your modeling structure and the whole modeling process. To give a trivial example, specifying additional limits and boundary conditions adds complexity to a model, but reduces its flexibility and sensitivity to noise.
general superiority of simple explanations
That's a meaningless expression until you specify how simple. As I mentioned, it's clearly possible for explanations and models to be too simple.
Replies from: 27chaos, 27chaos
↑ comment by 27chaos · 2015-09-23T19:11:41.054Z · LW(p) · GW(p)
After giving myself some time to think about this, I think you are right and my argument was flawed. On the other hand, I still think there's a sense in which simplicity in explanations is superior to complexity, even though I can't produce any good arguments for that idea.
Replies from: Lumifer
↑ comment by Lumifer · 2015-09-24T14:39:55.685Z · LW(p) · GW(p)
I would probably argue that the complexity of explanations should match the complexity of the phenomenon you're trying to describe.
Replies from: 27chaos
↑ comment by 27chaos · 2015-11-05T21:58:43.192Z · LW(p) · GW(p)
After a couple months more thought, I still feel as though there should be some more general sense in which simplicity is better. Maybe because it's easier to find simple explanations that approximately match complex truths than to find complex explanations that approximately match simple truths, so even when you're dealing with a domain filled with complex phenomena it's better to use simplicity. On the other hand, perhaps the notion that approximations matter or can be meaningfully compared across domains of different complexity is begging the question somehow.
comment by [deleted] · 2015-08-08T03:48:37.035Z · LW(p) · GW(p)
Since when does Occam's razor have anything to do with the truth? It's just easier to communicate simpler things. It's a communication heuristic, not an 'objective truth heuristic'.
I'd say it leads people to fall into the bias of selective skepticism. I hope that's the right term! I'll chuck in some links to sequence posts soon. Edit: here it is! Though I'm somewhat hesitant to embrace EY's position, since I'd imagine one would end up like that donkey that can't decide whether to eat the bale of hay to the left or to the right without privileging a hypothesis. In summary:
A motivated skeptic asks if the evidence compels them to accept the conclusion; a motivated credulist asks if the evidence allows them to accept the conclusion.
Unexpected find! Yesterday I found lots of gore videos on /b/. I told myself that I should watch them because conventional media was blinding me to the horrid truths of the world and this would make me less biased, more risk aware and more cautious. Instead, I felt really really depressed. I don't know if it preceded or proceeded from /b/, since I don't tend to go there unless I feel somewhat purposeless (or I'm looking for dirty things). Anyway, today I was wondering about whether my suicidality could have come from fear that I would end up tortured or in lots of pain, and whether my suicidal thoughts were avoidance coping from perceived hopelessness and the extreme negative value of painful outcomes. Today I did some research on other people. Avoidance coping is actually associated with decreased suicide/depression risk. Contrary to what everyone says, and what many depressed, suicidal people think, they aren't actually trying to escape from some global negative judgement about the world.
If I had simply gone with the simplest answer, that I was avoidance coping, that wouldn't have been any more true than the answer after more research.
comment by Elo · 2015-08-05T13:34:28.726Z · LW(p) · GW(p)
I notice I am confused. I have updated my understanding of Occam's razor numerous times in the past.
In this case I am helped by wikipedia; " The principle states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected. Other, more complicated solutions may ultimately prove to provide better predictions, but—in the absence of differences in predictive ability—the fewer assumptions that are made, the better." https://en.wikipedia.org/wiki/Occam%27s_razor
That statement would suggest that the essay is entirely not about Occam's Razor but about "simple theories".
Whenever I explain Occam today, I try to describe it like this. Imagine, if you will, a toaster. Bread goes into the "MAGICAL BLACK TOASTER BOX"[1], and toast comes out. As a theory for how a toaster works, it's fine; except that it relies on Magic and an assumption of what that is. For practical purposes, knowing that a toaster just works might be more efficient in life than the real explanation of how a toaster works (electricity passes through coils which heat up due to electrical resistance, which creates enough heat to cook a piece of bread and turn it into toasted bread[2]). However, if you compare the two explanations, one doesn't really explain how a toaster works, and the other leaves a lot less unexplained (sure, electricity is unexplained, but that's a lot less than MAGIC TOASTING BOX). So Occam would suggest that [2] is more likely to be true than [1].
Does this make sense? Have I got my understanding of toasting magic and Occam correct?
Replies from: Viliam
↑ comment by Viliam · 2015-08-06T08:27:24.528Z · LW(p) · GW(p)
I am not sure about this, but it seems to me that the electricity explanation starts to win more clearly when you do multiple experiments. (Also, "simplicity of the explanation" is in the mind, because it means the amount of information in addition to what you already know.)
If you would already know everything about laws of physics, then "magic" would be an implausible explanation; the magical toaster would require you to change your model of the world, and that would be too much work. But let's suppose you know nothing about physics; you are just a poorly educated person with a toaster in your hands. At that moment "magic" may be a better explanation than "electricity"...
But then you do multiple experiments. What happens if you put in a thicker or a thinner slice of bread? Something other than bread? What if you unplug the toaster? The more experiments you do, the more complicated the "magic" explanation becomes, because it must explain why the magic stops working when the toaster is unplugged, etc. (Remember, we are trying to find the simplest explanation that fits the data. A simple wrong model may perfectly fit one data point, but when you have tons of data, the complexity of your model must approach the complexity of the territory.) At some moment, it becomes easier to say "electricity" than to say "magic (which works exactly as electricity would)".
Replies from: Elo
comment by RomeoStevens · 2015-08-04T20:44:54.827Z · LW(p) · GW(p)
From the article.
an inherent bias towards simplicity in the natural world
uhhhh....
Replies from: 27chaos
↑ comment by 27chaos · 2015-08-06T16:45:07.764Z · LW(p) · GW(p)
I think the idea that the world is rather simple is correct. Of course, our notions of simplicity are calibrated to this world, but when you take counterfactuals into consideration I think it's clear basic math works much more often than it imaginably might in a more hostile universe.
Replies from: RomeoStevens
↑ comment by RomeoStevens · 2015-08-06T19:41:32.405Z · LW(p) · GW(p)
I wasn't arguing against the notion. I was pointing out that the author grants the thesis they are arguing against.
comment by Lumifer · 2015-08-04T19:54:01.858Z · LW(p) · GW(p)
Occam's Razor is a heuristic, that is, a convenient rule of thumb that usually gives acceptable results. It is not a law of nature or a theorem of mathematics (no, not an axiom either). It basically tells you what humans are likely to find more useful. It does not tell you -- and does not even pretend to tell you -- what is more likely to be true.
Replies from: torekp