Paradigm shifts in forecasting
post by VipulNaik · 2014-05-08T19:38:11.822Z · LW · GW · Legacy · 6 comments
This post was written in connection with work I'm doing for the Machine Intelligence Research Institute (MIRI), but it hasn't been formally vetted by MIRI. I'm posting this to LessWrong because of its potential interest to a segment of the LessWrong readership. As always, all thoughts are appreciated.
In this post, I'll try to apply some of the scientific theory of paradigm shifts to the domain of forecasting. In a sense, all of science is about making (conditional) predictions about the behavior of systems. Forecasting simply refers to the act of making predictions about the real-world future rather than about a specific controlled experimental setup. So while the domain of forecasting is far more restricted than the domain of science, we can still apply the conceptual framework of paradigm shifts in science to forecasting.
Thomas Kuhn and paradigm shifts
Thomas Kuhn's book The Structure of Scientific Revolutions (Amazon, Wikipedia) provides a detailed descriptive theory of the nature of paradigm shifts in science. Quoting from the Wikipedia page on paradigm shifts:
An epistemological paradigm shift was called a "scientific revolution" by epistemologist and historian of science Thomas Kuhn in his book The Structure of Scientific Revolutions.
A scientific revolution occurs, according to Kuhn, when scientists encounter anomalies that cannot be explained by the universally accepted paradigm within which scientific progress has thereto been made. The paradigm, in Kuhn's view, is not simply the current theory, but the entire worldview in which it exists, and all of the implications which come with it. This is based on features of landscape of knowledge that scientists can identify around them.
There are anomalies for all paradigms, Kuhn maintained, that are brushed away as acceptable levels of error, or simply ignored and not dealt with (a principal argument Kuhn uses to reject Karl Popper's model of falsifiability as the key force involved in scientific change). Rather, according to Kuhn, anomalies have various levels of significance to the practitioners of science at the time. To put it in the context of early 20th century physics, some scientists found the problems with calculating Mercury's perihelion more troubling than the Michelson-Morley experiment results, and some the other way around. Kuhn's model of scientific change differs here, and in many places, from that of the logical positivists in that it puts an enhanced emphasis on the individual humans involved as scientists, rather than abstracting science into a purely logical or philosophical venture.
When enough significant anomalies have accrued against a current paradigm, the scientific discipline is thrown into a state of crisis, according to Kuhn. During this crisis, new ideas, perhaps ones previously discarded, are tried. Eventually a new paradigm is formed, which gains its own new followers, and an intellectual "battle" takes place between the followers of the new paradigm and the hold-outs of the old paradigm. Again, for early 20th century physics, the transition between the Maxwellian electromagnetic worldview and the Einsteinian Relativistic worldview was neither instantaneous nor calm, and instead involved a protracted set of "attacks," both with empirical data as well as rhetorical or philosophical arguments, by both sides, with the Einsteinian theory winning out in the long run. Again, the weighing of evidence and importance of new data was fit through the human sieve: some scientists found the simplicity of Einstein's equations to be most compelling, while some found them more complicated than the notion of Maxwell's aether which they banished. Some found Eddington's photographs of light bending around the sun to be compelling, while some questioned their accuracy and meaning. Sometimes the convincing force is just time itself and the human toll it takes, Kuhn said, using a quote from Max Planck: "a new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it."[1]
After a given discipline has changed from one paradigm to another, this is called, in Kuhn's terminology, a scientific revolution or a paradigm shift. It is often this final conclusion, the result of the long process, that is meant when the term paradigm shift is used colloquially: simply the (often radical) change of worldview, without reference to the specificities of Kuhn's historical argument.
Simple methods in science are often "good enough" until you want a much higher "resolution"
It's worth noting that most paradigm shifts move from simpler, more tractable models to more complicated ones. Initially, the scientific theory is not trying to explain the real world at too fine a resolution, and therefore it is tolerant of large errors. The theories in vogue initially are the simplest among those that can explain the world within the generous margin of error. Over time, as measurement becomes more precise and accurate, and as understanding or engineering at higher precision levels becomes more important, the focus shifts to finding a model with lower error rates, accepting a possible increase in model complexity.
Consider the following examples:
- Classical mechanics is an inferior paradigm to relativistic mechanics, and the distinction begins to matter once we are operating at high speeds or large length and time scales. But for many practical purposes, classical mechanics is still good enough, and the relativistic correction terms don't improve accuracy much. Historically, the accuracy and precision of measurement, and the required accuracy and precision for technological and engineering purposes, were not high enough to give relativity much of an added advantage in predictive power relative to classical mechanics. This started changing in the late 19th century, as the framework for electromagnetism was laid out and the discrepancies in Mercury's perihelion came to be seen as too large to be explained by measurement error. In today's world, classical mechanics suffices for most purposes, but relativity is crucial in some cases: it's important in space travel, satellite launches, and global positioning system (GPS)-based navigation. GPS is essential to transportation and communication in today's world. And of course, the whole idea of nuclear energy and nuclear bombs was an offshoot of Einstein's equation E = mc², part of the theory of relativity.
- Classical mechanics is an inferior paradigm to quantum mechanics, and the distinction begins to matter once we are operating at sufficiently small length scales. But for many practical purposes, classical mechanics is still good enough. In some cases, particularly for the behavior of atomic particles, we create classical mechanics-like theories that predict phenomena fairly similar to the actual ones predicted by quantum mechanics, but are more analytically tractable. Again, historically, classical mechanics has been good enough at the macro scale that people have dealt with. But once people wanted a stronger foundation for behavior at the atomic and subatomic scale that could be used in understanding chemistry properly, the deficiencies of classical mechanics became clear. In today's world, quantum mechanics underlies quantum chemistry, which in turn is crucial for understanding biochemistry and other small-scale phenomena. It also forms the basis of quantum computing, though the commercial feasibility of the quantum computing paradigm is still uncertain.
- For predicting the structure of molecules, VSEPR theory in chemistry is simple and easy to work with, but far less correct than molecular orbital theory. In most simple contexts, VSEPR theory is a great place to start. But molecular orbital theory is what's needed to get completely correct answers. Often, intermediate theories are used to incorporate some aspects from molecular orbital theory without getting the whole package.
Complexity and paradigms in the context of forecasting
There are often competing methods for forecasting a given indicator. The methods vary considerably in complexity. For instance, persistence is one of the simplest forecasting methods: persistence of levels means that tomorrow will be the same as today, whereas persistence of trends means that the difference between tomorrow and today equals the difference between today and yesterday. Somewhat more sophisticated than simple persistence are the various forms of linear regression that are well-suited to time series and that tackle the problems of both periodic fluctuation and noise. More sophisticated methods allow for functional forms obtained by additive or multiplicative combination, or composition, of the functional forms used in simpler methods.
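The two persistence rules described above can be sketched in a few lines. This is a minimal illustration, not code from any forecasting library: the function names and the sample observations are hypothetical.

```python
def persistence_of_levels(series):
    """Forecast the next value as equal to the latest observed value."""
    return series[-1]


def persistence_of_trends(series):
    """Forecast the next value by repeating the latest one-step change."""
    return series[-1] + (series[-1] - series[-2])


# Hypothetical observations: day before yesterday, yesterday, today.
observations = [10.0, 12.0, 15.0]

level_forecast = persistence_of_levels(observations)  # tomorrow = today
trend_forecast = persistence_of_trends(observations)  # today + (today - yesterday)

print(level_forecast, trend_forecast)
```

Persistence of levels needs only one observation, while persistence of trends needs two, which is already a small instance of a more complex method demanding more data.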
Here are some measures of complexity for forecasting methods:
- Complexity measures for computer code executing the method, including length and complexity of the code, as well as time requirements and memory requirements for execution.
- Complexity measures for the underlying mathematical or statistical structure. This could be measured as the amount of mathematical or statistical sophistication needed to understand or implement the model, or the amount of sophistication needed to come up with the model or understand what's going on underneath and why the method works. It could also be measured as the length of the description of the relevant theorems, their proofs, and supporting definitions.
- The diversity of trend types that the model can describe. Some models are capable of only capturing a very restricted class of trends, such as linear trends only or exponential trends only. Others can capture any trend of a general functional form. Yet others can capture practically any continuous function given enough data. In the language of the bias-variance dilemma, simple models tend to have higher bias and lower variance and complex models tend to have lower bias and higher variance.
- The minimum amount of data needed for the model to start outperforming other competing models.
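The last two measures can be made concrete with a small simulation. The sketch below is a toy setup, not a result from the forecasting literature: it assumes a series with a weak linear trend (slope 0.05 per step) plus Gaussian noise, and compares persistence of levels (biased, since it ignores the trend, but low-variance) against a least-squares linear fit (unbiased here, but noisy when fitted on few points). All names and parameter values are hypothetical.

```python
import random


def persistence_forecast(history):
    """Simple method: the next value equals the latest value."""
    return history[-1]


def linear_fit_forecast(history):
    """More complex method: fit a least-squares line and extrapolate one step."""
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept + slope * n  # the next time index is n


def mean_squared_errors(n_history, n_trials=4000, slope=0.05, noise_sd=1.0, seed=0):
    """Monte Carlo estimate of one-step-ahead MSE for both methods."""
    rng = random.Random(seed)
    se_persist = 0.0
    se_linear = 0.0
    for _ in range(n_trials):
        series = [slope * t + rng.gauss(0.0, noise_sd) for t in range(n_history + 1)]
        history, target = series[:-1], series[-1]
        se_persist += (target - persistence_forecast(history)) ** 2
        se_linear += (target - linear_fit_forecast(history)) ** 2
    return se_persist / n_trials, se_linear / n_trials


for n in (3, 50):
    mse_persist, mse_linear = mean_squared_errors(n)
    print(f"history={n:2d}  persistence MSE={mse_persist:.2f}  linear fit MSE={mse_linear:.2f}")
```

Under these assumptions, the linear fit does worse than persistence with only 3 observations and better with 50. The crossover point is an example of the "minimum amount of data" measure.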
Do complicated methods beat simpler methods?
The Makridakis Competitions are often cited as canonical sources of information for how different types of quantitative trend forecasting compare. Makridakis and Hibon draw four conclusions (listed on the linked page and in their papers) of which Finding 1 is most relevant to us: "Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones." Some people (such as Nassim Nicholas Taleb) have used this to argue that sophisticated methods are useless.
The conclusion drawn by Makridakis and Hibon is supported by the data, but there is less to it than meets the eye. As noted earlier in the post, even the most revolutionary and impressive complicated scientific paradigms (such as relativity and quantum mechanics) only rarely outperform the simpler, more widely known paradigms (such as classical mechanics) except in cases that are designed to draw on the strength of the new paradigm (such as high speed or small length scales). And yet, in the cases where those slight improvements matter, we may be able to improve a lot by using the more sophisticated model. Just as knowledge of relativity makes possible a high-precision GPS that would have been impossible otherwise, new forecasting paradigms may make possible things (such as just-in-time inventory management) that would not have been possible at anywhere near that level of quality otherwise.
Of course, the selection of the sophisticated method matters: some sophisticated methods are simply wrong-headed and will therefore underperform simpler methods except in specially rigged situations. But the key point here is that an appropriately selected sophisticated model with access to adequate data and computational resources can systematically outperform simpler models. Finding 2 for the Makridakis Competitions is "The relative ranking of the performance of the various methods varies according to the accuracy measure being used." Finding 4 says "The accuracy of the various methods depends on the length of the forecasting horizon involved." The choice of best method also varies across types of time series (so the best method for macroeconomic time series could differ from the best method for time series provided by industries for their production or sales data).
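Finding 2 is easy to reproduce in miniature. In the sketch below, the two sets of forecasts are made-up numbers chosen for illustration (they are not from the Makridakis data): method A is always off by a small amount, while method B is usually exact but has one large miss. Mean absolute error and root mean squared error then rank the two methods in opposite orders.

```python
import math


def mean_absolute_error(actual, forecast):
    """Average size of the forecast errors."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def root_mean_squared_error(actual, forecast):
    """Square root of the average squared error; penalizes large misses more."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


# Hypothetical data for illustration only.
actual = [100, 100, 100, 100]
method_a = [104, 96, 104, 96]    # always off by 4
method_b = [100, 100, 100, 110]  # exact three times, off by 10 once

mae_a = mean_absolute_error(actual, method_a)      # 4.0
mae_b = mean_absolute_error(actual, method_b)      # 2.5
rmse_a = root_mean_squared_error(actual, method_a)  # 4.0
rmse_b = root_mean_squared_error(actual, method_b)  # 5.0

print("Best by MAE: ", "A" if mae_a < mae_b else "B")
print("Best by RMSE:", "A" if rmse_a < rmse_b else "B")
```

Method B wins on mean absolute error and method A wins on root mean squared error, so which method counts as "best" depends on how much large misses matter for the application.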
Duncan Watts makes a similar point in his book Everything Is Obvious: Once You Know the Answer (paraphrased): sophisticated methods don't offer a huge advantage over simpler methods. But the best sophisticated methods are modestly better. And if you're operating at a huge scale (for instance, if you're running an electrical utility that needs to forecast consumer demand, or you're Walmart and you need to manage inventory to minimize waste, or if you're Google or Facebook and need to forecast the amount of traffic in order to budget appropriately for servers), even modest proportional improvements to accuracy can translate to huge absolute reductions in waste and increases in profits.
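The scale argument is simple arithmetic. The sketch below uses a deliberately crude toy model, in which waste is proportional to volume times forecast error rate, with hypothetical volumes and error rates (expressed in basis points to keep the arithmetic exact). None of these figures describe a real company.

```python
def annual_waste(annual_volume_usd, error_rate_bps):
    """Toy model: waste = volume * error rate (rate given in basis points)."""
    return annual_volume_usd * error_rate_bps // 10_000


# Hypothetical scenario: a better method cuts forecast error from 10% to 9.5%.
simple_method_bps = 1_000
sophisticated_method_bps = 950

# Hypothetical annual volumes, in US dollars.
small_shop_volume = 1_000_000
large_retailer_volume = 100_000_000_000

for name, volume in [("small shop", small_shop_volume),
                     ("large retailer", large_retailer_volume)]:
    saving = (annual_waste(volume, simple_method_bps)
              - annual_waste(volume, sophisticated_method_bps))
    print(f"{name}: saves ${saving:,} per year")
```

The same half-percentage-point improvement is worth a few thousand dollars to the small shop and hundreds of millions to the large retailer, which is why the fixed cost of developing a sophisticated method can pay off only at scale.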
The evolution of complicated methods
Complicated methods can start off as performing a lot worse than simpler methods, and therefore be deemed useless. But then, at some point, they could start overtaking simpler methods, and once they overtake, they could rapidly gain on the simpler methods. What might change in the process? It could be any of these three, or some combination thereof.
- More data becomes available. This could arise because new measurement setups get deployed, or because existing measurement setups get a longer time series or get refined to a higher resolution. We can argue that with the advent of the Internet, it's much easier to collect a large amount of data, making it possible to use more complicated methods whose relative success depends on having more data available.
- More computational power becomes available. This could arise due to improvements in computing technology, or the building out of more computers. For instance, weather simulations today can use thousands of times as much computing power as weather simulations 30 years ago. Therefore, they can work with finer divisions of the grid on which forecasting is being done, allowing for more accurate weather simulation.
- The method itself, and/or the code to implement it, improves. Tweaks and edge case improvements to existing algorithms can improve them enough that they perform better, even holding data and computational power constant. Sometimes, the improvements require investing in customized hardware or backend software, which takes some time to develop after the method is first released. In other cases, it's just about people coming up with new incremental improvements over the idea.
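To get a feel for why grid-based weather simulation is so hungry for computing power, here is a back-of-the-envelope sketch. It rests on an assumption that is not from this post: that refining the grid by some factor in all three spatial dimensions also forces the time step to shrink by the same factor, so that cost grows as the fourth power of the refinement factor. Real models may refine only some dimensions, so the exponent is left as a parameter.

```python
def compute_cost_multiplier(refinement_factor, scaling_exponent=4):
    """Relative compute cost of refining the grid by `refinement_factor`.

    Assumption (hypothetical default): 3 spatial dimensions + 1 for the
    shorter time step, giving an exponent of 4.
    """
    return refinement_factor ** scaling_exponent


for factor in (2, 4, 8):
    print(f"{factor}x finer grid -> {compute_cost_multiplier(factor):,}x compute")
```

Under this assumption, a grid that is 8 times finer in each direction already requires about 4,000 times the compute, so a thousands-fold increase in computing power buys a less-than-tenfold improvement in resolution.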
How do we judge the potential and promise of the new complicated forecasting method?
Given a complicated method that people claim could work given sufficient data or computing power that we don't yet have access to, how are we to judge the plausibility of the claims? The question is similar to the general question of whether a newly proclaimed model or theory is the harbinger of a paradigm shift in a scientific discipline. I don't have satisfactory answers. In a subsequent post, I'll look at a few historical and current examples of paradigm shifts in forecasting. The examples that I currently plan to cover are:
- The paradigm shift in weather forecasting from the situation where persistence and climatology were the most effective methods to the situation where numerical weather prediction became the most reliable.
- The ongoing potential paradigm shift in the direction of using neural nets for a wide range of forecasting and prediction problems.
Any thoughts on the post as a whole would be appreciated, but I'm particularly interested in thoughts on this last topic in the post.
Thanks to Luke Muehlhauser for helpful early discussions that led to this post and to Jonah Sinick for his thoughts on an early draft of the post.
6 comments
comment by buybuydandavis · 2014-05-10T02:56:06.441Z · LW(p) · GW(p)
I wouldn't generalize too much from a forecasting competition.
Per Wolpert's No Free Lunch theorems, algorithm performance depends on fit to problem domain. The winner is likely a guy who lucked out on the choice of performance evaluation which fit his algorithm better than the competition. It doesn't mean he'll win the next competition. And it doesn't mean he isn't good, but it likely means that he was good and lucky.
How do we judge the potential and promise of the new complicated forecasting method?
Theory and judgment play a part.
When I first saw the Deep Learning method presented by Hinton, I was confident that it would be good without seeing the results, as it looked like a great theoretical approach, attacking the problem the right way.
Same thing with Wolpert and Stacked Generalization.
What to bet on? Things that theoretically look good, but are currently computationally cost prohibitive. As computers improve, there is an algorithmic land grab by researchers rushing into the areas that become computationally tractable.
Replies from: gwern↑ comment by gwern · 2014-05-10T21:11:47.509Z · LW(p) · GW(p)
Per Wolpert's No Free Lunch theorems, algorithm performance depends on fit to problem domain.
Aren't all these forecasting competitions using real data from real-world problems, and so NFL is irrelevant?
Replies from: buybuydandavis↑ comment by buybuydandavis · 2014-05-17T23:20:46.326Z · LW(p) · GW(p)
NFL not relevant to the real world? Would you like to elaborate?
Replies from: gwern↑ comment by gwern · 2014-05-18T01:27:32.494Z · LW(p) · GW(p)
Real-world problems are not a random sampling from all possible problems and there's plenty of structure to exploit, so invoking NFL in this context seems odd to me.
Replies from: buybuydandavis↑ comment by buybuydandavis · 2014-05-18T22:55:38.338Z · LW(p) · GW(p)
A real world competition isn't a random sample of anything. It's a selection of some problems, with some data. The performance of any algorithm will depend on fit to those problems, with those data.
My takeaways from the NFL theorems - the problems in the real world are some structured subset of all possible problems, and the performance of any generalizer for a problem will depend on fit to that problem.
Replies from: gwern