Paradigm shifts in forecasting

post by VipulNaik · 2014-05-08T19:38:11.822Z · LW · GW · Legacy · 6 comments

Contents

  Thomas Kuhn and paradigm shifts
  Simple methods in science are often "good enough" until you want a much higher "resolution"
  Complexity and paradigms in the context of forecasting
  Do complicated methods beat simpler methods?
  The evolution of complicated methods
  How do we judge the potential and promise of the new complicated forecasting method?

This post has been written in connection with work I'm doing for the Machine Intelligence Research Institute (MIRI), but hasn't been formally vetted by MIRI. I'm posting this to LessWrong because of its potential interest to a segment of the LessWrong readership. As always, all thoughts are appreciated.

In this post, I'll try to apply some of the scientific theory of paradigm shifts to the domain of forecasting. In a sense, all of science is about making (conditional) predictions about the behavior of systems. Forecasting simply refers to the act of making predictions about the real-world future rather than about a specific controlled experimental setup. So while the domain of forecasting is far more restricted than the domain of science, we can still apply the conceptual framework of paradigm shifts in science to forecasting.

Thomas Kuhn and paradigm shifts

Thomas Kuhn's book The Structure of Scientific Revolutions (Amazon, Wikipedia) provides a detailed descriptive theory of the nature of paradigm shifts in science. Quoting from the Wikipedia page on paradigm shifts:

An epistemological paradigm shift was called a "scientific revolution" by epistemologist and historian of science Thomas Kuhn in his book The Structure of Scientific Revolutions.

A scientific revolution occurs, according to Kuhn, when scientists encounter anomalies that cannot be explained by the universally accepted paradigm within which scientific progress has thereto been made. The paradigm, in Kuhn's view, is not simply the current theory, but the entire worldview in which it exists, and all of the implications which come with it. This is based on features of landscape of knowledge that scientists can identify around them.

There are anomalies for all paradigms, Kuhn maintained, that are brushed away as acceptable levels of error, or simply ignored and not dealt with (a principal argument Kuhn uses to reject Karl Popper's model of falsifiability as the key force involved in scientific change). Rather, according to Kuhn, anomalies have various levels of significance to the practitioners of science at the time. To put it in the context of early 20th century physics, some scientists found the problems with calculating Mercury's perihelion more troubling than the Michelson-Morley experiment results, and some the other way around. Kuhn's model of scientific change differs here, and in many places, from that of the logical positivists in that it puts an enhanced emphasis on the individual humans involved as scientists, rather than abstracting science into a purely logical or philosophical venture.

When enough significant anomalies have accrued against a current paradigm, the scientific discipline is thrown into a state of crisis, according to Kuhn. During this crisis, new ideas, perhaps ones previously discarded, are tried. Eventually a new paradigm is formed, which gains its own new followers, and an intellectual "battle" takes place between the followers of the new paradigm and the hold-outs of the old paradigm. Again, for early 20th century physics, the transition between the Maxwellian electromagnetic worldview and the Einsteinian Relativistic worldview was neither instantaneous nor calm, and instead involved a protracted set of "attacks," both with empirical data as well as rhetorical or philosophical arguments, by both sides, with the Einsteinian theory winning out in the long run. Again, the weighing of evidence and importance of new data was fit through the human sieve: some scientists found the simplicity of Einstein's equations to be most compelling, while some found them more complicated than the notion of Maxwell's aether which they banished. Some found Eddington's photographs of light bending around the sun to be compelling, while some questioned their accuracy and meaning. Sometimes the convincing force is just time itself and the human toll it takes, Kuhn said, using a quote from Max Planck: "a new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it."[1]

After a given discipline has changed from one paradigm to another, this is called, in Kuhn's terminology, a scientific revolution or a paradigm shift. It is often this final conclusion, the result of the long process, that is meant when the term paradigm shift is used colloquially: simply the (often radical) change of worldview, without reference to the specificities of Kuhn's historical argument.

Simple methods in science are often "good enough" until you want a much higher "resolution"

It's worth noting that most paradigm shifts move from simpler, more tractable models to more complicated ones. Initially, the scientific theory is not trying to explain the real world at too fine a resolution, and therefore it is tolerant of large errors. The theories in vogue initially are the simplest among those that can explain the world within the generous margin of error. Over time, as measurement becomes more precise and accurate, and the desire for understanding or engineering at higher precision grows, the focus shifts to finding a model with lower error rates, accepting a possible increase in model complexity.

Consider the following examples:

Complexity and paradigms in the context of forecasting

There are often competing methods for forecasting a given indicator. The methods vary considerably in complexity. For instance, persistence is one of the simplest forecasting methods: persistence of levels means that tomorrow will be the same as today, whereas persistence of trends means that the difference between tomorrow and today equals the difference between today and yesterday. Somewhat more sophisticated than simple persistence are the variants of linear regression that are well-suited to time series and tackle the problems of both periodic fluctuation and noise. More sophisticated methods allow for functional forms obtained by additive or multiplicative combination, or composition, of the functional forms used in simpler methods.
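
To make the two persistence rules concrete, here is a minimal Python sketch. The function names and the sample numbers are hypothetical, chosen only for illustration; the only thing taken from the text above is the definition of each rule.

```python
from typing import Sequence


def persistence_of_levels(series: Sequence[float]) -> float:
    """Forecast the next value as equal to the most recent value."""
    if len(series) < 1:
        raise ValueError("need at least one observation")
    return series[-1]


def persistence_of_trends(series: Sequence[float]) -> float:
    """Forecast the next value by repeating the most recent change."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    return series[-1] + (series[-1] - series[-2])


# Hypothetical observations, ordered oldest to newest (last entry is "today").
observations = [10.0, 12.0, 15.0]

level_forecast = persistence_of_levels(observations)  # same as today
trend_forecast = persistence_of_trends(observations)  # today plus (today - yesterday)
print(level_forecast, trend_forecast)
```

With these made-up numbers, persistence of levels predicts 15.0 for tomorrow, while persistence of trends predicts 18.0, since the last observed change was +3.0.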

Here are some measures of complexity for forecasting methods:

Do complicated methods beat simpler methods?

The Makridakis Competitions are often cited as canonical sources of information for how different types of quantitative trend forecasting compare. Makridakis and Hibon draw four conclusions (listed on the linked page and in their papers) of which Finding 1 is most relevant to us: "Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones." Some people (such as Nassim Nicholas Taleb) have used this to argue that sophisticated methods are useless.

The conclusion drawn by Makridakis and Hibon is supported by the data, but there is less to it than meets the eye. As noted earlier in the post, even the most revolutionary and impressive complicated scientific paradigms (such as relativity and quantum mechanics) only rarely outperform the simpler, more widely known paradigms (such as classical mechanics) except in cases that are designed to draw on the strength of the new paradigm (such as high speed or small length scales). And yet, in the cases where those slight improvements matter, we may be able to improve a lot by using the more sophisticated model. Just as knowledge of relativity makes possible a high-precision GPS that would have been impossible otherwise, new forecasting paradigms may make possible things (such as just-in-time inventory management) that would not have been possible at anywhere near that level of quality otherwise.

Of course, the selection of the sophisticated method matters: some sophisticated methods are simply wrong-headed and will therefore underperform simpler methods except in situations rigged in their favor. But the key point here is that an appropriately selected sophisticated model with access to adequate data and computational resources can systematically outperform simpler models. Finding 2 for the Makridakis Competitions is "The relative ranking of the performance of the various methods varies according to the accuracy measure being used." Finding 4 says "The accuracy of the various methods depends on the length of the forecasting horizon involved." The choice of best method also varies across types of time series (so the best method for macroeconomic time series could differ from the best method for time series provided by industries for their production or sales data).
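
The point that rankings depend on the accuracy measure can be illustrated with a toy example. The sketch below is not drawn from the Makridakis data: the two "methods" and all numbers are invented for illustration. It compares mean absolute error (MAE) with root mean squared error (RMSE), two common accuracy measures, on the same forecasts.

```python
import math


def mean_absolute_error(actual, forecast):
    """Average of the absolute forecast errors."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def root_mean_squared_error(actual, forecast):
    """Square root of the average squared forecast error."""
    squared = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    return math.sqrt(squared / len(actual))


# Hypothetical actual values and two hypothetical methods' forecasts.
actual = [100, 100, 100, 100]
method_a = [101, 101, 101, 101]  # always slightly off
method_b = [100, 100, 100, 103]  # usually exact, with one larger miss

mae_a = mean_absolute_error(actual, method_a)
mae_b = mean_absolute_error(actual, method_b)
rmse_a = root_mean_squared_error(actual, method_a)
rmse_b = root_mean_squared_error(actual, method_b)

print("MAE :", mae_a, mae_b)    # method B has the lower MAE
print("RMSE:", rmse_a, rmse_b)  # method A has the lower RMSE
```

Here method B wins on MAE (0.75 versus 1.0) but loses on RMSE (1.5 versus 1.0), because RMSE penalizes the single large miss more heavily. Which method counts as "best" therefore depends on which kind of error matters for the application.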

Duncan Watts makes a similar point in his book Everything is Obvious: Once You Know The Answer (paraphrased): sophisticated methods don't offer a huge advantage over simpler methods. But the best sophisticated methods are modestly better. And if you're operating at a huge scale (for instance, if you're running an electrical utility that needs to forecast consumer demand, or you're WalMart and you need to manage inventory to minimize waste, or if you're Google or Facebook and need to forecast the amount of traffic in order to budget appropriately for servers), even modest proportional improvements to accuracy can translate to huge absolute reductions in waste and increases in profits.

The evolution of complicated methods

Complicated methods can start off performing a lot worse than simpler methods, and therefore be deemed useless. But then, at some point, they could start overtaking simpler methods, and once they overtake, they could rapidly pull ahead of the simpler methods. What might change in the process? It could be any of these three, or some combination thereof.

  1. More data becomes available. This could arise because new measurement setups get deployed, or because existing measurement setups get a longer time series or get refined to a higher resolution. We can argue that with the advent of the Internet, it's much easier to collect a large amount of data, making it possible to use more complicated methods whose relative success depends on having more data available.
  2. More computational power becomes available. This could arise due to improvements in computing technology, or the building out of more computers. For instance, weather simulations today can use thousands of times as much computing power as weather simulations 30 years ago. Therefore, they can work with finer divisions of the grid on which forecasting is being done, allowing for more accurate weather simulation.
  3. The method itself, and/or the code to implement it, improve. Tweaks and edge case improvements to existing algorithms can improve them enough that they perform better, even holding data and computational power constant. Sometimes, the improvements require investing in customized hardware or backend software, which take some time to develop after the method is first released. In other cases, it's just about people coming up with new incremental improvements over the idea.

How do we judge the potential and promise of the new complicated forecasting method?

Given a complicated method that people claim could work given sufficient data or computing power that we don't yet have access to, how are we to judge the plausibility of the claims? The question is similar to the general question of whether a newly proclaimed model or theory is the harbinger of a paradigm shift in a scientific discipline. I don't have satisfactory answers. In a subsequent post, I'll look at a few historical and current examples of paradigm shifts in forecasting. The examples that I currently plan to cover are:

Any thoughts on the post as a whole would be appreciated, but I'm particularly interested in thoughts on this last topic in the post.

Thanks to Luke Muehlhauser for helpful early discussions that led to this post and to Jonah Sinick for his thoughts on an early draft of the post.

6 comments

Comments sorted by top scores.

comment by buybuydandavis · 2014-05-10T02:56:06.441Z · LW(p) · GW(p)

I wouldn't generalize too much from a forecasting competition.

Per Wolpert's No Free Lunch theorems, algorithm performance depends on fit to problem domain. The winner is likely a guy who lucked out on the choice of performance evaluation which fit his algorithm better than the competition. It doesn't mean he'll win the next competition. And it doesn't mean he isn't good, but it likely means that he was good and lucky.

How do we judge the potential and promise of the new complicated forecasting method?

Theory and judgment play a part.

When I first saw the Deep Learning method presented by Hinton, I was confident that it would be good without seeing the results, as it looked like a great theoretical approach, attacking the problem the right way.

Same thing with Wolpert and Stacked Generalization.

What to bet on? Things that theoretically look good, but are currently computationally cost prohibitive. As computers improve, there is an algorithmic land grab by researchers rushing into the areas that become computationally tractable.

Replies from: gwern
comment by gwern · 2014-05-10T21:11:47.509Z · LW(p) · GW(p)

Per Wolpert's No Free Lunch theorems, algorithm performance depends on fit to problem domain.

Aren't all these forecasting competitions using real data from real-world problems, and so NFL is irrelevant?

Replies from: buybuydandavis
comment by buybuydandavis · 2014-05-17T23:20:46.326Z · LW(p) · GW(p)

NFL not relevant to the real world? Would you like to elaborate?

Replies from: gwern
comment by gwern · 2014-05-18T01:27:32.494Z · LW(p) · GW(p)

Real-world problems are not a random sampling from all possible problems and there's plenty of structure to exploit, so invoking NFL in this context seems odd to me.

Replies from: buybuydandavis
comment by buybuydandavis · 2014-05-18T22:55:38.338Z · LW(p) · GW(p)

A real world competition isn't a random sample of anything. It's a selection of some problems, with some data. The performance of any algorithm will depend on fit to those problems, with those data.

My takeaways from the NFL theorems - the problems in the real world are some structured subset of all possible problems, and the performance of any generalizer for a problem will depend on fit to that problem.

Replies from: gwern
comment by gwern · 2014-05-20T17:27:19.566Z · LW(p) · GW(p)

The performance of any algorithm will depend on fit to those problems, with those data.

That's not chopped liver.