Machine Learning Can't Handle Long-Term Time-Series Data

post by lsusr · 2020-01-05T03:43:15.981Z · score: 2 (21 votes) · LW · GW · 10 comments

Contents

  Self-Driving Cars
  AlphaStar
  Recurrent Neural Networks (RNNs)
  The Road Ahead

More precisely, today's machine learning (ML) systems cannot infer a fractal structure from time series data.

This may come as a surprise because computers seem like they can understand time series data. After all, aren't self-driving cars, AlphaStar and recurrent neural networks all evidence that today's ML can handle time series data?

Nope.

Self-Driving Cars

Uber's Self-Crashing Car

Self-driving cars use a hybrid of ML and procedural programming. ML (statistical programming) handles the low-level stuff like recognizing pedestrians. Procedural (nonstatistical) programming handles high-level stuff like navigation. The details of self-driving car software are trade secrets, but we can infer bits of Uber's architecture from the National Transportation Safety Board's report on Uber's self-crashing car as summarized [LW · GW] by jkaufman.

"If I'm not sure what it is, how can I remember what it was doing?" The car wasn't sure whether Herzberg and her bike were a "Vehicle", "Bicycle", "Unknown", or "Other", and kept switching between classifications. This shouldn't have been a major issue, except that with each switch it discarded past observations. Had the car maintained this history it would have seen that some sort of large object was progressing across the street on a collision course, and had plenty of time to stop.

What we see here (above) is that the car throws away its past observations. Now let's take a look at a consequence of this.

"If we see a problem, wait and hope it goes away". The car was programmed to, when it determined things were very wrong, wait one second. Literally. Not even gently apply the brakes. This is absolutely nuts. If your system has so many false alarms that you need to include this kind of hack to keep it from acting erratically, you are not ready to test on public roads.

Humans have to write ugly hacks like this when a system isn't architected bottom-up to handle concepts like the flow of time. A machine learning system designed to handle time series data should never have human beings in the loop this low down the ladder of abstraction. In other words, Uber effectively uses a stateless ML system.

All you need to know when driving a car is the position and velocity of the objects around you. You almost never need to know the history of another driver, or even your own. To a stateless system there is no such thing as time; it cannot understand the concept at all. For driving a car, that's usually fine: stateless ML systems make sense here.
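To make the distinction concrete, here is a minimal sketch of stateless versus stateful object tracking. The class names and structure are hypothetical, invented for illustration; nothing below is Uber's actual code.

```python
# Hypothetical sketch: the failure mode of discarding history on reclassification.

class StatelessTracker:
    """Resets an object's history whenever its classification changes."""
    def __init__(self):
        self.history = []  # past (x, y) positions
        self.label = None

    def observe(self, label, position):
        if label != self.label:  # "Vehicle" -> "Bicycle" -> "Unknown" -> ...
            self.history = []    # past observations are thrown away
            self.label = label
        self.history.append(position)


class StatefulTracker:
    """Keeps the trajectory no matter what the object is labeled."""
    def __init__(self):
        self.history = []
        self.label = None

    def observe(self, label, position):
        self.label = label             # the label may flip...
        self.history.append(position)  # ...but the trajectory persists

    def velocity(self):
        """A persistent history lets us spot a collision course."""
        if len(self.history) < 2:
            return None
        (x0, y0), (x1, y1) = self.history[-2], self.history[-1]
        return (x1 - x0, y1 - y0)
```

The stateful version can extrapolate that a large object, whatever it is currently called, is moving across the street; the stateless version rediscovers the object from scratch after every reclassification.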

AlphaStar

AlphaStar (DeepMind's StarCraft II AI) is only a little more complicated than Uber's self-crashing car. It uses two neural networks: one network predicts the odds of winning and another network figures out which move to perform. This turns a time-series problem (what strategy to perform) into two separate stateless[1][2] problems.

Comparisons between AlphaStar and human beings are fudged because StarCraft II depends heavily on actions per minute (APM), the speed at which a player can perform actions. Humans wouldn't have a chance if AlphaStar were not artificially limited in the number of actions it could take. Games between humans and AlphaStar are only interesting because AlphaStar's actions are limited, thereby giving humans a handicap.

Without the handicap, AlphaStar crushes human beings tactically. With the handicap, AlphaStar still crushes human beings tactically. Human beings can beat AlphaStar on occasion only because elite StarCraft II players possess superior strategic understanding.

Most conspicuously, human beings know how to build walls with buildings. This requires a sequence of steps that don't generate a useful result until the last of them is completed. A wall is useless until the final building is put into place. AlphaStar (the red player in the image below) does not know how to build walls.

AlphaStar (red) fails to build walls

With infinite computing power, AlphaStar could eventually figure this out. But we don't have infinite computing power. I don't think AlphaStar will ever figure out how walls work given its current algorithms and realistic hardware limitations.

AlphaStar is good at tactics and bad at strategy. To state this more precisely, AlphaStar hits a computational cliff for understanding conceptually complex strategies when time horizons exceed the tactical level. Human brains are not limited in this way.

Recurrent Neural Networks (RNNs)

RNNs are neural networks with a form of short-term memory. A newly-invented[3] variant incorporates long short-term memory. In both cases, the RNN is trained with the standard backpropagation algorithm used by all artificial neural networks[4], unrolled through time. Backpropagation works fine on short timescales but quickly breaks down when the system must strategize conceptually over longer timescales. The algorithm hits a computational cliff.
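Here is a minimal sketch of where that cliff comes from, using the standard vanilla-RNN formulation (textbook math, not the internals of any particular production system). The same recurrent weight matrix is applied at every timestep, so a gradient flowing back through T steps picks up T factors of that matrix and tends to vanish or explode:

```python
import numpy as np

# A vanilla RNN step is h_t = tanh(W_h @ h_{t-1} + W_x @ x_t).
# Backpropagation through time multiplies the gradient by W_h's Jacobian
# once per timestep, so a signal from step T back to step 0 contains
# T such factors.

rng = np.random.default_rng(0)
n = 32

for gain in (0.9, 1.1):  # recurrent weights slightly small vs. slightly large
    W_h = gain / np.sqrt(n) * rng.normal(size=(n, n))
    grad = np.ones(n)
    for t in range(200):
        # Ignoring the tanh derivative (<= 1, which only shrinks things
        # further), each backward step multiplies by W_h transposed.
        grad = W_h.T @ grad
    print(gain, np.linalg.norm(grad))  # one vanishes toward 0, one blows up
```

LSTMs soften this with gating, but the training signal still has to be threaded backward step by step through every intermediate state.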

This is exactly the behavior we observe from AlphaStar. It's also the behavior we observe in natural language processing and music composition. ML can answer a simple question just fine but has trouble maintaining a conversation. ML can generate classical music just fine but can't figure out the chorus/verse system used in rock & roll. That's because the former can be constructed stochastically without any hidden variables while the latter cannot.
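To illustrate the hidden-variable distinction with a toy of my own (not taken from any of the systems above): a first-order Markov chain generates each event from the current observable state alone, which is enough for local plausibility but can never represent "we are in the second verse, so the chorus comes next."

```python
import random

# A process with no hidden variables: each chord depends only on the
# current chord. Locally plausible, globally structureless.
transitions = {
    "C":  ["F", "G", "Am"],
    "F":  ["C", "G"],
    "G":  ["C", "Am"],
    "Am": ["F", "G"],
}

chord = "C"
progression = [chord]
for _ in range(15):
    chord = random.choice(transitions[chord])
    progression.append(chord)

print(" ".join(progression))
```

A chorus/verse song form needs a hidden variable ("which section am I in?") that persists across many events, which is exactly what this kind of process lacks.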

The Road Ahead

This brings us to my first law of artificial intelligence.

Any algorithm that is not organized fractally will eventually hit a computational wall, and vice versa.

―Lsusr's First Law of Artificial Intelligence

For a data structure to be organized fractally means you can cut a piece off and that piece will be a smaller version of the original dataset. For example, if you cut a sorted list in half then you end up with two smaller sorted lists. (This is part of why quicksort works.) You don't have to cut a sorted list exactly in the middle to get two smaller sorted lists. The sorted list's fractal structure means you can cut the list anywhere. In this way, a sorted list is organized fractally along one dimension. Other examples of fractal data structures are heaps and trees.
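A minimal demonstration of that property (plain Python, nothing specific to this post):

```python
# Cutting a sorted list anywhere yields two smaller sorted lists.
xs = sorted([5, 3, 8, 1, 9, 2, 7])  # [1, 2, 3, 5, 7, 8, 9]

for cut in range(len(xs) + 1):
    left, right = xs[:cut], xs[cut:]
    assert left == sorted(left)    # the piece is a smaller sorted list
    assert right == sorted(right)  # and so is the remainder

# Quicksort exploits this: partitioning around a pivot produces two
# subproblems with the same shape as the original problem.
```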

Another fractal structure is a feed forward neural network (FFNN). FFNNs are organized fractally along two dimensions; there are two ways you can cut a neural network in half and get two smaller neural networks. The most obvious is to cut the network in half at a hidden layer. To do this, duplicate the hidden layer and then cut between the pair of duplicated layers. The less obvious way to cut a neural network is to slice between its input/output nodes.
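Here is a sketch of the first kind of cut on a toy numpy network (illustrative only; any real framework's API will differ). Slicing at a hidden layer factors the network into two smaller networks whose composition computes exactly the original function:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
W1 = rng.normal(size=(16, 8))   # input layer -> hidden layer 1
W2 = rng.normal(size=(16, 16))  # hidden layer 1 -> hidden layer 2
W3 = rng.normal(size=(4, 16))   # hidden layer 2 -> output

def full_network(x):
    return W3 @ relu(W2 @ relu(W1 @ x))

# Cut at hidden layer 2: the first half ends where the second half begins.
def first_half(x):
    return relu(W2 @ relu(W1 @ x))  # input -> hidden activation

def second_half(h):
    return W3 @ h                   # hidden activation -> output

x = rng.normal(size=8)
assert np.allclose(full_network(x), second_half(first_half(x)))
```

Each half is itself a smaller feed forward network, which is the fractal property in the hidden-layer dimension.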

Each dimension of fractality is a dimension along which the FFNN can scale indefinitely. FFNNs are good at scaling the number of input and output nodes they possess because an FFNN is structured fractally along this dimension (number of input/output nodes). FFNNs are good at scaling the complexity of the processing they perform because an FFNN is structured fractally along this dimension too (number of hidden layers).

Much of the recent progress in image recognition comes from these two dimensions of fractality[5]. Image recognition has a high number of input nodes (all the color channels of all the pixels in an image). An FFNN can apply complex rules to this large input space because of its fractal geometry along the hidden-layer dimension.

FFNNs are stateless machines, so feeding time series data into an FFNN doesn't make sense. RNNs can handle time series data, but they have no mechanism for organizing it fractally. Without a fractal structure in the time dimension, RNNs cannot generalize information from short time horizons to long time horizons. They therefore do not have enough data to formulate complex strategies on long time horizons.

If we could build a neural network fractally-organized in the time domain then it could generalize (apply transfer learning) from short time horizons to long time horizons. This turns a small data problem into a big data problem. Small data problems are hard. Big data problems are easy.

This is why I'm so interested in Connectome-Specific Harmonic Waves [LW · GW] (CSHW). The fractal [LW · GW][6] equation of harmonic waves (Laplacian eigendecomposition) could answer the problem of how to structure a neural network fractally in the time domain.
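For reference, here is the decomposition in question. This is standard graph-harmonic math; its application to structuring neural networks in time is this post's speculation, not established practice. Given a connectome graph with adjacency matrix A and degree matrix D:

```latex
L = D - A
    \qquad \text{(the graph Laplacian)}

L \varphi_k = \lambda_k \varphi_k,
    \qquad 0 = \lambda_0 \le \lambda_1 \le \lambda_2 \le \cdots
```

The eigenvectors φ_k are the graph's harmonic modes. In wave dynamics driven by L, mode k oscillates at a frequency proportional to √λ_k, so low modes are spatially smooth and slow while high modes are fine-grained and fast. A whole hierarchy of timescales falls out of a single operator, which is the sense in which such a structure could be fractal in the time domain.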


  1. AlphaStar does contain one bit of time-series comprehension. It can predict the actions of enemy units hidden by fog of war. I'm choosing to ignore this on the grounds it isn't an especially difficult problem. ↩︎

  2. Edit: This is incorrect. It describes AlphaGo, not AlphaStar. AlphaStar uses stateful systems. ↩︎

  3. Edit: This is incorrect. Long short-term memory is not new. ↩︎

  4. The human brain lacks a known biological mechanism for performing the backpropagation algorithm used by artificial neural networks. Therefore biological neural networks probably use a different equation for gradient ascent. ↩︎

  5. Progress also comes from the application of parallel GPUs to massive datasets, but scaling in this way wouldn't be mathematically viable without the two-dimensional fractal structure of FFNNs. ↩︎

  6. Edit: Added hyperlink. ↩︎

10 comments

Comments sorted by top scores.

comment by interstice · 2020-01-05T06:12:34.177Z · score: 9 (6 votes) · LW(p) · GW(p)

Today's neural networks definitely have trouble with more 'structured' problems, but I don't think that 'neural nets can't learn long time-series data' is a good way of framing this. To go through your examples:

This shouldn’t have been a major issue, except that with each switch it discarded past observations. Had the car maintained this history it would have seen that some sort of large object was progressing across the street on a collision course, and had plenty of time to stop.

From a brief reading of the report, it sounds like this control logic is part of the system surrounding the neural network, not the network itself.

One network predicts the odds of winning and another network figures out which move to perform. This turns a time-series problem (what strategy to perform) into two separate stateless[1] problems.

I don't see how you think this is 'stateless'. AlphaStar's architecture contains an LSTM ('Core') which is then fed into the value and move networks, similar to most time-series applications of neural networks.

Most conspicuously, human beings know how to build walls with buildings. This requires a sequence of steps that don’t generate a useful result until the last of them are completed. A wall is useless until the last building is put into place. AlphaStar (the red player in the image below) does not know how to build walls.

But the network does learn how to build its economy, which also doesn't pay off for a very long time. I think the issue here is more about a lack of 'reasoning' skills than time-scales: the network can't think conceptually, and so doesn't know that a wall needs to completely block off an area to be useful. It just learns a set of associations.

ML can generate classical music just fine but can’t figure out the chorus/verse system used in rock & roll.

MuseNet was trained from scratch on MIDI data, but it's still able to generate music with lots of structure on both short and long time scales. GPT2 does the same for text. I'm not sure if MuseNet is able to generate chorus/verse structures in particular, but again this seems more like an issue of lack of logic/concepts than time scales (that is, MuseNet can make pieces that 'sound right' but has no conceptual understanding of their structure).

I'll note that AlphaStar, GPT2, and MuseNet all use the Transformer architecture, which seems quite effective for structured time-series data. I think this is because its attentional mechanism lets it zoom in on the relevant parts of past experiences.

I also don't see how connectome-specific waves are supposed to help. I think(?) your suggestion is to store slow-changing data in the largest eigenvectors of the Laplacian -- but why would this be an improvement? It's already the case (by the nature of the matrix) that the largest eigenvectors of e.g. an RNN's transition matrix will tend to store data for longer time periods.

comment by lsusr · 2020-01-05T07:01:32.852Z · score: 5 (4 votes) · LW(p) · GW(p)

Thank you for the correction. AlphaStar is not completely stateless (even ignoring fog-of-war-related issues).

I think the issue here is more about a lack of 'reasoning' skills than time-scales: the network can't think conceptually...

This is exactly what I mean. The problem I'm trying to elucidate is that today's ML techniques can't create good conceptual bridges from short time-scale data to long time-scale data (and vice-versa). In other words, they cannot generalize concepts from one time scale to another. If we want to take ML to the next level then we'll have to build a system that can. We may disagree about how best to phrase this but I think we're on the same page concerning the capabilities of today's ML systems.

As for connectome-specific harmonic waves, yes, my suggestion is to store slow-changing data in the largest eigenvectors of the Laplacian. The problem with LSTM (and similar RNN systems) is that there's a combinatorial explosion[1] when you try to backpropagate their state cells. This is the computational cliff I mentioned in the article.

The human brain has no known mechanism for conventional backpropagation in the style of artificial neural networks. I believe no such mechanism exists. I hypothesize instead that the human brain doesn't run into the aforementioned computational cliff because there's no physical mechanism to hit that cliff.

So if the human brain doesn't use backpropagation then what does it use? I think a combination of Laplacian eigenvectors and predictive modeling. If everything so far is true then this sidesteps the RNN computational cliff. I think it uses something involving resonance[2] between state networks instead, but we can reach this conclusion without knowing how the human brain works.

This is promising for two related reasons: one involving power and the other involving trainability.

  • Concerning power: I think resonance could provide a conceptual bridge between shorter and longer time-scales. This solves the problem of fractal organization in the time domain and provides a computational mechanism for forming logic/concepts and then integrating them with larger/smaller parts of the internal conceptual architecture.
  • Concerning trainability: You don't have to backpropagate when training the human brain (because you can't). If CSHW plus predictive modeling is how the human brain performs gradient ascent then this could completely sidestep the aforementioned computational cliff involved in training RNNs. Such a machine would require a hyperlinearly smaller quantity of training data to solve complex problems.

I think these two ideas work together; the human brain sidesteps the computational cliff because it uses concepts (eigenvectors) in place of raw low-level associations.


  1. I mean that the necessary quantity of training data explodes, not that it's hard to calculate the backpropagated connection weights for a single training datum. ↩︎

  2. Two state networks in resonance automatically exchange information, and vice versa. ↩︎

comment by David Guild (david-guild) · 2020-01-05T19:28:54.992Z · score: 6 (5 votes) · LW(p) · GW(p)

Your description of AlphaStar is wrong; that's the architecture for AlphaGo. One network to evaluate positions and one to suggest good moves. IIRC AlphaStar has three networks which are (roughly) build, move, look.

comment by ozziegooen · 2020-01-05T10:11:16.692Z · score: 6 (4 votes) · LW(p) · GW(p)

This article made it to Hacker News, where it got a few comments.

https://news.ycombinator.com/item?id=21959874

comment by Akshat Agrawal · 2020-01-07T02:09:01.111Z · score: 3 (2 votes) · LW(p) · GW(p)

Would note that LSTMs have been around for a relatively long time--defined in 1997 by Hochreiter and Schmidhuber, see https://www.bioinf.jku.at/publications/older/2604.pdf.

comment by maximkazhenkov · 2020-05-14T13:43:59.760Z · score: 1 (1 votes) · LW(p) · GW(p)

ML can generate classical music just fine but can't figure out the chorus/verse system used in rock & roll.

This statement seems outdated: openai.com/blog/jukebox/

To me this development came as a surprise and correspondingly an update towards "all we need for AGI is scale".

comment by lsusr · 2020-05-14T17:16:23.045Z · score: 2 (1 votes) · LW(p) · GW(p)

Which one of their songs has a repeated chorus? I could not identify one in either the Elvis Presley rock song or the Katy Perry pop song.

comment by Victor Chen (victor-chen) · 2020-01-07T02:08:54.061Z · score: 1 (1 votes) · LW(p) · GW(p)

My point is that RNN+CNN can absolutely solve these problems, big data or small.

comment by Camilo Cruz (camilo-cruz) · 2020-01-05T19:29:09.499Z · score: 1 (1 votes) · LW(p) · GW(p)

Excellent article. The only thing I would disagree with is that state and state machines have been nothing but snubbed, when in fact they are essential. Remember the DARPA competition where the robots would fall, all stiff, without a chance to damage-control, for example. Same for the Uber car: nil reaction. There are too many mathematicians applying continuous functions, but we need more actual logic instead.

comment by Reitze Jansen (reitze-jansen) · 2020-01-05T19:28:48.500Z · score: 1 (1 votes) · LW(p) · GW(p)

RNN is a very naive approach as an architecture; this completely disregards state-of-the-art LSTMs & WaveNet. Bad argument.