# [Link]"Neural Turing Machines"

post by Prankster · 2014-10-31T08:54:43.151Z · score: 16 (17 votes) · LW · GW · Legacy · 22 commentsDiscusses the technical aspects of one of Googles AI projects. According to a pcworld the system "apes human memory and programming skills" (this article seems pretty solid, also contains link to the paper).

The abstract:

We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.

(First post here, feedback on the appropriateness of the post appreciated)

## 22 comments

Comments sorted by top scores.

The press typically describes DeepMind as "highly secretive," but actually they publish a ton of their research — including this paper — in all the usual venues: NIPS, arxiv, etc.

That sounds fascinating. Could you link to some non-paywalled examples?

Highly secretive sounds cooler (to most of their audience). Personally, I find open science exciting.

**[deleted]**· 2014-10-31T19:03:33.942Z · score: 2 (2 votes) · LW(p) · GW(p)

Thanks for the link.

I've been studying neural networks for a while and sometimes wondered how feasible this would be. It seems analogous to a human brain externally storing some of its memories in a rigid medium like a notebook.

Some good comments on related work in https://www.reddit.com/r/MachineLearning/comments/2kth6d/googles_secretive_deepmind_startup_unveils_a/

(After reading the paper, I still don't really get how you wire some RAM into a neural network. Maybe it makes more sense to others.)

I would naively assume similar to the way we interface our own, with some kind of mapping of a subset at a time of the memory to a sensory input to the network, which then just needs to remember itself how to look up the relevant subsets.

It's not really RAM, but rather a tape. (like a doubly linked list) The LSTM controller can't specify any location in logarithmic space / time. They add multiple tape readers at one point though.

This tape could be compared to the phonological loop and the stream of consciousness.

A neural network is just a function of some sort (inputs to outputs). So is a finite state machine (which is what a regular Turing machine uses). Put that way, nothing conceptually interesting is going on. As usual, the devil is in the details here.

The interesting thing is that they have essentially implemented a (sort of) Turing machine with differentiable functions, which enables training using gradient-based techniques.

In principle you could always train resource-bounded Turing machines using combinatorial optimization techniques such as classical planners, SAT solvers or ILP solvers, theoretical complexity will be PSPACE-complete or NP-complete depending on the constraints that you impose. In practice, this is very much computationally expensive and doesn't scale past small examples.

If this differentiable approach generalizes past toy examples, it will allow efficient training. In addition to the obvious practical applications (*) I think it may also have a deeper philosophical value:

Consider word embeddings and the Vector space model: words are appear like discrete objects at syntactic level, while they represent concepts which intuitively feel like they lie in some sort of differentiable manifold: "David Cameron", "Angela Merkel" and "Jürgen Schmidhuber" are all quite similar to each other on the "species" dimension, Cameron and Merkel are more similar to each other than to Schmidhuber on the "job" dimension, Cameron and Schmidhuber are more similar on the "gender" dimension, Merkel and Schmidhuber are more similar on the "nationality" dimension. The concepts of "species", "job", "gender" and "nationality" also have their own topology, and so on.

It is plausible that this ability to perceive an essentially continuous topology over discrete things is a fundamental property of how our mind works.

Modern techniques in natural language processing and information retrieval are able to learn these kind of representations from data, but these are generally representations of individual words or short phrases. It is hypothesized that you can represent the meaning of arbitrarily complex sentences, and thus ideas, by combining the representations of their constituents, but nobody currently has a clue on how to do it beside simple averaging (which yields the general topic) and a few other tricks.

What about algorithms? It seems plausible that algorithms are the most general way of describing ideas. Algorithms also look extremely discrete and structured things, at least when you look at them in the context of the usual mathematical formalisms and of the programming languages that derive from these formalisms.

But we still have an intuitive understanding that algorithms should fit in a differentiable manifold, where it is possible to consider their similarity along various dimensions.

Recurrent neural networks are a way of embedding algorithms in a vector space where you can continuously move along one dimension and the performance measure varies continuously. In principle they are Turing-complete (up to resource bounds). In practice they can only represent well a certain class of algorithms (the ones that compute functions which are "Markovian" w.r.t. their input). Outside that class of algorithms, RNNs tend to become chaotic and can't be effectively trained or even practically computed.

"Long short term memory" neural networks (by the aforementioned Schmidhuber et al.) try to overcome this issue using an architecture more similar to that of conventional digital circuits. Their basic element is a differentiable variant of the flip-flop unit. Just like you can use flip-flops and feed-forward boolean circuits to represent any finite-state machine in a discrete way, you can use LSTM cells and feed-forward MLP units to represent any finite-state machine in a differentiable way, which usually avoids chaotic dynamics. In practice they do well on certain problems but don't scale past a certain difficulty, possibly for the same reason that custom digital electronics doesn't really scale as a solution for increasingly complex problems: past a certain point, you want to "program" things.

This "Neural Turing model" is an attempt to do that. If it turns out that it generalizes past toy problems, then it means that it is a "natural" way to embed algorithms of practical interests in a differentiable manifold. It will be a way to access the hidden topology of algorithms, and thus arbitrary ideas, that we intuitively "see" but we weren't able to mathematically model so far.

( * and presumably scaring the s*it out of MIRI once Google plugs it to a reinforcement learning framework :) )

Another advantage is that neuronal nets in most cases allow for fairly strainforward incorporation of additional information (side channels, lower layers). If this is combinaed with the algorithm learning is does allow fairly general integrative capabilities.

If you want to put it that way, nothing conceptually interesting is going on in any neural network paper - we already know they're universal.

My problem is the details: I visualize neural networks as like pachinko machines or Galton's quincunx - you drop a bunch of bits (many balls) into the top (bottom) layer of neurons (pins) and they cascade down to the bottom based on the activation functions (spacing of pins & how many balls hit a pin simultaneously), and at the bottom (top) is emitted a final smaller output (1 ball somewhere). I don't get the details of what it means to add a memory to this many-to-one function.

Recurrent Neural Networks feed their outputs back into their inputs, and run continuously. Instead of visualizing a river flowing directly downstream, visualize a river with a little cove off to one side, where vortices swirl around.

I think another way to look at neural networks is they are nested non-linear regression models.

I am probably in the minority here, but I don't think the stuff in the OP is that interesting.

I hate the term "Neural Network", as do many serious people working in the field.

There are Perceptrons which were inspired by neurons but are quite different. There are other related techniques that optimize in various ways. There are real neurons which are very complex and rather arbitrary. And then there is the greatly simplified Integrate and Fire (IF) abstraction of a neuron, often with Hebbian learning added.

Perceptrons solve practical problems, but are not the answer to everything as some would have you believe. There are new and powerful kernal methods that can automatically condition data which extend perceptrons. There are many other algorithms such as learning hidden Markov models. IF neurons are used to try and understand brain functionality, but are not useful for solving real problems (far too computationally expensive for what they do).

Which one of these quite different technologies is being referred to as "Neural Network"?

The idea of wiring perceptrons back onto themselves with state is old. Perceptrons have been shown to be able to emulate just about any function, so yes, they would be Turing complete. Being able to learn meanginful weights for such "recurrent" networks is relatively recent (1990s?).

I'd think that deep neural networks as here with e.g. backprogation thru time/BPP are meant.