"Human-level control through deep reinforcement learning" - computer learns 49 different games

post by skeptical_lurker · 2015-02-26T06:21:33.036Z · score: 11 (12 votes) · LW · GW · Legacy · 19 comments

full text

 

This seems like an impressive first step towards AGI. The games, like Pong and Space Invaders, are perhaps not the most cerebral, but given that Deep Blue can only play chess, this is far more impressive IMO. They didn't even need to adjust hyperparameters between games.

 

I'd also like to see whether they can train a network that plays the same game on different maps without re-training, which seems a lot harder.

 

19 comments

comment by Sean_o_h · 2015-02-26T12:34:01.906Z · score: 5 (5 votes) · LW(p) · GW(p)

They've also released their code (for non-commercial purposes): https://sites.google.com/a/deepmind.com/dqn/

In other interesting news, a paper released this month describes a way of 'speeding up' neural net training, and an approach that achieves 4.9% top 5 validation error on Imagenet. My layperson's understanding is that this is the first time human accuracy has been exceeded on the Imagenet benchmarking challenge, and represents an advance on Chinese giant Baidu's progress reported last month, which I understood to be significant in its own right. http://arxiv.org/abs/1501.02876

"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", by Sergey Ioffe and Christian Szegedy

(Submitted on 11 Feb 2015 (v1), last revised 13 Feb 2015 (this version, v2))

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
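To make "normalizing layer inputs" concrete, here is a minimal sketch of the training-time transform for a single feature across one mini-batch. It is an illustration under stated assumptions, not the paper's code: the function name `batch_norm`, the default values for `gamma`, `beta` and `eps`, and the toy activations are all made up for this example, and the separate statistics used at inference time are left out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift.

    gamma and beta stand in for the learned scale and shift parameters;
    the defaults here are illustrative, not trained values.
    """
    n = len(batch)
    mean = sum(batch) / n
    # Biased (divide-by-n) variance over the mini-batch.
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# Hypothetical pre-activation values for one unit over a batch of four examples.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
```

Whatever the incoming scale, the normalized values have roughly zero mean and unit variance, which is the sense in which each layer sees a more stable input distribution.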

comment by jkrause · 2015-02-26T17:47:33.662Z · score: 5 (5 votes) · LW(p) · GW(p)

My layperson's understanding is that this is the first time human accuracy has been exceeded on the Imagenet benchmarking challenge, and represents an advance on Chinese giant Baidu's progress reported last month, which I understood to be significant in its own right. http://arxiv.org/abs/1501.02876

One thing to note about the number for human accuracy for ImageNet that's been going around a lot recently is that it was really a relatively informal experiment done by a couple of members of the Stanford vision lab (see section 6.4 of the paper for details). In particular, the number everyone cites was just one person, who, while he trained himself for quite a while to recognize the ImageNet categories, nonetheless was prone to silly mistakes from time to time. A more optimistic human error is probably closer to 3-4%, but with that in mind the recent results people have been posting are still extremely impressive.

It's also worth pointing out another paper from Microsoft Research that beat the 5.1% human performance and actually came out a few days before Google's. It's a decent read, and I wouldn't be surprised if people start incorporating elements from both MSR and Google's papers in the near future.
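For anyone unsure what the percentages in this comparison measure: top-5 error is the fraction of images whose true label is not among the five highest-ranked guesses. A small sketch, where the function name `top5_error` and the toy labels are hypothetical and only serve to illustrate the metric:

```python
def top5_error(ranked_predictions, true_labels):
    """Fraction of examples whose true label is absent from the first five guesses."""
    misses = sum(
        1
        for guesses, label in zip(ranked_predictions, true_labels)
        if label not in guesses[:5]
    )
    return misses / len(true_labels)

# Hypothetical toy data: each row lists class guesses, most confident first.
predictions = [
    ["tabby", "lynx", "tiger cat", "fox", "collie", "pug"],   # true label is 3rd: a hit
    ["pug", "collie", "beagle", "husky", "boxer", "tabby"],   # true label is 6th: a miss
]
labels = ["tiger cat", "tabby"]
error = top5_error(predictions, labels)
```

Under this metric a human or a network gets five attempts per image, which is part of why the figures are so low.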

comment by Houshalter · 2015-03-01T19:29:14.529Z · score: 1 (1 votes) · LW(p) · GW(p)

Here is the guy who tried to get his own accuracy on imagenet: https://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

Getting 5.1% error was really hard; it takes a lot of time to get familiar with the classes and to sort through reference images. The 3% error was an entirely hypothetical, optimistic estimate for a group of humans that makes no mistakes.

If you want to appreciate it, you can try the task yourself here: http://cs.stanford.edu/people/karpathy/ilsvrc/

comment by skeptical_lurker · 2015-02-26T12:56:16.370Z · score: 3 (3 votes) · LW(p) · GW(p)

I saw this paper before, and maybe I'm being an idiot but I didn't understand this:

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change.

I thought one generally trained the networks layer by layer, so layer n would be completely finished training before layer n+1 starts. Then there is no problem of "the distribution of each layer's inputs changes" because the inputs are fixed once training starts.

Admittedly, this is a problem if you don't have all the training data to start off with and want to learn incrementally, but AFAICT that is not generally the case in these benchmarking contests.

Regardless, it's amazing how simple DNNs are. People have been working on computer vision and AI for about 60 years, and then a program like this comes along which is only around 500 lines of code, conceptually simple enough to explain to anyone with a reasonable mathematical background, but can nevertheless beat humans at a reasonable range of tasks.

comment by gwern · 2015-02-26T18:01:47.132Z · score: 9 (9 votes) · LW(p) · GW(p)

Regardless, it's amazing how simple DNNs are. People have been working on computer vision and AI for about 60 years, and then a program like this comes along which is only around 500 lines of code, conceptually simple enough to explain to anyone with a reasonable mathematical background, but can nevertheless beat humans at a reasonable range of tasks.

I get the impression it's a hardware issue. See for example http://nautil.us/issue/21/information/the-man-who-tried-to-redeem-the-world-with-logic - McCulloch & Pitts invented neural networks almost before digital computers existed* and he was working on "three-dimensional neural networks". They didn't invent backpropagation, I don't think, but even if they had, how would they have run, much less trained, the state of the art many-layer neural networks with millions of nodes and billions of connections like we're seeing these days? What those 60 years of work gets you is a lot of specialized algorithms which don't reach human-parity but at least are computable on the hardware of that day.

* depends on what exactly you consider the first digital computer and how long before the key publication you date their breakthrough.

comment by jkrause · 2015-02-26T18:42:00.739Z · score: 8 (8 votes) · LW(p) · GW(p)

Can confirm that hardware (and data!) are the two main culprits here. The actual learning algorithms haven't changed much since the mid 1980s, but computers have gotten many times faster, GPUs are 30-100x faster still, and the amount of data has similarly increased by several orders of magnitude.

comment by jkrause · 2015-02-26T17:33:23.881Z · score: 5 (5 votes) · LW(p) · GW(p)

Training networks layer by layer was the trend from the mid to late 2000s up until early 2012, but that changed in mid 2012 when Alex Krizhevsky and Geoff Hinton finally got neural nets to work for large-scale tasks in computer vision. They simply trained the whole network jointly with stochastic gradient descent, which has remained the case for most neural nets in vision since then.

comment by skeptical_lurker · 2015-02-26T20:32:36.469Z · score: 3 (3 votes) · LW(p) · GW(p)

Really? I was under the impression that training the whole network with gradient descent was impossible, because the propagated error becomes infinitesimally small. In fact, I thought that training layers individually was the insight that made DNNs possible.

Do you have a link about how they managed to train the whole network?

comment by jkrause · 2015-02-26T21:09:26.200Z · score: 5 (5 votes) · LW(p) · GW(p)

That was indeed one of the hypotheses about why it was difficult to train the networks - the vanishing gradient problem. In retrospect, one of the main reasons why this happened was the use of saturating nonlinearities in the network -- nonlinearities like the logistic function or tanh which asymptote at 1. Because they asymptote, their derivatives always end up being really small, and the deeper your network the more this effect compounds. The first large-scale network that fixed this was by Krizhevsky et al., which used a Rectified Linear Unit (ReLU) for their nonlinearity, given by f(x) = max(0, x). The earliest reference I can find to using ReLUs is Jarrett et al., but since Krizhevsky's result pretty much everyone uses ReLUs (or some variant thereof). In fact, the first result I've seen showing that logistic/tanh nonlinearities can work is the batch normalization paper Sean_o_h linked, which gets around the problem by normalizing the input to the nonlinearity, which presumably prevents the units from saturating too much (though this is still an open question).
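The compounding effect can be seen with a few lines of arithmetic. The sketch below is a deliberate simplification: it assumes every layer sees the same pre-activation value and fixes all weights at 1, so the backpropagated signal is just the local derivative raised to the power of the depth. The function names and the chosen input of 2.0 are made up for illustration.

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic function; it never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of f(x) = max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, pre_activation, depth):
    """Product of `depth` identical local derivatives (weights fixed at 1 for simplicity)."""
    return grad_fn(pre_activation) ** depth

for depth in (1, 5, 20):
    print(
        depth,
        chained_gradient(sigmoid_grad, 2.0, depth),
        chained_gradient(relu_grad, 2.0, depth),
    )
```

With the logistic function the product shrinks geometrically with depth, while an active ReLU passes the gradient through unchanged. Real networks also multiply by weight matrices at each layer, so this only shows the contribution of the nonlinearity.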

comment by V_V · 2015-02-27T15:48:40.036Z · score: 0 (0 votes) · LW(p) · GW(p)

I was under the impression that training the whole network with gradient descent was impossible, because the propagated error becomes infinitesimally small.

If you do it naively, yes. But researchers figured out how to attack that problem from multiple angles: from the choice of the non-linear activation function, to specifics of the optimization algorithm, to the random distribution used to sample the initial weights.

Do you have a link about how they managed to train the whole network?

The batch normalization paper cited above is one example of that.

comment by V_V · 2015-02-27T16:13:43.272Z · score: 1 (1 votes) · LW(p) · GW(p)

Regardless, it's amazing how simple DNNs are. People have been working on computer vision and AI for about 60 years, and then a program like this comes along which is only around 500 lines of code, conceptually simple enough to explain to anyone with a reasonable mathematical background, but can nevertheless beat humans at a reasonable range of tasks.

Beware, there is a lot of non-obvious complexity in these models:
"Traditional" machine learning models (e.g. logistic regression, SVM, random forests) have only a few hyperparameters and they are not terribly sensitive to their values, hence you can usually tune them coarsely and quickly.
These fancy deep neural networks can easily have tens, if not hundreds of hyperparameters, and they are often quite sensitive to them. A bad choice can easily make your training procedure quickly stop making progress (insufficient capacity/vanishing gradients) or diverge (exploding gradients) or converge to something which doesn't generalize well on unseen data (overfitting).
Finding a good choice of hyperparameters can really be a non-trivial optimization problem on its own (and a combinatorial one, since many of these hyperparameters are discrete and you can't really expect the model performance to depend monotonically on their values).
Unfortunately, in these DNN papers, especially the "better than humans" ones, hyperparameters values often seem to appear out of nowhere.
There is some research and tools to do that systematically, but it is not often discussed in the papers presenting novel architectures and results.
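As a rough idea of what a systematic search looks like, here is a sketch of plain random search over a mixed discrete and continuous space. This is a swapped-in baseline technique, not necessarily what the linked work does. Everything in it is hypothetical: the search space, the names, and above all `toy_validation_error`, which stands in for an expensive training run and has a made-up optimum.

```python
import math
import random

# Hypothetical search space mixing continuous and discrete hyperparameters.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-5, -1),  # log-uniform sample
    "num_layers": lambda rng: rng.choice([2, 3, 4, 5]),
    "activation": lambda rng: rng.choice(["relu", "tanh", "sigmoid"]),
    "batch_size": lambda rng: rng.choice([16, 32, 64, 128]),
}

def toy_validation_error(config):
    """Stand-in for training a network and measuring validation error."""
    penalty = abs(math.log10(config["learning_rate"]) + 3)  # pretend 1e-3 is ideal
    penalty += 0.5 * abs(config["num_layers"] - 3)           # pretend 3 layers is ideal
    penalty += 0.0 if config["activation"] == "relu" else 1.0
    return penalty

def random_search(objective, space, trials, seed=0):
    """Sample `trials` configurations independently and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(trials):
        config = {name: sample(rng) for name, sample in space.items()}
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(toy_validation_error, SEARCH_SPACE, trials=50)
```

Each trial here costs microseconds; with a real network each call to the objective is a full training run, which is where the expense comes from.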

comment by skeptical_lurker · 2015-02-28T14:40:12.132Z · score: 1 (1 votes) · LW(p) · GW(p)

SVMs are pretty bad for hyperparameters too; if you want a simple model, use random forests or naive Bayes.

I struggle to see how DNNs can have hundreds of hyperparameters - looking at the code for the paper I linked to, they seem to have learning rate, 2 parameters for simulated annealing, weight cost and batch size. That's 5, not counting a few others which only apply to reinforcement learning DNNs. Admittedly, there is the choice of sigmoid/rectilinear, and of the number of neurons, layers and epochs, but these last few are largely determined by what hardware you have and how much time you are willing to spend training.

Having skimmed the paper you linked to, it seems they have hundreds of parameters because they are using a rather more complex network topology with SVMs fitting the neuron activation to the targets. And that's interesting in itself.

Unfortunately, in these DNN papers, especially the "better than humans" ones, hyperparameters values often seem to appear out of nowhere.

The general problem of hyperparameter values is one of the things that worries me about academia. So you have an effect (p1, which is an improvement I suppose.

Oh, and this paper was published in Nature.

There is some research and tools to do that systematically, but it is not often discussed in the papers presenting novel architectures and results.

I'd be surprised if this could work with DNNs - AFAIK, Monte Carlo optimization, for instance, generally takes thousands of evaluation steps, yet with DNNs each evaluation step would require days of training, so it would require thousands of GPU-days. Indeed, the paper you linked to ran 1200 evaluations, so I'm guessing they had a lot of hardware.

comment by V_V · 2015-03-05T18:17:31.795Z · score: 1 (1 votes) · LW(p) · GW(p)

SVMs are pretty bad for hyperparameters too

How so? A linear SVM's main hyperparameter is the regularization coefficient. There is also the choice of loss and regularization penalty, but these are only a couple of bits.
A non-linear SVM also has the choice of kernel (in practice it's either RBF or polynomial, unless you are working on special types of data such as strings or trees) and one or two kernel hyperparameters.

I struggle to see how DNNs can have hundreds of hyperparameters - looking at the code for the paper I linked to, they seem to have learning rate, 2 parameters for simulated annealing, weight cost and batch size. That's 5, not counting a few others which only apply to reinforcement learning DNNs. Admittedly, there is the choice of sigmoid/rectilinear, and of the number of neurons, layers and epochs,

I haven't read all of the paper, but at a glance you have: the number of convolutional layers, the number of non-convolutional layers, the number of nodes in each non-convolutional layer, and, for each convolutional layer, the number of filters, filter size and stride. There are also 16 other hyperparameters described here.
You could also count the preprocessing strategy.

Other papers have even more hyperparameters (max-pooling layers each with a window size, dropout layers each with a dropout rate, layer-wise regularization coefficients, and so on).
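To show how quickly the count adds up, here is a small tally following the breakdown above. The class names and the layer values are illustrative assumptions rather than figures taken from the paper; only the counting rule (three values per convolutional layer, one per non-convolutional layer, two layer counts, plus the 16 others) comes from the list above.

```python
from dataclasses import dataclass

@dataclass
class ConvLayer:
    num_filters: int
    filter_size: int
    stride: int

@dataclass
class NetworkConfig:
    conv_layers: list
    dense_layer_sizes: list
    other_hyperparameters: int = 16  # the 16 others mentioned above

    def count(self):
        """Tally architectural choices plus the remaining hyperparameters."""
        layer_counts = 2                      # how many conv layers, how many non-conv layers
        per_conv = 3 * len(self.conv_layers)  # filters, filter size, stride for each
        per_dense = len(self.dense_layer_sizes)
        return layer_counts + per_conv + per_dense + self.other_hyperparameters

# Illustrative architecture: three convolutional layers and one hidden dense layer.
config = NetworkConfig(
    conv_layers=[ConvLayer(32, 8, 4), ConvLayer(64, 4, 2), ConvLayer(64, 3, 1)],
    dense_layer_sizes=[512],
)
total = config.count()
```

Even this small network lands at 28 under that rule, before counting preprocessing choices.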

comment by JWonz · 2015-03-08T17:57:18.523Z · score: 0 (0 votes) · LW(p) · GW(p)

FYI - to those who are running the code, the atari ROMs must be named properly otherwise you will hit a segmentation fault. For example, with Breakout name it "breakout.bin".

comment by [deleted] · 2015-02-27T22:09:30.524Z · score: 2 (2 votes) · LW(p) · GW(p)

I'd say whether or not this approach leads to AGI, its main value is in showing how an extremely simple algorithm can produce amazing, unexpected results. That is, the hope and inspiration that comes from it is worth more than the algorithm itself.

I've always believed at bottom level there is a very simple thing going on that gives rise to learning and intelligence. The complexity comes from the large numbers of simple elements operating in parallel.

But the human race has been putting all its resources into the search for complex, intricate solutions.

I used to think Kurzweil was a fool for thinking that once computation is cheap enough (per watt) somehow magically we'd get intelligent machines. But it may just be he'll turn out to be right. Because the breakthrough won't come from the monolithic research teams with unlimited funds at their disposal, it will come from an anonymous hacker just playing around with algorithms on his home supercomputer.

comment by skeptical_lurker · 2015-02-26T06:27:56.117Z · score: 1 (1 votes) · LW(p) · GW(p)

full paper

comment by skeptical_lurker · 2015-02-26T06:31:18.386Z · score: 0 (0 votes) · LW(p) · GW(p)

Ok, I have no idea what the syntax is doing. It works in comments and the sandbox, but not in the discussion post. I have tried escaping characters, doesn't help.

comment by CronoDAS · 2015-02-26T06:33:41.089Z · score: 4 (4 votes) · LW(p) · GW(p)

Posts use straight-up HTML created by an editor, comments use Markdown syntax.

comment by skeptical_lurker · 2015-02-26T06:43:45.412Z · score: 1 (1 votes) · LW(p) · GW(p)

Ok thanks, finally got that working.