The early bird gets the worm, but the second mouse gets the cheese. (From Steven Pinker, I think, not sure if it's original)
I think the problem with vanishing gradients is usually linked to repeated applications of the sigmoid activation function. The gradient in backpropagation is calculated from the chain rule, where each factor d\sigma/dz in the "chain" is always positive but never larger than 1/4, and close to zero for inputs that are large in either the positive or the negative direction. Multiplying many such factors together shrinks the gradient rapidly with depth. So for feed-forward networks, the problem is a little different from recurrent networks, which you describe.
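As a rough illustration of the sigmoid argument, here is a minimal sketch that multiplies the sigmoid derivative, which lies in (0, 1/4], across a stack of layers. It deliberately ignores the weight factors that also appear in the real chain rule, and the names `sigmoid_grad` and `chain_product` are made up for this example.

```python
import math

def sigmoid(z):
    # Logistic sigmoid; fine for the moderate inputs used here.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # d(sigma)/dz = sigma(z) * (1 - sigma(z)), which peaks at 1/4 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def chain_product(pre_activations):
    # Product of the sigmoid-derivative factors along one path through
    # the network. Weight factors are left out (an assumption made to
    # isolate the effect of the activation function).
    prod = 1.0
    for z in pre_activations:
        prod *= sigmoid_grad(z)
    return prod

# Ten layers, every unit at the steepest point of the sigmoid (z = 0).
best_case = chain_product([0.0] * 10)

# Ten layers, every unit saturated (z = 5).
saturated = chain_product([5.0] * 10)

print(best_case, saturated)
```

Even in the best case the product is 0.25 to the tenth power, roughly one millionth, and saturated units make it far smaller still.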
The usual mitigation is to use ReLU activations, L2 regularization, and/or batch normalization.
A minor point: the gradient doesn't necessarily tend towards zero as you get closer to a local minimum; that depends on the higher order derivatives. Imagine a local minimum at the bottom of a funnel or spike, for instance - or a very spiky fractal-like landscape. On the other hand, a local minimum in a region with a small gradient is desirable, since it means small perturbations in the input data don't change the output much. But this point will be difficult to reach, since learning depends on the gradient...
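To make the funnel picture concrete, here is a small sketch comparing two one-dimensional functions that both have a minimum at x = 0: |x|, chosen here as a stand-in for a funnel, and x^2, a smooth bowl. Both functions and the helper names are picked purely for illustration.

```python
def grad_abs(x):
    # Derivative of |x| away from zero: constant magnitude 1.
    return 1.0 if x > 0 else -1.0

def grad_square(x):
    # Derivative of x**2: shrinks linearly as x approaches the minimum.
    return 2.0 * x

# Points approaching the minimum at x = 0 from the right.
points = [1.0, 0.1, 0.01, 0.001]

abs_grads = [abs(grad_abs(x)) for x in points]
square_grads = [abs(grad_square(x)) for x in points]

print(abs_grads)     # stays at 1.0 however close we get
print(square_grads)  # shrinks towards 0
```

The funnel keeps a gradient of magnitude 1 right up to the minimum, while the bowl's gradient vanishes as you approach it.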
(Thanks for the interesting analysis, I'm happy to discuss this but probably won't drop by regularly to check comments - feel free to email me at ketil at malde point org)