(When) do high-dimensional spaces have linear paths down to local minima?

post by Thomas Kwa (thomas-kwa) · 2022-04-22T15:35:55.215Z · LW · GW · 2 comments

This is a question post.


When training (to convergence) with gradient descent in high dimensions, it's common for the loss to decrease monotonically along the linear path between the initialization weights and the local minimum found, even though the actual path followed by training is highly nonlinear. Is this true of high-dimensional spaces in general?
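
For concreteness, here is a minimal sketch of the interpolation experiment being described (essentially the one popularized by Goodfellow et al., "Qualitatively Characterizing Neural Network Optimization Problems"): evaluate the loss at θ(t) = (1−t)·θ_init + t·θ_final for t in [0, 1]. The PyTorch model and the `loss_on_dataset` helper are hypothetical stand-ins.

```python
import copy
import torch

def interpolated_losses(model, theta_init, theta_final, loss_on_dataset, steps=50):
    """Evaluate the loss along the straight line between two weight settings.

    theta_init / theta_final are state_dicts of the same architecture;
    loss_on_dataset(model) -> float is assumed to compute the training loss.
    """
    losses = []
    probe = copy.deepcopy(model)  # don't clobber the original weights
    for t in torch.linspace(0.0, 1.0, steps):
        state = {
            name: (1 - t) * theta_init[name] + t * theta_final[name]
            for name in theta_init
        }
        probe.load_state_dict(state)
        with torch.no_grad():
            losses.append(loss_on_dataset(probe))
    return losses

# The question asks when this sequence is monotonically decreasing:
# all(a >= b for a, b in zip(losses, losses[1:]))
```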

If this doesn't usually happen, what's the underlying fact about the high-dimensional space which determines whether monotonically decreasing loss holds?

Answers

answer by Razied · 2022-04-22T15:54:44.664Z · LW(p) · GW(p)

This is certainly true for all convex loss landscapes in any dimension, almost by definition of "convex". I don't think anyone understands very much about the properties of non-convex high-dimensional loss landscapes, but I can say this about the deep-learning case, where monotonically decreasing loss is observed on the linear path between initialization and the end of training: the weights we obtain when we arbitrarily decide to stop gradient descent aren't anywhere close to a local minimum of the landscape. Essentially all networks of any useful size get stuck at saddle points at best, or we stop training before they even have the chance to get stuck. So it might be that a linear path to the actual minimum would not have monotonic loss at all; it's just that high-dimensional spaces are so mind-bogglingly large that we never explore far enough from the initialization point to get behind a mountain in the landscape, so to speak.
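
A short proof sketch of the convex case (standard, not part of the original answer): restrict the loss to the segment and apply convexity directly.

```latex
Let $f$ be convex with global minimizer $\theta^*$, and define
$g(t) = f\big((1-t)\,\theta_0 + t\,\theta^*\big)$ for $t \in [0,1]$.
For $t_1 < t_2$ in $[0,1]$, write $t_2 = (1-s)\,t_1 + s \cdot 1$
with $s \in (0,1]$. Then
\[
  g(t_2) \;\le\; (1-s)\,g(t_1) + s\,g(1) \;\le\; g(t_1),
\]
using convexity of $g$ and $g(1) = f(\theta^*) \le g(t_1)$.
Hence $g$ is non-increasing on $[0,1]$: the loss along the linear
path to the minimizer never goes uphill.
```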

answer by Dennis Towne · 2022-04-22T17:23:31.045Z · LW(p) · GW(p)

Specifically for protein folding: no, the energy does not decrease monotonically, unless you look at it from such a large distance that you can ignore thermal noise.

Proteins fold in a soup of water and other garbage, and for anything complicated there are going to be a lot of folding steps which are only barely above the thermal noise energy.  Some proteins may even "fold" by performing a near-perfect random walk until they happen to fall into a valley that makes escape unlikely.

There may even be folding steps which are slightly disfavored, i.e. which require energy from the environment. Thermal noise can provide this energy for long enough that a second step can occur, leading to a more stable configuration.
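
To illustrate the mechanism (my sketch, not part of the answer): a minimal Metropolis-style random walk on a hypothetical 1D energy landscape, in which thermal noise lets the walker take occasional uphill ("disfavored") steps yet still settle into a deep valley.

```python
import math
import random

def energy(x):
    # Hypothetical toy landscape: a shallower valley near x = -1,
    # a deeper one near x = +2.
    return 0.5 * (x + 1) ** 2 * (x - 2) ** 2 + 0.3 * x

def metropolis_walk(x=0.0, temperature=0.5, steps=10_000, step_size=0.1):
    """Random walk that accepts uphill moves with Boltzmann probability."""
    for _ in range(steps):
        proposal = x + random.gauss(0.0, step_size)
        dE = energy(proposal) - energy(x)
        # Downhill moves are always accepted; slightly disfavored (uphill)
        # moves are accepted with probability exp(-dE/T), i.e. thermal
        # noise occasionally supplies the required energy.
        if dE <= 0 or random.random() < math.exp(-dE / temperature):
            x = proposal
    return x

# The walk usually ends near one of the valleys, but the energy along
# the trajectory is anything but monotonically decreasing.
```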

comment by tailcalled · 2022-04-22T18:10:17.926Z · LW(p) · GW(p)

The folding steps aren't linear paths though.

2 comments


comment by ACrackedPot · 2022-04-22T15:44:03.532Z · LW(p) · GW(p)

There's a phenomenon in multidimensional motion called "gimbal lock", in which the number of effective dimensions decreases over time under motion, owing to local correlations between the dimensions; I believe it may be relevant here.
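
For readers unfamiliar with the term, a minimal numpy demonstration (my illustration, not the commenter's): at 90° pitch, yaw and roll rotate about the same effective axis, so two distinct Euler-angle settings produce the identical rotation — one dimension of control is lost.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def euler_zyx(yaw, pitch, roll):
    return rot_z(yaw) @ rot_y(pitch) @ rot_x(roll)

# With pitch fixed at 90 degrees, only the difference (roll - yaw)
# matters: these two distinct Euler-angle triples give the same matrix.
deg = np.pi / 180
R1 = euler_zyx(30 * deg, 90 * deg, 0 * deg)
R2 = euler_zyx(0 * deg, 90 * deg, -30 * deg)
assert np.allclose(R1, R2)  # one effective dimension has been lost
```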

comment by Cedar (xida-ren) · 2022-07-07T14:47:52.707Z · LW(p) · GW(p)

https://www.lesswrong.com/posts/Hna2P8gcTyRgNDYBY/race-along-rashomon-ridge [LW · GW]

This feels related. It also talks about paths in parameter space, but instead of linear paths it considers paths between two optimal models that consist entirely of optimal models.