[Short version] Information Loss --> Basin flatness
post by Vivek Hebbar (Vivek) · 2022-05-21T12:59:10.932Z · LW · GW
This is an overview for advanced readers. Main post: Information Loss --> Basin flatness [LW · GW]
Summary:
Inductive bias is related to, among other things:
- Basin flatness
- Which solution manifolds (manifolds of zero loss) are higher dimensional than others. This is closely related to "basin flatness", since each dimension of the manifold is a direction of zero curvature.
In relation to basin flatness and manifold dimension:
- It is useful to consider the "behavioral gradients" $\nabla_\theta f(x_i)$ (the gradient of the network's output on input $x_i$ with respect to the parameters $\theta$) for each input.
- Let $G$ be the matrix of behavioral gradients. (The $i^{\text{th}}$ column of $G$ is $\nabla_\theta f(x_i)$.)[1] We can show that $\text{Dim}(\text{solution manifold}) \geq n_{\text{param}} - \text{Rank}(G)$, where $n_{\text{param}}$ is the number of parameters.[2]
- At a local minimum, $\text{Rank}(\text{Hessian}) = \text{Rank}(G)$.[3][4] (A numerical sketch of these rank relations appears after this list.)
- Flat basin ⟺ Low-rank Hessian ⟺ Low-rank $G$ ⟺ High manifold dimension
- High manifold dimension ⟺ Low-rank $G$ ⟺ Linear dependence of behavioral gradients
- A case study in a very small neural network shows that "information loss" is a good qualitative interpretation of this linear dependence.
- Models that throw away enough information about the input in early layers are guaranteed to live on particularly high-dimensional manifolds. Precise bounds seem easily derivable and might be given in a future post.
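The rank relations above can be checked numerically. The sketch below is not from the post: it assumes a tiny tanh network with a scalar output and a sum-of-squared-errors loss (the form for which the constant in footnote 4 comes out as 2), constructs a zero-loss point by taking the targets to be the network's own outputs, and then compares $\text{Rank}(G)$, $\text{Rank}(\text{Hessian})$, and $2GG^T$ against the Hessian. All names and sizes are illustrative.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def net(params, x):
    # Tiny 2-layer network R^2 -> R; a scalar output keeps G a plain matrix.
    W1, b1, w2, b2 = params
    h = jnp.tanh(W1 @ x + b1)
    return w2 @ h + b2

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = (jax.random.normal(k1, (3, 2)), jnp.zeros(3),
          jax.random.normal(k2, (3,)), jnp.array(0.0))
xs = jax.random.normal(k3, (5, 2))            # five training inputs
ys = jax.vmap(lambda x: net(params, x))(xs)   # targets = current outputs, so loss is 0 at `params`

theta0, unravel = ravel_pytree(params)        # flatten the parameters into one vector

def f(theta, x):
    # Network output as a function of the flat parameter vector.
    return net(unravel(theta), x)

def loss(theta):
    # Sum of squared errors (this form is what makes the constant below come out as 2).
    preds = jax.vmap(lambda x: f(theta, x))(xs)
    return jnp.sum((preds - ys) ** 2)

# G: the i-th column is the behavioral gradient grad_theta f(x_i); shape (n_param, n_inputs).
G = jnp.stack([jax.grad(f)(theta0, x) for x in xs], axis=1)
H = jax.hessian(loss)(theta0)

rank = lambda M: int(jnp.sum(jnp.linalg.svd(M, compute_uv=False) > 1e-4))
print("Rank(G)              :", rank(G))
print("Rank(Hessian)        :", rank(H))
print("Hessian ~= 2 G G^T ? :", bool(jnp.allclose(H, 2.0 * G @ G.T, atol=1e-4)))
print("n_param - Rank(G)    :", theta0.size - rank(G))
```

On a run like this the two ranks typically agree (generically one independent behavioral gradient per input), and the last line reports the implied number of flat directions.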
See the main post [LW · GW] for details.
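To illustrate the information-loss reading in the last two bullets of the summary, here is a second sketch (again mine, with a deliberately contrived first layer, not a construction from the post): if the first layer maps two different inputs to the same hidden activation, the corresponding behavioral gradients coincide on every parameter downstream of the collapse, which is exactly the kind of linear dependence that lowers $\text{Rank}(G)$.

```python
import jax
import jax.numpy as jnp

x1 = jnp.array([1.0, 2.0])
x2 = jnp.array([3.0, 2.0])            # differs from x1 only in the first coordinate

# A first layer that deliberately ignores the first coordinate: it maps x1 and x2
# to the same hidden activation, i.e. it throws that information away.
W1 = jnp.array([[0.0, 1.0],
                [0.0, -0.5]])
b1 = jnp.array([0.1, -0.2])
head_params = (jnp.array([0.7, 0.3]), jnp.array(0.0))

def layer1(x):
    return jnp.tanh(W1 @ x + b1)

def f(head_params, x):
    # Output as a function of the *downstream* (head) parameters only.
    v, c = head_params
    return v @ layer1(x) + c

g1 = jax.grad(f)(head_params, x1)     # behavioral gradient for x1, head block only
g2 = jax.grad(f)(head_params, x2)     # behavioral gradient for x2, head block only
same = jax.tree_util.tree_map(lambda a, b: bool(jnp.allclose(a, b)), g1, g2)
print(same)   # both entries True: these two columns of G coincide on the head block
```

Gradients with respect to the first-layer parameters can still differ between $x_1$ and $x_2$, so this only exhibits rank loss in the downstream block of $G$; turning enough such collisions into a bound on the full $\text{Rank}(G)$ is the "precise bounds" question the post defers to future work.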
- ^
In standard terminology, $G^T$ is the Jacobian of the concatenation of all outputs, w.r.t. the parameters.
- ^
- ^
Proof sketch for $\text{Rank}(\text{Hessian}) = \text{Rank}(G)$:
- $\text{Null}(G^T)$ is the set of directions in which the output is not first-order sensitive to parameter change. Its dimensionality is $n_{\text{param}} - \text{Rank}(G)$.
- At a local minimum, first-order sensitivity of behavior translates to second-order sensitivity of loss.
- So $\text{Null}(G^T)$ is the null space of the Hessian.
- So $\text{Rank}(\text{Hessian}) = n_{\text{param}} - \text{Dim}(\text{Null}(G^T)) = \text{Rank}(G)$. (A worked expansion of the second step follows below.)
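A worked version of the second step (my expansion, not the post's, assuming the sum-of-squared-errors loss of footnote 4): at a zero-loss point $\theta^*$, for a parameter direction $v$,

$$\sum_i \big(f(x_i;\, \theta^* + \epsilon v) - y_i\big)^2 = \sum_i \big(\epsilon\, \nabla_\theta f(x_i) \cdot v + O(\epsilon^2)\big)^2 = \epsilon^2\, v^T G G^T v + O(\epsilon^3),$$

so the loss grows at second order along $v$ unless $G^T v = 0$, i.e. unless $v$ is a direction in which the behavior is first-order insensitive.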
- ^
There is an alternate proof [LW(p) · GW(p)] going through the result $\text{Hessian} = 2GG^T$. (The constant 2 depends on MSE loss.)
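For reference, a sketch of that calculation (mine, assuming the sum-of-squared-errors loss $L(\theta) = \sum_i (f(x_i;\theta) - y_i)^2$):

$$\nabla^2_\theta L = 2\sum_i \Big( \nabla_\theta f(x_i)\, \nabla_\theta f(x_i)^T + \big(f(x_i;\theta) - y_i\big)\, \nabla^2_\theta f(x_i) \Big),$$

and at a zero-loss point every residual $f(x_i;\theta) - y_i$ vanishes, leaving $\nabla^2_\theta L = 2GG^T$. Averaging over the dataset instead of summing turns the 2 into $2/m$, which is the sense in which the constant depends on the exact MSE-style loss used.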