[Short version] Information Loss --> Basin flatness

post by Vivek Hebbar (Vivek) · 2022-05-21T12:59:10.932Z


This is an overview for advanced readers.  Main post: Information Loss --> Basin flatness.

Summary:

Inductive bias is related to, among other things, basin flatness and the dimension of the manifold of solutions.

In relation to basin flatness and manifold dimension:

  1. It is useful to consider the "behavioral gradients" $g_i = \nabla_\theta f_\theta(x_i)$ for each input $x_i$.
  2. Let $G$ be the matrix of behavioral gradients.  (The $i^{\text{th}}$ column of $G$ is $g_i$).[1]  We can show that $\dim(\text{solution manifold}) = p - \operatorname{Rank}(G)$.[2]
  3. $\operatorname{Rank}(\text{Hessian}) = \operatorname{Rank}(G)$.[3][4]
  4. Flat basin $\Leftrightarrow$ Low-rank Hessian $\Leftrightarrow$ Low-rank $G$ $\Leftrightarrow$ High manifold dimension
  5. High manifold dimension $\Leftrightarrow$ Low-rank $G$ $\Leftrightarrow$ Linear dependence of behavioral gradients (a numerical sketch of points 2–5 follows this list)
  6. A case study in a very small neural network shows that "information loss" is a good qualitative interpretation of this linear dependence.
  7. Models that throw away enough information about the input in early layers are guaranteed to live on particularly high-dimensional manifolds.  Precise bounds seem easily derivable and might be given in a future post.
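
To make points 2–5 concrete, here is a minimal numerical sketch (my own illustration, not code from the post): a tiny 1 → 4 → 1 tanh network in JAX, where the targets are defined as the network's own outputs so that the current parameter vector is exactly a zero-loss minimum of a summed-squared-error loss.  The architecture, the function names (`f`, `loss`, `num_rank`), and the zero-loss construction are all assumptions made for this example.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 keeps the numerical ranks clean

P = 13  # p = number of parameters: W1 (4x1) + b1 (4) + W2 (1x4) + b2 (1)

def f(theta, x):
    """Scalar output of a 1 -> 4 -> 1 tanh network, parameters packed in one flat vector."""
    W1, b1 = theta[0:4].reshape(4, 1), theta[4:8]
    W2, b2 = theta[8:12].reshape(1, 4), theta[12]
    h = jnp.tanh(W1 @ x + b1)
    return (W2 @ h)[0] + b2

xs = jnp.array([[-1.0], [0.3], [0.9]])      # n = 3 training inputs
theta = 0.5 * jax.random.normal(jax.random.PRNGKey(0), (P,))
ys = jax.vmap(lambda x: f(theta, x))(xs)    # targets := current outputs, so theta is an exact zero-loss minimum

def loss(t):
    preds = jax.vmap(lambda x: f(t, x))(xs)
    return jnp.sum((preds - ys) ** 2)       # summed squared error

# Behavioral gradients g_i = d f(x_i) / d theta; G has g_i as its i-th column (shape p x n).
G = jax.vmap(lambda x: jax.grad(f)(theta, x))(xs).T
H = jax.hessian(loss)(theta)                # p x p Hessian of the loss at the zero-loss minimum

def num_rank(M, tol=1e-8):
    """Numerical rank = number of singular values above a small tolerance."""
    return int(jnp.sum(jnp.linalg.svd(M, compute_uv=False) > tol))

print("Rank(G) =", num_rank(G), " Rank(Hessian) =", num_rank(H))       # should agree (point 3)
print("flat directions p - Rank(G) =", P - num_rank(G))                # manifold dimension (point 2)
print("max |H - 2 G G^T| =", float(jnp.abs(H - 2.0 * G @ G.T).max()))  # ~0 (footnote 4)
```

Defining the targets as the network's own outputs is just a trick to land exactly on a zero-loss minimum without any training loop, which keeps the numerical check clean.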

See the main post for details.

  1. ^

    In standard terminology, $G^\top$ is the Jacobian of the concatenation of all outputs, w.r.t. the parameters.

  2. ^

    $p$ is the number of parameters in the model.  See claims 1 and 2 in the main post for a proof sketch.

  3. ^

    Proof sketch for $\operatorname{Rank}(\text{Hessian}) = \operatorname{Rank}(G)$:

    • $\operatorname{Null}(G^\top)$ is the set of directions in which the output is not first-order sensitive to parameter change.  Its dimensionality is $p - \operatorname{Rank}(G)$.
    • At a local minimum, first-order sensitivity of behavior translates to second-order sensitivity of loss.
    • So $\operatorname{Null}(G^\top)$ is the null space of the Hessian.
    • So $\operatorname{Rank}(\text{Hessian}) = \operatorname{Rank}(G)$.
  4. ^

    There is an alternate proof going through the result $H = 2GG^\top$, where $H$ is the Hessian of the loss.  (The constant 2 depends on MSE loss.)
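
    For concreteness, here is a quick sketch (my own, assuming the loss is the summed squared error $L(\theta) = \sum_i \big(f_\theta(x_i) - y_i\big)^2$) of where that result comes from:

    $$\nabla^2_\theta L \;=\; 2\sum_i g_i g_i^\top \;+\; 2\sum_i \big(f_\theta(x_i) - y_i\big)\,\nabla^2_\theta f_\theta(x_i)$$

    At a zero-loss minimum every residual $f_\theta(x_i) - y_i$ vanishes, so the second term drops out and $H = 2\sum_i g_i g_i^\top = 2GG^\top$.  (Averaging over datapoints instead of summing would change the constant to $2/n$.)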
