Six (and a half) intuitions for SVD

callummcdougall

post by CallumMcDougall (TheMcDouglas) · 2023-07-04T19:23:19.688Z · LW · GW · 1 comments

  Summary
  Introduction
    Motivation
    Notation
  6 ½ Intuitions
    1. Rotations and Scalings (geometric picture)
    2. Best Low-Rank Approximations
    3. Least Squares Regression
    4. Input and Output Directions (like MLPs!)
    5. Lost & Preserved Information
    6A. Principal Component Analysis
    6B. Information Compression
  Final Thoughts
  Appendix - First-Principles Proof of SVD
  Appendix - Eckart-Young Theorem

The long-awaited^[1] sequel to my "Six (and a half) intuitions for KL divergence" is finally here!

Thanks to the following people for feedback: Denizhan Akar, Rudolf Laine, Simon Lermen, Aryan Bhatt, Spencer Becker-Kahn, Arthur Conmy, and anonymous members of my ARENA cohort (-:

If you want to 80-20 this post (and you already know what SVD is), then just read the "Summary" section below.

Summary

The SVD is a matrix decomposition $M = U S V^{T}$ , where $U$ and $V$ are orthogonal, and $S$ is diagonal (with non-negative diagonal elements, which we call singular values). The six intuitions are:

1. Rotations and Scalings (geometric picture)

Orthogonal matrices $U$ and $V$ can be thought of as rotations of our basis, and $S$ as a scaling of our new basis. These operations are easier to geometrically visualise than the entire operation of a matrix, with all its nuances (e.g. things like shear).

2. Best rank- $k$ Approximations

Truncating the SVD sum gives us in some sense the "best way" to approximate $M$ as a sum of rank-1 matrices. In other words, the first few matrices in the SVD sum are the "most important parts of $M$ ".

3. Least Squares Regression

If we're trying to minimize $∥ M x - b ∥$ (either with a constraint on the norm of $x$ or with no constraint), we can express the solution for $x$ in terms of the SVD of $M$ . When $∥ x ∥$ is constrained, it becomes most important to get the components of $x$ right in the directions of the largest singular values of $M$ ; the smaller singular values matter less.

4. Input and Output Directions (like MLPs!)

An MLP can be thought of as a collection of neurons, where each neuron has an input direction (thought of as detecting a feature) and an output direction (outputting a result if that feature is observed). $v_{i}$ and $u_{i}$ are like the input and output directions (minus the nonlinearity in the middle), and furthermore each "neuron" in this case is completely independent (orthogonal).

5. Lost & Preserved Information

SVD tells you what information (if any) is lost when you use $M$ as a linear map. It also tells you which bits of information are easier / harder to recover.

6A. Principal Component Analysis

If $M$ is actually a matrix of data, we can interpret the SVD matrices in terms of that data. $V$ tells you which features (= linear combinations of data) explain the most variance in the data, $S$ tells you how large this variance is, and $U$ tells you how much each datapoint is exposed to the corresponding feature.

6B. Information Compression

The Fourier transform is a popular choice of basis when compressing things like images (because with a few low-frequency Fourier basis vectors, we can usually reconstruct the image faithfully). We can think of SVD in a similar way, because it gives us the "best basis to use" for reconstructing our image, in some sense.

There are 2 key ideas running through most of these points:

$V$ as input directions and $U$ as the corresponding output directions (i.e. we calculate $M x$ by projecting $x$ onto the input directions, and using these projections as coefficients for our output directions)
The SVD as being a way to efficiently represent the most important parts of the matrix, especially for low-rank matrices.

Introduction

Motivation

I'm currently doing a lot of mechanistic interpretability, and SVD is a key concept when thinking about the operations done inside transformers. Matrices like $W_{O V}$ and $W_{Q K}$ are very large and low-rank, making SVD the perfect tool for interpreting them. Furthermore, we have evidence that the SVD of weight matrices in transformers is highly interpretable, with the principal directions aligning with semantically interpretable directions in the residual stream.

On top of this, I think deeply understanding SVD is crucial for understanding how to think about matrices in general, and although there's a lot of valuable stuff online, it's not been brought together into a single resource. I hope I can do something similar with this post as I did with my KL divergence post.

Notation

Consider a matrix $M$ , with size $(m, n)$ . The singular value decomposition (SVD) is $M = U S V^{T}$ , where:

The columns of $U$ and $V$ are orthogonal unit vectors,
$S$ is a diagonal matrix with elements $(σ_{1}, \dots, σ_{r})$ where $r$ is the rank of matrix $M$ , and the singular values $σ_{i}$ are positive & in decreasing order: $σ_{1} \geq σ_{2} \geq \dots \geq σ_{r} > 0$ .

There are a few different conventions when it comes to SVD. Sometimes it's written with $U, S, V$ as having sizes $(m, m), (m, n), (n, n)$ respectively (in other words we pad $S$ with zeros, and fill out $U$ and $V$ with a complete basis). Alternatively, $U, S, V^{T}$ can be written with shapes $(m, r), (r, r), (r, n)$ , in other words the matrix $S$ has no zero diagonal elements. I'll most often use the second convention, but there are times when I'll use the first (it should be clear which one I'm using at any given time).

Lastly, note that we can also write $M$ as follows:

$M = \sum_{i = 1}^{r} σ_{i} u_{i} v_{i}^{T}$

in other words as a sum of rank-1 matrices, scaled by the singular values. I claim this is the most natural way to think about SVD, and it's the form I'll be using for most of the rest of the post. For convenience, I'll refer to this as the "SVD sum".
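To make the SVD sum concrete, here's a minimal NumPy sketch. The matrix `M` is just an arbitrary random example (an assumption of this sketch, not something from the rest of the post). Note that `np.linalg.svd` returns $V^{T}$ rather than $V$ , and the singular values as a vector rather than a diagonal matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3))  # arbitrary example matrix with m=5, n=3

# Reduced SVD: U is (5, 3), S is a length-3 vector, Vt is V^T with shape (3, 3)
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Rebuild M as a sum of rank-1 matrices sigma_i * u_i * v_i^T
M_sum = sum(S[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(S)))

print(np.allclose(M, M_sum))            # does the SVD sum reproduce M?
print(np.allclose(U.T @ U, np.eye(3)))  # are the columns of U orthonormal?
print(np.all(np.diff(S) <= 0))          # are the singular values in decreasing order?
```

Each term `np.outer(U[:, i], Vt[i, :])` is one rank-1 matrix $u_{i} v_{i}^{T}$ , so the loop is literally the SVD sum.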

6 ½ Intuitions

Note that there's a lot of overlap between some of these points, and some of them cut a lot closer to the "core of SVD" than others. You might say that they're . . . not linearly independent, and have different singular values. (I make no apologies for that pun.)

1. Rotations and Scalings (geometric picture)

Orthogonal matrices $U$ and $V$ can be thought of as rotations of our basis, and $S$ as a scaling of our new basis. These operations are easier to geometrically visualise than the entire operation of a matrix, with all its nuances (e.g. things like shear).

$U$ and $V$ are orthogonal matrices, meaning that their column vectors $u_{i}$ and $v_{i}$ are orthogonal unit vectors. We can think of them as rotations in high-dimensional space (in fact, any orthogonal matrix can be formed via a series of rotations, and possibly a reflection). $S$ is a diagonal matrix, which means it scales input along the standard basis.

The key point here is that we're taking $M$ , a general linear operation which it's hard to get geometric intuition for, and breaking it up into a series of operations which are much easier to visualise. Let's take possibly the simplest example of a non-trivial matrix operation: a shear.

The illustration below shows how this can be broken down as a rotation, scaling along the standard basis, then another rotation. See here for the animated version.
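The same breakdown can be sketched numerically. The snippet below takes the shear $(x, y) \to (x + y, y)$ as an assumed example and applies the three steps one at a time. Note that the orthogonal factors NumPy returns may include a reflection as well as a rotation.

```python
import numpy as np

# A horizontal shear: sends (x, y) to (x + y, y)
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])

U, S, Vt = np.linalg.svd(shear)

# Apply the three steps one at a time to a test vector
x = np.array([0.5, 2.0])
step1 = Vt @ x      # first orthogonal map (rotation, possibly with a reflection)
step2 = S * step1   # scale along the standard axes
step3 = U @ step2   # second orthogonal map

print(np.allclose(step3, shear @ x))   # the three steps compose to the shear
print(S)                               # the two scale factors
print(np.isclose(S[0] * S[1], 1.0))    # a shear preserves area, so they multiply to 1
```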

2. Best Low-Rank Approximations

Truncating the SVD sum gives us in some sense the "best way" to approximate $M$ as a sum of rank-1 matrices. In other words, the first few matrices in the SVD sum are the "most important parts of $M$ ".

One natural way to formalise "best approximation" $M_{k}$ would be the matrix which minimises the value $∥ M - M_{k} ∥$ , where $∥ \cdot ∥$ is some reasonable choice for the norm of a matrix. For example, you might use:

The Spectral norm, $∥ A ∥_{2} = \max_{∥ x ∥_{2} \leq 1} ∥ A x ∥_{2}$
The Frobenius norm, $∥ A ∥_{F} = \sqrt{\sum_{i j} A_{i j}^{2}}$

As it happens, the choice $M_{k} = \sum_{i = 1}^{k} σ_{i} u_{i} v_{i}^{T}$ minimises the residual $∥ M - M_{k} ∥$ for both of these norms (subject to the restriction that $M_{k}$ can have at most rank $k$ ). This is called the Eckart-Young Theorem.^[2] You can find a sketch of both proofs in the appendix.

Note - the proof relies heavily on the following lemmas:

The spectral norm of a matrix is its largest singular value.
The squared Frobenius norm of a matrix equals the sum of its squared singular values.

This actually hints at an important meta point in linear algebra - important concepts like Frobenius norm, trace, determinant, etc. often make a lot more sense when they're defined in terms of the matrix when viewed as a linear map^[3], rather than when viewed as a grid of numbers. In this case, defining the Frobenius norm as the sum of squared singular values in SVD is a lot more natural than describing it as the sum of squared elements (it's arguably easier to see how the former definition captures some notion of the "size" of the matrix). To give another example, it's often more natural to describe the trace as the sum of eigenvalues than the sum of diagonal elements (it's quite easy to prove the latter if you start from the former). For more on this meta point, see this section of Neel Nanda's recorded linear algebra talk.

Key idea - the singular vectors corresponding to the largest singular values are the best way of efficiently capturing what the matrix $M$ is actually doing. If you capture most of the large singular values, then you've explained most of the operation of matrix $M$ (the residual linear transformation is pretty small).
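Here's a small numerical sanity check of the truncation result, as a sketch on an arbitrary random matrix (the sizes and the choice `k = 2` are assumptions for illustration). The comparison against a random rank- $k$ matrix is only a spot check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 6))
U, S, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
# Keep only the first k terms of the SVD sum
M_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

spectral_residual = np.linalg.norm(M - M_k, ord=2)
frobenius_residual = np.linalg.norm(M - M_k, ord="fro")

# Spectral residual is sigma_{k+1} (index k, since Python counts from 0)
print(np.isclose(spectral_residual, S[k]))
# Squared Frobenius residual is the sum of the remaining squared singular values
print(np.isclose(frobenius_residual**2, np.sum(S[k:] ** 2)))

# A random rank-k matrix should do no better
A = rng.normal(size=(8, k)) @ rng.normal(size=(k, 6))
print(np.linalg.norm(M - A, ord="fro") >= frobenius_residual)
```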

3. Least Squares Regression

If we're trying to minimize $∥ M x - b ∥$ (either with a constraint on the norm of $x$ or with no constraint), we can express the solution for $x$ in terms of the SVD of $M$ . When $∥ x ∥$ is constrained, it becomes most important to get the components of $x$ right in the directions of the largest singular values of $M$ ; the smaller singular values matter less.

Firstly, let's take the least squares expression:

$\min_{x} ∥ M x - b ∥_{2}^{2}$

and substitute in the singular value decomposition of $M$ . The $L_{2}$ norm of a vector is unchanged by orthogonal maps, so (multiplying through by $U^{T}$ ):

$∥ M x - b ∥_{2} = ∥ U S V^{T} x - b ∥_{2} = ∥ S x_{v} - b_{u} ∥_{2}$

where:

$x_{v} = V^{T} x$ are the components of $x$ in the basis created from the columns of $V$
$b_{u} = U^{T} b$ are the components of $b$ in the basis created by columns of $U$

When written in this form, we can read off a closed-form expression for the solution:

$x_{v_{i}} = \begin{cases} b_{u_{i}} / σ_{i} & \text{if } σ_{i} > 0 \\ \text{doesn't matter} & \text{else} \end{cases}$

where $x_{v_{i}} = (x_{v})_{i}$ is the component of $x$ along the $i$ -th column of $V$ , and $b_{u_{i}} = (b_{u})_{i}$ is the component of $b$ along the $i$ -th column of $U$ . This result suggests the following terminology^[4], which we'll use from here on out:

the columns of $V$ are the input directions of the matrix $M$ ,
the columns of $U$ are the corresponding output directions.

The problem of least squares regression then reduces to a simple one: make sure the components of $x$ along the input directions match up with the corresponding target output directions.
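As a sketch of how this plays out in code, the snippet below solves a small unconstrained least squares problem by matching components along the input and output directions, then compares against `np.linalg.lstsq`. The matrix and target are random examples, and `M` is assumed to have full column rank so that every $σ_{i} > 0$ .

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 3))   # tall random matrix, assumed to have full column rank
b = rng.normal(size=6)

U, S, Vt = np.linalg.svd(M, full_matrices=False)

b_u = U.T @ b      # components of b along the output directions (columns of U)
x_v = b_u / S      # match each component: x_{v_i} = b_{u_i} / sigma_i
x = Vt.T @ x_v     # change back from the V basis to the standard basis

x_ref, *_ = np.linalg.lstsq(M, b, rcond=None)
print(np.allclose(x, x_ref))
```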

What about constrained optimization? Suppose we were trying to minimize $∥ M x - b ∥_{2}^{2}$ subject to the restriction $∥ x ∥_{2} \leq m$ . We can write the solution in this case as $x_{v_{i}} = \frac{1}{σ_{i} + λ / σ_{i}} b_{u_{i}}$ , where $λ$ is the smallest possible non-negative real number s.t. $∥ x ∥_{2} \leq m$ .^[5] Note that, the larger the singular values $σ_{i}$ are, the closer our coefficient $\frac{1}{σ_{i} + λ / σ_{i}}$ is to the "unconstrained optimal value" of $1 / σ_{i}$ . In other words, the larger singular values are more important, so in a constrained optimization we care more about the components of $x$ along the more important input directions $v_{i}$ .^[6]
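For a fixed $λ$ , the coefficient formula above can be checked against the familiar ridge-regression solution $(M^{T} M + λ I)^{-1} M^{T} b$ . This is only a sketch: the value `lam = 0.5` is arbitrary, and the search for the smallest $λ$ that satisfies the norm constraint isn't implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 3))
b = rng.normal(size=6)
lam = 0.5  # arbitrary illustrative value of lambda

U, S, Vt = np.linalg.svd(M, full_matrices=False)
b_u = U.T @ b

# Coefficients from the text: x_{v_i} = b_{u_i} / (sigma_i + lambda / sigma_i)
x_v = b_u / (S + lam / S)
x = Vt.T @ x_v

# Ridge solution computed directly from the regularised normal equations
x_ridge = np.linalg.solve(M.T @ M + lam * np.eye(3), M.T @ b)
print(np.allclose(x, x_ridge))

# Ratio of each coefficient to its unconstrained value 1 / sigma_i.
# It is closest to 1 for the largest singular value.
shrink = (1 / (S + lam / S)) / (1 / S)
print(shrink)

# The regularised solution has a smaller norm than the unconstrained one
x_unconstrained = Vt.T @ (b_u / S)
print(np.linalg.norm(x) < np.linalg.norm(x_unconstrained))
```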

A general point here - least squares isn't an easy problem to solve in general, unless we have SVD - then it becomes trivial! This is a pretty clear sign that SVD is in some sense the natural form to write a matrix in.

4. Input and Output Directions (like MLPs!)

An MLP can be thought of as a collection of neurons, where each neuron has an input direction (thought of as detecting a feature) and an output direction (outputting a result if that feature is observed). $v_{i}$ and $u_{i}$ are like the input and output directions (minus the nonlinearity in the middle), and furthermore each "neuron" in this case is completely independent (orthogonal).

As we touched on in the previous point, the columns of $V$ can be thought of as input directions for $x$ , and the columns of $U$ are the output directions. This is actually quite similar to how MLPs work! A simple MLP (ignoring biases) is structured like this:

$\text{MLP} (x) = W^{out} f (W^{in} x)$

where $f$ is a nonlinear function which acts element-wise (e.g. ReLU) and $W^{in}$ , $W^{out}$ are the input and output weight matrices respectively. We can write this as a sum over neurons:

$\text{MLP} (x) = \sum_{i = 1}^{d_{mlp}} f (W_{[i, :]}^{in} \cdot x) W_{[:, i]}^{out}$

in other words, each neuron $i$ has an associated input direction $W_{[i, :]}^{in}$ and an output direction $W_{[:, i]}^{out}$ , and we get the output of the MLP by projecting $x$ along the input direction, ReLUing the results, and using this as our coefficient for the output vector.

Compare this to SVD. We can write $M = U S V^{T} = \sum_{i = 1}^{r} σ_{i} u_{i} v_{i}^{T}$ , so we have:

$M x = \left( \sum_{i = 1}^{r} σ_{i} u_{i} v_{i}^{T} \right) x = \sum_{i = 1}^{r} (σ_{i} v_{i}^{T} x) u_{i}$

in other words, we calculate the output of $x$ when put through the linear map $M$ by projecting it along each of the input directions $v_{i}$ , multiplying by scale factors $σ_{i}$ , and using this as our coefficient for the output vector $u_{i}$ .
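The "neuron-like" reading of $M x$ can be written as an explicit loop. This is a minimal sketch on a random matrix (the shapes are arbitrary), with one "linear neuron" per singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 6))
x = rng.normal(size=6)

U, S, Vt = np.linalg.svd(M, full_matrices=False)

out = np.zeros(4)
# Rows of U.T are the output directions u_i; rows of Vt are the input directions v_i
for sigma_i, u_i, v_i in zip(S, U.T, Vt):
    activation = sigma_i * (v_i @ x)   # project x onto input direction v_i, then scale
    out += activation * u_i            # write the result along output direction u_i

print(np.allclose(out, M @ x))
```

Swapping `sigma_i * (v_i @ x)` for a nonlinearity applied to a projection is what would turn this loop into the MLP sum above.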

The main differences between SVD in this form and MLPs are:

MLPs are nonlinear thanks to their ReLU function. SVD is entirely linear.
In MLPs, it's common to have more neurons than dimensions of the input (e.g. in transformers, we usually have 4x more). This means some pairs of neurons are certain to have non-orthogonal input or output directions. In contrast, not only does SVD have $r \leq dim (x)$ , but every pair of input directions (and every pair of output directions) is guaranteed to be orthogonal. Furthermore, if most singular values are zero (as is the case for large low-rank matrices like $W_{O V}$ ), then $r$ will be much smaller than $dim (x)$ .

These two points help explain why we might expect the SVD of the transformation matrices $W_{O V}$ to be highly interpretable. Note that we can also view SVD as a way of trying to tackle the "lack of privileged basis" problem - just because the standard basis isn't privileged doesn't mean there can't exist directions in the space which are more meaningful than others, and SVD can be thought of as a way to find them.

5. Lost & Preserved Information

SVD tells you what information (if any) is lost when you use $M$ as a linear map. It also tells you which bits of information are easier / harder to recover.

For any vector $x$ , we can write $x = V y = \sum_{i} y_{i} v_{i}$ (where $v_{i}$ are the columns of $V$ ). Then, we have:

$M x = U S V^{T} x = U S V^{T} V y = U S y = \sum_{i} σ_{i} y_{i} u_{i}$

So the singular values $σ_{i}$ tell how much we scale the component of $x$ in the $i$ -th input direction $v_{i}$ . If $σ_{i} = 0$ then that component of $x$ gets deleted. If $σ_{i}$ is very close to zero, then that information gets sent to very-near-zero, meaning it's harder to recover in some sense.

This is why doing line plots of the spectra for transformer weight matrices can be quite informative. Often, the largest singular values will dominate, and the rest of them will be pretty small. Take the example below, of a matrix of size (1024, 768). Even though the rank of the matrix is technically 768, we can see from the singular values that the matrix is "approximately singular" after a much smaller number of singular values.
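One way to put a number on "approximately singular" is to count how many singular values are needed to capture most of $\sum_{i} σ_{i}^{2}$ . The sketch below does this for a synthetic matrix, which is NOT a real transformer weight matrix: it's a strong rank-20 part plus small noise. The helper `effective_rank` and its 99% threshold are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a weight matrix: a strong rank-20 part plus small noise
low_rank = rng.normal(size=(1024, 20)) @ rng.normal(size=(20, 768))
M = low_rank + 0.01 * rng.normal(size=(1024, 768))

S = np.linalg.svd(M, compute_uv=False)  # singular values only, in decreasing order

def effective_rank(S, frac=0.99):
    """Smallest k such that the top-k singular values hold `frac` of sum(sigma_i^2)."""
    energy = np.cumsum(S**2) / np.sum(S**2)
    return int(np.searchsorted(energy, frac) + 1)

# Numerical rank, using the same style of tolerance as np.linalg.matrix_rank
exact_rank = int(np.sum(S > S[0] * max(M.shape) * np.finfo(float).eps))

print(exact_rank)          # technically (almost) full rank because of the noise
print(effective_rank(S))   # far fewer directions carry almost all of the "size"
```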

Another way of describing this concept is with pseudo-inverses. We say that matrix $N$ is a left-inverse of $M$ if its shape is the transpose of $M$ , and $N M = I$ . If this is impossible (e.g. $M$ has size $(m, n)$ with $m < n$ ) then we can still choose $N$ to get as close as possible to this:

$N M = \begin{bmatrix} I_{r} & 0 \\ 0 & 0 \end{bmatrix}$

In this case, we call $N$ the "pseudo left-inverse" of $M$ .

What does this look like in SVD? If $M = U S V^{T}$ (where $S$ is the version with all positive diagonal values), then we have $N = V S^{- 1} U^{T}$ as our pseudo left-inverse. We can see that, for singular values $σ_{i}$ close to zero, $N$ will be in danger of blowing up.
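Here's a sketch of the pseudo left-inverse for a rank-deficient matrix (a random rank-2 example, chosen arbitrarily). It checks the formula $N = V S^{- 1} U^{T}$ against NumPy's `np.linalg.pinv`, with an explicit cutoff `rcond=1e-10` so the numerically-zero singular values are discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 2
# A (5, 4) matrix of rank 2, so no exact left-inverse exists
M = rng.normal(size=(5, r)) @ rng.normal(size=(r, 4))

U, S, Vt = np.linalg.svd(M, full_matrices=False)
U_r, S_r, Vt_r = U[:, :r], S[:r], Vt[:r, :]   # keep only the positive singular values

N = Vt_r.T @ np.diag(1 / S_r) @ U_r.T          # N = V S^{-1} U^T

# Matches NumPy's Moore-Penrose pseudo-inverse
print(np.allclose(N, np.linalg.pinv(M, rcond=1e-10)))
# Written in the basis of the first r input directions, N M is the identity
print(np.allclose(Vt_r @ (N @ M) @ Vt_r.T, np.eye(r)))
# In the standard basis, N M is the orthogonal projection onto span(v_1, ..., v_r)
print(np.allclose(N @ M, Vt_r.T @ Vt_r))
```

The factor `1 / S_r` is where the "blowing up" happens: any $σ_{i}$ close to zero gets turned into a very large entry of $N$ .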

6A. Principal Component Analysis

If $M$ is actually a matrix of data, we can interpret the SVD matrices in terms of that data. $V$ tells you which features (= linear combinations of data) explain the most variance in the data, $S$ tells you how large this variance is, and $U$ tells you how much each datapoint is exposed to the corresponding feature.

Suppose $M$ is a matrix of (centered) data, with size $(n, p)$ - i.e. there are $n$ datapoints, and each datapoint has $p$ features. The rows are the datapoints, the columns are the feature vectors. The empirical covariance matrix is given by $Σ = \frac{1}{n - 1} M^{T} M$ , i.e. $Σ_{i j}$ is the estimated covariance of features $i$ and $j$ in the data. When writing this in SVD, we get:

$Σ = \frac{1}{n - 1} (U S V^{T})^{T} (U S V^{T}) = \frac{1}{n - 1} V S^{2} V^{T}$

This is just $S^{2}$ (scaled by $\frac{1}{n - 1}$ ) with respect to the basis of $V$ . Conclusion - the columns of $V$ (which we also call the principal components) are the directions in feature space which have the highest variance, and the (scaled) squared singular values are that variance. Also, note that $V^{T} Σ V$ is a diagonal matrix (with diagonal entries $\frac{σ_{i}^{2}}{n - 1}$ ); this tells us that the "singular features" found in $V$ have zero covariance, i.e. they are uncorrelated.

How does $U$ fit in here? Well, $u_{i} = \frac{1}{σ_{i}} M v_{i}$ , so each element of the vector $u_{i}$ is the dot product of a row of data with the feature loadings for our $i$ -th "singular feature" (scaled by the standard deviation of that feature). From here, it's not a big leap to see that the $i$ -th column of $U$ is the exposure of each datapoint in our matrix to the $i$ -th singular feature.

Note that SVD gives us strictly more information than PCA, because PCA gives us the matrix $Σ$ but not $U$ . This is another illustration of the "SVD is the natural matrix form" idea - when you put a matrix into SVD, other things fall out!
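The relationship between the SVD of the data matrix and its covariance matrix can be checked directly. The data below is synthetic (an arbitrary linear mixing of Gaussian noise, assumed purely for illustration), with rows as datapoints and columns as features.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
# Correlated synthetic data: rows are datapoints, columns are features
data = rng.normal(size=(n, p)) @ np.array([[2.0, 0.0, 0.0],
                                           [1.0, 1.0, 0.0],
                                           [0.0, 0.5, 0.2]])
M = data - data.mean(axis=0)             # centre each feature

U, S, Vt = np.linalg.svd(M, full_matrices=False)
cov = M.T @ M / (n - 1)                  # empirical covariance matrix

# Eigenvalues of the covariance matrix are sigma_i^2 / (n - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order, so reverse
print(np.allclose(eigvals, S**2 / (n - 1)))

# V^T Sigma V is diagonal: the "singular features" are uncorrelated
print(np.allclose(Vt @ cov @ Vt.T, np.diag(S**2 / (n - 1))))

# Column i of U, scaled by sigma_i, is the projection of each datapoint onto v_i
print(np.allclose(U * S, M @ Vt.T))
```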

6B. Information Compression

The Fourier transform is a popular choice of basis when compressing things like images (because with a few low-frequency Fourier basis vectors, we can usually reconstruct the image faithfully). We can think of SVD in a similar way, because it gives us the "best basis to use" for reconstructing our image, in some sense.

Suppose we wanted to transmit an image with perfect fidelity. This requires sending $O (width \times height)$ information (the number of pixels). A more common strategy is to take the discrete Fourier transform of an image, and then only send the first few frequencies. This is effective for 2 main reasons:

The Fourier transform is computationally efficient to calculate,
Most images are generally quite continuous, and so low-frequency Fourier basis terms work well for reconstructing them.

But what if we didn't care about efficiency of calculation, and instead we only wanted to minimize the amount of information we had to transmit? Would the Fourier transform always be the best choice? Answer - no, the SVD is provably the best choice (subject to some assumptions about how we're quantifying "best", and "information").^[7]

The algorithm is illustrated below. Algebraically, it's the same as the "best rank- $k$ approximation" formula. We flatten every image in our dataset, stack them horizontally, and get a massive matrix of data. We then perform SVD on this massive matrix.

What's interesting about this is that we can gain insight into our data $M$ by examining the matrices $U$ and $V$ . For instance, if we take the first few columns of $U$ (the "output directions") and reshape them into images of shape (width, height), then we get the "eigenvectors^[8] of our images". For images of human faces, these are often called eigenfaces, and for a long time they were used in facial recognition software.

Here are the first 8 eigenfaces of an example faces dataset (link here), i.e. the first 8 columns of $U$ reshaped into images:

First 8 singular directions of eigenfaces

This is pretty cool! We're basically getting versions of the "general shape of a human face". The first few capture broad patterns of shading & basic features, while the later ones capture features such as lips, eyes and shadows in more detail.

If we wanted to compress a face image into a small number of dimensions and transmit it, we might find the projections of our face along the first few "eigenfaces". To make this more concrete, for an image $x$ of shape (width, height), we might flatten this into a vector of length $m$ , then calculate the $k$ -dimensional vector $\hat{U}^{T} x$ (which is equivalent to finding the projections of $x$ along the first $k$ columns of $U$ , which we collect into the matrix $\hat{U}$ ), and then reconstruct by multiplying by $\hat{U}$ .

If we wanted to generate a completely new face from scratch, we could choose a feature vector (i.e. some unit vector in $n$ -dimensional space), and then map it through $\hat{M} = \hat{U} \hat{S} \hat{V}^{T}$ (the SVD sum truncated to its first $k$ terms). This would give you a face which has "exposure to the $i$ -th eigenface" equal to the $i$ -th element of your chosen feature vector.
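The compress-and-reconstruct step can be sketched as follows. There's no real face data here: the "images" are synthetic columns built from 5 hidden patterns plus a little noise, and all the sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 64, 100, 5   # m pixels per flattened image, n images, keep k directions

# Synthetic stand-in for an image dataset: each column is a flattened "image"
patterns = rng.normal(size=(m, 5))
M = patterns @ rng.normal(size=(5, n)) + 0.01 * rng.normal(size=(m, n))

U, S, Vt = np.linalg.svd(M, full_matrices=False)
U_k = U[:, :k]            # first k columns of U, the analogue of the eigenfaces

x = M[:, 0]               # one flattened image, a vector of length m
code = U_k.T @ x          # k numbers to transmit instead of m
x_rec = U_k @ code        # reconstruction from the k coefficients

rel_error = np.linalg.norm(x - x_rec) / np.linalg.norm(x)
print(code.shape, rel_error)
```

Because the synthetic data really is close to rank 5, keeping `k = 5` directions reconstructs the image almost perfectly. Real images would need more directions for the same fidelity.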

Final Thoughts

Recapping these, we find that the SVD:

Is a decomposition of complicated linear operations into simpler components (rotations and scalings),
Allows us to best approximate a matrix with one of smaller rank,
Is the natural way to express solutions to least-squares type equations,
Gives us a set of independent input and output directions which fully describe the linear transformation,
Tells us what information gets lost and what gets preserved by the linear transformation,
Has a natural interpretation when our matrix is a data matrix (for example, when each datapoint is a flattened image - eigenfaces!).

Appendix - First-Principles Proof of SVD

First, a quick rant. It bugs me how almost all the proofs of SVD use the spectral theorem or some variant. This seems like massive overkill to me, when there's actually a very elegant proof which just uses some basic calculus, and also gets to the essence of SVD in a way that the spectral theorem-based proofs just don't. For that reason, I'm including this proof in the post.

Sketch of proof

Our proof involves choosing $(σ_{i}, u_{i}, v_{i})$ sequentially, until we've spanned all of $M$ . At each step, we find unit vector $v_{i}$ to maximize $∥ M v_{i} ∥$ , subject to $v_{i}$ being orthogonal to our previously chosen vectors. Then we define $σ_{i} = ∥ M v_{i} ∥$ and $u_{i} = \frac{1}{σ_{i}} M v_{i}$ . The only non-trivial part of our proof will be showing that $u_{i}$ are orthogonal to each other. This will involve a short geometric argument.

Actual proof

We'll sequentially choose $(σ_{i}, u_{i}, v_{i})$ , using the following algorithm:

We define $v_{i} = \arg\max_{∥ v ∥ = 1} ∥ M v ∥$ subject to the restriction $v_{i} ⊥ v_{j}$ for all $j < i$
We define $σ_{i} = ∥ M v_{i} ∥$ and $u_{i} = \frac{1}{σ_{i}} M v_{i}$ .

Most of the properties of SVD are already proved from this algorithm. By our definition, $v_{i}$ are orthogonal unit vectors, $u_{i}$ are unit vectors, and $σ_{i}$ are strictly positive & non-increasing (because each $v_{i}$ is chosen with more restrictions than the previous one). The algorithm terminates when $M v_{i} = 0$ for all possible choices of $v_{i}$ , at which point the $v$ -vectors we've chosen so far span every direction in the domain that $M$ doesn't send to zero, and we're done. The only thing left is to show that the $u$ -vectors are orthogonal.

Suppose $j < i$ , and so $σ_{j} \geq σ_{i} > 0$ . We can define the function $f (θ) = ∥ M v_{j} \cos θ + M v_{i} \sin θ ∥^{2}$ . We know that $v_{j}$ was chosen to maximise $∥ M v_{j} ∥$ subject to orthogonality with the earlier $v$ -vectors, which means (since $v_{i}$ is also orthogonal to those earlier $v$ -vectors) that $θ = 0$ must be a stationary point of the function $f$ . But if we Taylor-expand $f (θ)$ around $θ = 0$ , we get:

$f (θ) = ∥ M v_{j} \cos θ + M v_{i} \sin θ ∥^{2} = ∥ M v_{j} ∥^{2} + 2 (M v_{i})^{T} (M v_{j}) θ + O (θ^{2})$

and so $f^{'} (0) = 0$ implies $(M v_{i})^{T} (M v_{j}) = 0$ , hence $u_{i}^{T} u_{j} = 0$ , as required.

The image below shows the geometric intuition for this proof. If $M v_{i}$ and $M v_{j}$ weren't orthogonal, then we'd be able to define $v_{j}^{'}$ by rotating $v_{j}$ a small amount towards (or away from) $v_{i}$ , and this would result in a larger vector $M v_{j}^{'}$ (contradiction, since $v_{j}$ was chosen to maximise $∥ M v_{j} ∥$ ).
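The greedy construction in this proof can be turned into a (slow, purely illustrative) algorithm. The proof doesn't say how to carry out each maximisation, so this sketch swaps in projected power iteration on $M^{T} M$ for that step. The test matrix is built to have known, well-separated singular values and full column rank, which is an assumption that keeps the iteration well-behaved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Test matrix with known singular values 5, 3, 1.5, 0.5
Q1, _ = np.linalg.qr(rng.normal(size=(6, 4)))
Q2, _ = np.linalg.qr(rng.normal(size=(4, 4)))
M = Q1 @ np.diag([5.0, 3.0, 1.5, 0.5]) @ Q2.T

def greedy_svd(M, n_iters=500):
    """Choose v_1, v_2, ... one at a time, each (approximately) maximising ||M v||
    over unit vectors orthogonal to the earlier choices.
    Assumes M has full column rank."""
    n = M.shape[1]
    V = np.zeros((n, 0))
    sigmas = []
    for _ in range(n):
        v = rng.normal(size=n)
        for _ in range(n_iters):
            v = v - V @ (V.T @ v)     # stay orthogonal to earlier choices
            v = M.T @ (M @ v)         # power-iteration step, increases ||M v|| / ||v||
            v = v - V @ (V.T @ v)     # re-project to remove rounding drift
            v /= np.linalg.norm(v)
        V = np.column_stack([V, v])
        sigmas.append(np.linalg.norm(M @ v))   # sigma_i = ||M v_i||
    sigmas = np.array(sigmas)
    U = (M @ V) / sigmas              # u_i = M v_i / sigma_i, column by column
    return U, sigmas, V

U, sigmas, V = greedy_svd(M)
print(sigmas)
print(np.allclose(U.T @ U, np.eye(4)))             # the u-vectors come out orthogonal
print(np.allclose(U @ np.diag(sigmas) @ V.T, M))   # and U S V^T reconstructs M
```

Notice that nothing in the code forces the columns of `U` to be orthogonal. That falls out of the maximisation, which is exactly the content of the proof.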

Appendix - Eckart-Young Theorem

Firstly, let's prove the two lemmas.

The squared Frobenius norm of a matrix equals the sum of its squared singular values.

Sketch of proof - left-multiplying $M$ by an orthogonal matrix is equivalent to doing a unitary operation on the columns of $M$ . Since the squared Frobenius norm is the sum of squared $L_{2}$ norms of columns, and unitary operations don't change the $L_{2}$ norm, $M$ and $U^{T} M = S V^{T}$ must have the same Frobenius norm. A similar argument (rows rather than cols) shows that we can right-multiply by an orthogonal matrix without changing the Frobenius norm. So $M$ and $U^{T} M V = S$ have the same Frobenius norm. But $S$ is a diagonal matrix with diagonal elements equal to the singular values, so $∥ S ∥_{F}^{2}$ is clearly the sum of squared singular values.

The spectral norm of a matrix is its largest singular value.

This follows directly from the way we proved the SVD - we chose $v_{1}$ to maximize $∥ M v ∥$ over all possible vectors $v$ of unit norm.
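Both lemmas are easy to spot-check numerically. This is a sketch on an arbitrary random matrix. The second check samples random unit vectors, so it illustrates the lemma rather than proving it.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(7, 5))
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Lemma 1: sum of squared entries equals sum of squared singular values
print(np.isclose(np.sum(M**2), np.sum(S**2)))

# Lemma 2: no unit vector gets stretched by more than the largest singular value
xs = rng.normal(size=(5, 1000))
xs /= np.linalg.norm(xs, axis=0)             # 1000 random unit vectors as columns
stretch = np.linalg.norm(M @ xs, axis=0)     # ||M x|| for each of them
print(stretch.max() <= S[0] + 1e-12)

# ... and the first input direction v_1 achieves that maximum exactly
print(np.isclose(np.linalg.norm(M @ Vt[0]), S[0]))
```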

Now, let's prove the full theorem. Both spectral and Frobenius norm are unchanged by unitary operations, so we have $∥ M - M_{k} ∥ = ∥ S - A_{k} ∥$ for both types of norm (where $A_{k} = U^{T} M_{k} V$ has the same rank as $M_{k}$ ). If we choose $A_{k} = S_{k}$ (the diagonal matrix formed from the first $k$ singular values of $S$ ), then we get $∥ S - A_{k} ∥_{2} = σ_{k + 1}$ , and $∥ S - A_{k} ∥_{F}^{2} = \sum_{i = k + 1}^{r} σ_{i}^{2}$ . It remains to prove that we can't do better than this.

For a general rank- $k$ matrix $A_{k}$ , we can find a unit vector $f$ in the span of the standard basis vectors $(e_{1}, e_{2}, \dots, e_{k + 1})$ s.t. $f$ is in the nullspace of $A_{k}$ . Then we have:

$∥ S - A_{k} ∥_{2} \geq ∥ (S - A_{k}) f ∥_{2} = ∥ S f ∥ \geq ∥ σ_{k + 1} f ∥ = σ_{k + 1}$

proving the result for spectral norm. Similarly, we can find vectors $f_{k + 1}, . . ., f_{r}$ in the span of $(e_{1}, e_{2}, . . ., e_{r})$ which are all in the nullspace of $A_{k}$ . Using the invariance of the Frobenius norm to changes in basis, we have:

$∥ S - A_{k} ∥_{F}^{2} \geq \sum_{i, j = k + 1}^{r} (f_{i}^{T} (S - A_{k}) f_{j})^{2} = \sum_{i, j = k + 1}^{r} (f_{i}^{T} S f_{j})^{2} = \sum_{i = k + 1}^{r} σ_{i}^{2}$

so we're done.

^[1] Citation needed.
^[2] Note, the Eckart-Young Theorem is sometimes used to refer to the Frobenius norm, sometimes to the spectral norm, and sometimes to both.
^[3] For more on this, see Evan Chen's Napkin Linear Algebra section, or his rant on why matrices are not arrays of numbers.
^[4] Note that this terminology is not standard (as far as I know).
^[5] The proof is left as an exercise to the reader. Hint - start by writing the Lagrangian $L (x, λ) = ∥ M x - b ∥_{2}^{2} + λ (∥ x ∥_{2}^{2} - m^{2})$ .
^[6] There's a similar story if we had the constraint $∥ x ∥_{1} \leq m$ . Our solution here is found by trying to set $x_{v_{i}} = b_{u_{i}} / σ_{i}$ for as many $i = 1, 2, \dots$ as possible, until we hit our constraint. In other words, we only care about matching the components of $x$ along the most important input directions (unlike the $L_{2}$ -restricted case, where we just weighted the more important input directions more highly than the less important ones).
^[7] That is, if we're quantifying "best" by the norm of the approximation residual, for a choice of norm like Frobenius or $L_{2}$ (see the section on least squares).
^[8] They're eigenvectors of the matrix $M M^{T} = U S^{2} U^{T}$ .

1 comments

Comments sorted by top scores.

comment by Joseph Van Name (joseph-van-name) · 2023-07-08T19:59:42.674Z · LW(p) · GW(p)

The singular value decomposition is great and it even works well for complex and quaternionic matrices (and we can even generalize decompositions like the spectral, polar, and singular value decomposition and apply them to bounded linear operators between infinite dimensional Hilbert spaces), but to get a better appreciation of the singular value decomposition, we ought to analyze its deficiencies as well. I am currently researching other linear dimensionality reduction techniques (namely LSRDRs) that work well in the cases when the singular value decomposition is not the best technique to use for a linear dimensionality reduction. These cases include the following:

While the SVD approximates a matrix with a low rank matrix, it does not generalize that well to higher order SVDs that decompose tensors in $V_{1} \otimes \dots \otimes V_{n}$ where $V_{1}, \dots, V_{n}$ are vector spaces.
The SVD applies to linear mappings between inner product spaces, but the SVD does not take into account any additional structure that the linear mappings or inner product spaces have. For example, if we had a tuple of vectors $(v_{1}, \dots, v_{r})$ , then we may want to use a linear dimensionality reduction that does not just consider $(v_{1}, \dots, v_{r})$ as a matrix. For example, it is more meaningful to consider a weight matrix in a neural network as a tuple of row vectors or a tuple of column vectors than just a matrix without additional structure.
If we apply the principal component analysis to a collection $(v_{1}, \dots, v_{r})$ of vectors (with mean 0 for simplicity), then the $k$ -dimensional subspace $M$ that best approximates $(v_{1}, \dots, v_{r})$ may fail to cluster together. For example, suppose that $X_{1}, \dots, X_{s}, Y_{1}, \dots, Y_{s}$ are independent normally distributed random variables each with mean 0 where each $X_{j}$ has covariance matrix $I_{n} \oplus 0.1 \cdot I_{n}$ while each $Y_{j}$ has covariance matrix $0.1 \cdot I_{n} \oplus I_{n}$ . If we take a random sample $(x_{1}, \dots, x_{s}, y_{1}, \dots, y_{s})$ from $X_{1}, \dots, X_{s}, Y_{1}, \dots, Y_{s}$ and then perform a principal component analysis to $(x_{1}, \dots, x_{s}, y_{1}, \dots, y_{s})$ to find an $n$ -dimensional subspace of $R^{n} \oplus R^{n}$ , then the principal component analysis will not tell you anything meaningful. We would ideally want to use something similar to the principal component analysis but which instead returns subspaces that are near $R^{n} \oplus {0}$ or ${0} \oplus R^{n}$ . The principal component analysis returns the top $k$ -dimensional affine subspace of a vector space in magnitude, but the principal component analysis does not care if the canonical basis for these $k$ -dimensions form a cluster in any way.
Every real, complex, or quaternionic matrix has an SVD, and the SVD is unique (except in the case where we have repeated singular values). While mathematicians tend to like it when something exists and is unique, and computer scientists may find the existence and uniqueness of the singular value decomposition to be useful, the existence and uniqueness of the SVD does have its weaknesses (and existence and uniqueness theorems imply a sort of simplicity that is not the best indicator of good mathematics; good mathematics is often more complicated than what you would get from a simple existence and uniqueness result). One should consider the SVD as a computer program, programming language, or piece of code that always returns an output regardless of whether the output makes sense, without ever producing an error message. This makes it more difficult to diagnose a problem or determine whether one is using the correct tools in the first place, and this applies to the singular value decomposition as well. The poor behavior of an algorithm could also provide some useful information. For example, suppose that one is analyzing a block cipher round function $E$ using an algorithm $L$ . If the algorithm $L$ produces errors for complicated block cipher round functions but does not produce these errors for simple block cipher round functions, then the presence of one or more errors indicates that the block cipher round function $E$ is secure.
If $A$ is a real matrix, but we take the complex or quaternionic SVD of $A$ to get a factorization $A = U D V^{*}$ , then the matrices $U, V$ can be taken to be real orthogonal matrices instead of complex or quaternionic matrices. This means that the SVD of a matrix is always well-behaved. This is again a problem: well-behavedness does not necessarily mean that the singular value decomposition is useful for the circumstances in which we are using it, and the poor behavior of a process may provide useful information.
The SVD is not exactly new or cutting edge, so it will give one only a limited amount of information about matrices or other objects.

Let $K$ denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Suppose that $A_{1}, \dots, A_{r}$ are $n \times n$ -matrices with entries in $K$ . If $X_{1}, \dots, X_{r}$ are $d \times d$ -matrices, then define a superoperator $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) : M_{n, d} (K) \to M_{n, d} (K)$ by letting $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) (X) = A_{1} X X_{1}^{*} + \dots + A_{r} X X_{r}^{*}$ whenever $X \in M_{n, d} (K)$ . Define a partial (but nearly total) function $F_{A_{1}, \dots, A_{r}; K} : M_{d} (K)^{r} \to [0, \infty)$ by letting

$F_{A_{1}, \dots, A_{r}; K} (X_{1}, \dots, X_{r}) = \frac{ρ (Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))}{ρ (Φ (X_{1}, \dots, X_{r}))^{1 / 2}}$ . Here, $ρ$ denotes the spectral radius of a linear operator. We say that $(X_{1}, \dots, X_{r})$ is an $L_{2, d}$ -spectral radius dimensionality reduction (LSRDR) of type $K$ if the quantity $F_{A_{1}, \dots, A_{r}; K} (X_{1}, \dots, X_{r})$ is locally maximized.
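For $K = R$ or $K = C$ , the fitness function can be evaluated directly by writing the superoperator as a matrix. The sketch below is an illustration under one loudly stated assumption: the comment does not define $Φ$ , so the code takes $Φ (X_{1}, \dots, X_{r}) = Γ (X_{1}, \dots, X_{r}; X_{1}, \dots, X_{r})$ . The function names `gamma_matrix`, `spectral_radius`, and `fitness` are hypothetical helpers, not the author's code.

```python
import numpy as np

def gamma_matrix(As, Xs):
    """Matrix of X -> sum_j A_j X X_j^* acting on the column-major vec(X).

    Uses vec(A X B) = (B^T kron A) vec(X) with B = X_j^*, so B^T = conj(X_j).
    """
    return sum(np.kron(np.conj(X), A) for A, X in zip(As, Xs))

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def fitness(As, Xs):
    """F(X_1..X_r) = rho(Gamma(A; X)) / rho(Phi(X))^(1/2).

    ASSUMPTION: Phi(X_1..X_r) is taken to be Gamma(X_1..X_r; X_1..X_r);
    the comment itself leaves Phi undefined.
    """
    numerator = spectral_radius(gamma_matrix(As, Xs))
    denominator = np.sqrt(spectral_radius(gamma_matrix(Xs, Xs)))
    return numerator / denominator

rng = np.random.default_rng(0)
n, d, r = 4, 2, 3  # hypothetical sizes
As = [rng.standard_normal((n, n)) for _ in range(r)]
Xs = [rng.standard_normal((d, d)) for _ in range(r)]

f_original = fitness(As, Xs)
f_rescaled = fitness(As, [2.5 * X for X in Xs])  # F should not change under rescaling
print(f_original, f_rescaled)
```

Under this assumption the numerator scales linearly in the $X_{j}$ and the denominator does too (after the square root), so $F$ only depends on the $X_{j}$ up to a common scalar.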

One can compute LSRDRs using a flavor of gradient ascent. Don't worry. Taking an approximate gradient of $F_{A_{1}, \dots, A_{r}; K}$ is less computationally costly than it sounds, and the gradient ascent should converge to an LSRDR $(X_{1}, \dots, X_{r})$ . If the gradient ascent process fails to quickly converge to an LSRDR $(X_{1}, \dots, X_{r})$ , then LSRDRs may not be the best tool to use.
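The comment does not say which flavor of gradient ascent is used, so the following is only a toy sketch for $K = R$ : it swaps in a central finite-difference gradient and a backtracking step that accepts a move only if the fitness increases. As above, it assumes $Φ (X_{1}, \dots, X_{r}) = Γ (X_{1}, \dots, X_{r}; X_{1}, \dots, X_{r})$ , and all function names, sizes, and step sizes are hypothetical.

```python
import numpy as np

def gamma_matrix(As, Xs):
    # Matrix of X -> sum_j A_j X X_j^* on the column-major vec(X).
    return sum(np.kron(np.conj(X), A) for A, X in zip(As, Xs))

def rho(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def fitness(As, Xs):
    # ASSUMPTION: Phi(X) = Gamma(X; X); the comment leaves Phi undefined.
    return rho(gamma_matrix(As, Xs)) / np.sqrt(rho(gamma_matrix(Xs, Xs)))

def numerical_gradient(As, Xs, eps=1e-6):
    # Central finite differences, one entry of the (r, d, d) array at a time.
    grad = np.zeros_like(Xs)
    for idx in np.ndindex(Xs.shape):
        bump = np.zeros_like(Xs)
        bump[idx] = eps
        grad[idx] = (fitness(As, Xs + bump) - fitness(As, Xs - bump)) / (2 * eps)
    return grad

def ascend(As, Xs, steps=30, lr=0.5):
    best = fitness(As, Xs)
    history = [best]
    for _ in range(steps):
        grad = numerical_gradient(As, Xs)
        step = lr
        while step > 1e-12:  # backtrack until the fitness strictly improves
            cand = Xs + step * grad
            cand = cand / np.linalg.norm(cand)  # F is scale invariant, so this is harmless
            val = fitness(As, cand)
            if val > best:
                Xs, best = cand, val
                break
            step /= 2
        history.append(best)
    return Xs, history

rng = np.random.default_rng(0)
n, d, r = 4, 2, 3
As = rng.standard_normal((r, n, n))
X0 = rng.standard_normal((r, d, d))
X_final, history = ascend(As, X0)
print(history[0], history[-1])
```

Finite differences are far too slow for matrices of realistic size; the point is only to make the optimization problem concrete. The fitness history is non-decreasing by construction, which says nothing about whether the limit is a global maximum.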

We say that $(X_{1}, \dots, X_{r}), (Y_{1}, \dots, Y_{r})$ are projectively similar and write $(X_{1}, \dots, X_{r}) ≃_{K} (Y_{1}, \dots, Y_{r})$ if there is some $α \in Z (K)$ ( $Z (K)$ denotes the center of $K$ ) and some invertible matrix $R$ such that $X_{j} = α R Y_{j} R^{- 1}$ for $1 \leq j \leq r$ . Let $[X_{1}, \dots, X_{r}]_{K}$ denote the equivalence class containing $(X_{1}, \dots, X_{r})$ .
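Projectively similar tuples have the same fitness, which is why it makes sense to speak of equivalence classes of LSRDRs. The check below is a small numerical sketch for $K = R$ , again assuming $Φ (X_{1}, \dots, X_{r}) = Γ (X_{1}, \dots, X_{r}; X_{1}, \dots, X_{r})$ ; the matrix `R`, the scalar `alpha`, and the helper names are arbitrary illustrative choices.

```python
import numpy as np

def gamma_matrix(As, Xs):
    # Matrix of X -> sum_j A_j X X_j^* on the column-major vec(X).
    return sum(np.kron(np.conj(X), A) for A, X in zip(As, Xs))

def rho(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def fitness(As, Xs):
    # ASSUMPTION: Phi(X) = Gamma(X; X); the comment leaves Phi undefined.
    return rho(gamma_matrix(As, Xs)) / np.sqrt(rho(gamma_matrix(Xs, Xs)))

rng = np.random.default_rng(1)
n, d, r = 4, 2, 3
As = [rng.standard_normal((n, n)) for _ in range(r)]
Ys = [rng.standard_normal((d, d)) for _ in range(r)]

R = np.array([[2.0, 1.0], [0.5, 3.0]])  # any invertible d x d matrix
alpha = -1.7                             # any non-zero scalar in the center of K
R_inv = np.linalg.inv(R)
Xs = [alpha * R @ Y @ R_inv for Y in Ys]  # X_j = alpha R Y_j R^{-1}

f_Y = fitness(As, Ys)
f_X = fitness(As, Xs)
print(f_Y, f_X)
```

The two values agree up to rounding because conjugating every $X_{j}$ by $R$ conjugates both superoperators by an invertible map, and the scalar $α$ cancels between numerator and denominator.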

The equivalence class $[X_{1}, \dots, X_{r}]_{K}$ of an LSRDR of type $K$ of $(A_{1}, \dots, A_{r})$ is often unique. At the very least, one should only be able to find a few equivalence classes $[X_{1}, \dots, X_{r}]_{K}$ of LSRDRs of type $K$ of $(A_{1}, \dots, A_{r})$ , and the equivalence class $[X_{1}, \dots, X_{r}]_{K}$ of LSRDRs with highest fitness should also be the easiest to find. But if the equivalence class $[X_{1}, \dots, X_{r}]_{K}$ is far from being unique, then this should be an indicator that the notion of taking an LSRDR may not be the best tool to use for analyzing $(A_{1}, \dots, A_{r})$ , so one should try something else in this case.

If $A_{1}, \dots, A_{r}$ are all real matrices but $K = C$ , then the equivalence class $[X_{1}, \dots, X_{r}]_{K}$ of the LSRDR should contain a tuple $(Y_{1}, \dots, Y_{r})$ where each $Y_{i}$ is a real matrix. Given an LSRDR $(X_{1}, \dots, X_{r})$ , one can quickly test whether one should be able to find such a tuple $(Y_{1}, \dots, Y_{r})$ by computing $\frac{Tr (X_{i})}{Tr (X_{j})}$ . If $\frac{Tr (X_{i})}{Tr (X_{j})}$ is a real number (up to a rounding error), then that means that the LSRDR is well-behaved and perhaps an appropriate tool to use, but otherwise the LSRDR may not be the best tool to use.
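The trace-ratio test is cheap to implement. In the sketch below, `trace_ratios_are_real` is a hypothetical helper: it divides every trace by the trace of largest modulus (to avoid dividing by a near-zero trace) and checks that the imaginary parts vanish up to a tolerance. The two test tuples are synthetic stand-ins, not actual LSRDRs.

```python
import numpy as np

def trace_ratios_are_real(Xs, tol=1e-8):
    """Check that Tr(X_i) / Tr(X_j) is real (up to tol) for all pairs i, j.

    It is enough to compare every trace against one non-zero reference trace.
    """
    traces = np.array([np.trace(X) for X in Xs])
    ref = traces[np.argmax(np.abs(traces))]
    ratios = traces / ref
    return bool(np.all(np.abs(ratios.imag) <= tol))

rng = np.random.default_rng(0)
d, r = 2, 3

# A complex tuple that is projectively similar to a real tuple:
# X_j = alpha R Y_j R^{-1} with Y_j real, alpha and R complex.
Ys = [rng.standard_normal((d, d)) for _ in range(r)]
R = np.array([[1 + 1j, 0.5], [0.2j, 2 - 1j]])
alpha = 0.3 - 0.8j
R_inv = np.linalg.inv(R)
similar_to_real = [alpha * R @ Y @ R_inv for Y in Ys]

# A generic complex tuple with no such structure.
generic = [rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
           for _ in range(r)]

print(trace_ratios_are_real(similar_to_real))
print(trace_ratios_are_real(generic))
```

Passing the test is only a necessary condition: the ratios are real whenever a real representative exists, since the trace is invariant under conjugation and the scalar $α$ cancels, but real ratios alone do not prove that one exists.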

If we find our LSRDR $(X_{1}, \dots, X_{r})$ of type $K$ of $(A_{1}, \dots, A_{r})$ , then if everything works out well, there should be some matrices $R, S$ where $X_{j} = R A_{j} S$ for $1 \leq j \leq r$ and where $R S = λ \cdot 1_{d}$ and where $S R = λ \cdot P$ for some (not necessarily orthogonal) projection matrix $P$ and constant $λ \in Z (K)$ . If $λ = 1$ , then we say that $R, S$ is constant factor normalized; if $R, S$ is constant factor normalized, then $R S = 1_{d}, S R = P$ , so let us assume that $R, S$ is constant factor normalized to make everything simpler. Let $U_{R}$ be the dominant eigenvector of $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r})$ , and let $U_{L}$ be the dominant eigenvector of $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r})^{*}$ . Then there are positive semidefinite matrices $G, H$ and non-zero constants $μ_{G}, μ_{H} \in Z (K)$ where $U_{R} S^{*} = μ_{H} \cdot H, U_{L} R = μ_{G} \cdot G$ . The projection matrix $P$ should be recovered from the positive semidefinite matrices $G, H$ since $Im (H) = Im (P), Im (G) = ker (P)^{⊥}$ , and the positive semidefinite matrices $G, H$ (up to a constant real factor) should be uniquely determined. The positive semidefinite matrices $G, H$ should be considered to be the dominant clusters of dimensions for $(A_{1}, \dots, A_{r})$ .

Order 2 tensors: Suppose that $v_{1}, \dots, v_{r} \in V$ for some finite dimensional real inner product space $V$ . Then set $A_{j} = v_{j} v_{j}^{*}$ for $1 \leq j \leq r$ . Then $G = H$ , so the positive semidefinite matrix $G$ is our desired dimensionality reduction of $v_{1}, \dots, v_{r}$ . For example, if $M$ is a weight matrix in a neural network, then we can make $v_{1}, \dots, v_{r}$ the columns of $M$ , or we can make $v_{1}, \dots, v_{r}$ the transposes of the rows of $M$ . Since we apply activation functions before and after we apply $M$ , it makes sense to separate $M$ into rows and columns this way. And yes, I have performed computer experiments that indicate that for $A_{j} = v_{j} v_{j}^{*}$ , the matrices $G, H$ do represent a cluster of dimensions (at least sometimes) rather than simply the top $d$ dimensions. I have done the experiment where $(v_{1}, \dots, v_{r}) = (x_{1}, \dots, x_{s}, y_{1}, \dots, y_{s})$ and in this experiment, the matrices $G, H, P$ (up to a constant factor for $G, H$ ) are all approximately the projection matrix that projects onto the subspace $R^{n} \oplus \{0\}$ .

Order 3 tensors: Suppose that $V, W$ are finite dimensional real or complex inner product spaces and $A : V \to V \otimes W$ is a linear mapping. Observe that $L (V, V \otimes W)$ is canonically isomorphic to $V \otimes V \otimes W$ . Now give $W$ an orthonormal basis $e_{1}, \dots, e_{r}$ , and set $A_{j} = (1_{V} \otimes e_{j}^{*}) A$ for $1 \leq j \leq r$ . Then one can apply an LSRDR to $A_{1}, \dots, A_{r}$ to obtain the positive semidefinite matrices $G, H$ . The positive semidefinite matrices $G, H$ do not depend on the orthonormal basis $e_{1}, \dots, e_{r}$ that we choose. For example, suppose that $O_{1}, O_{2}$ are open subsets of Euclidean spaces of possibly different dimensions and $f : O_{1} \to O_{2}$ is a $C^{2}$ -function where there are $f_{1}, \dots, f_{r} : O_{1} \to R$ where $f (x) = (f_{1} (x), \dots, f_{r} (x))$ for each $x \in O_{1}$ . Then let $A_{j} = H (f_{j}) (x)$ for $1 \leq j \leq r$ where $H (f_{j})$ denotes the Hessian of $f_{j}$ . Then the matrices $G, H$ of an LSRDR of $A_{1}, \dots, A_{r}$ represent a cluster of dimensions in the tangent space at the point $x$ .
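One piece of the basis-independence claim can be checked directly. In the real case, store the order 3 tensor as an array of shape `(r, n, n)` whose $j$ -th slice is $A_{j}$ . Changing the orthonormal basis of $W$ by an orthogonal matrix $Q$ replaces $A_{j}$ with $\sum_{k} Q_{j k} A_{k}$ , and if the $X_{j}$ are mixed by the same $Q$ then the superoperator $Γ$ is unchanged. The sketch below verifies this numerically; the names, sizes, and random data are hypothetical, and the check is consistent with, but not a proof of, the statement about $G, H$ .

```python
import numpy as np

def gamma_matrix(As, Xs):
    # Matrix of X -> sum_j A_j X X_j^* on the column-major vec(X).
    return sum(np.kron(np.conj(X), A) for A, X in zip(As, Xs))

rng = np.random.default_rng(2)
n, d, r = 4, 2, 3

# Slices of an order 3 tensor: As[j] plays the role of A_j = (1_V ⊗ e_j^*) A.
As = rng.standard_normal((r, n, n))
Xs = rng.standard_normal((r, d, d))

# A random orthogonal change of basis on W (Q from a QR factorization).
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))

# New slices: A'_j = sum_k Q[j, k] A_k, and likewise for the X_j.
As_rot = np.einsum('jk,kab->jab', Q, As)
Xs_rot = np.einsum('jk,kab->jab', Q, Xs)

same_gamma = np.allclose(gamma_matrix(As, Xs), gamma_matrix(As_rot, Xs_rot))
print(same_gamma)
```

The cancellation comes from $Q^{T} Q = 1$ : the cross terms $A_{k} X X_{l}^{*}$ with $k \neq l$ drop out when the sum over $j$ is taken.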

Order 4 tensors: Given a vector space $V$ , let $L (V)$ denote the collection of linear maps from $V$ to $V$ . Let $V$ be a finite dimensional complex inner product space. Then there are various ways to put $V \otimes V \otimes V \otimes V$ into a canonical one-to-one correspondence with $L (L (V))$ . Furthermore, the Choi representation gives a one-to-one correspondence between the completely positive operators in $L (L (V))$ and the positive semidefinite operators in $L (V \otimes V)$ . An operator $E \in L (L (V))$ is completely positive if and only if there are $A_{1}, \dots, A_{r} \in L (V)$ where $E (X) = A_{1} X A_{1}^{*} + \dots + A_{r} X A_{r}^{*}$ for all $X \in L (V)$ . Therefore, whenever $E$ is completely positive, we compute a complex LSRDR $(X_{1}, \dots, X_{r})$ of $(A_{1}, \dots, A_{r})$ , and we should get matrices $R, S, P, G, H$ , and $G, H$ give us our desired dimensionality reduction. Of course, given an order 4 tensor, one has to ask whether it is appropriate to use LSRDRs for this order 4 tensor, and one should ask about the best way to use these order 4 tensors to produce an LSRDR.

If this comment were not long enough already, I would give an explanation for why I believe LSRDRs often behave well, but this post is really about the SVDs so I will save my math for another time.
