Insights from Linear Algebra Done Right

post by sil ver (sil-ver) · 2019-07-13T18:24:50.753Z · score: 53 (23 votes) · LW · GW · 15 comments

This book has previously been discussed by Nate [LW · GW] Soares [LW · GW] and TurnTrout [LW · GW]. In this... review? report? detailed journal entry? ... I will focus on the things which stood out to me. (Definitely not meant to be understandable for anyone unfamiliar with the field.) The book has ten chapters; I did most of the exercises in chapters 6 to 10. I got through that half of the book in about 3 weeks, which was nice since the previous (harder and longer) textbook on topology I worked through took me about a year.

I've previously taken two introductory courses to liner algebra and came out thinking of the field as large and very unstructured, with lots of concepts and theorems floating around disconnectedly. What is a normal matrix? I might or might not have remembered the definition, but I certainly had no idea what it is good for. Three weeks of intense study with this book has improved my understanding dramatically. Now, the field seems structured and pretty intuitive (and I certainly won't forget what a normal operator is). It is truly hard to overstate how good of a job this book does teaching the subject, compared to what I've seen before. It's probably the best textbook on anything I've read so far.

Chapter 1 introduces complex numbers and vectors spaces.

Chapter 2 introduces finite-dimensional vector spaces. What stands out is just how nicely behaved they are. Every finite-dimensional real or complex vector space is isomorphic to either or . Take any linearly independent vectors, and they span a subspace of dimension . If they're not linearly independent, then one vector is in the span of the previous vectors. Any list of linearly independent vectors make up a basis for the entire space. Basically every result one would want to hold when looking back onto the material does actually hold.

Like most fields in math, Linear Algebra ultimately cares about studying functions, and once again it only cares about a tiny subset of all possible functions. In this case it is linear maps, which are the focus of Chapter 3. These are functions such that and for all and , where and are vector spaces over some field (in this book, always or ).

Brief comparison to topology: The word "continuity" is only mentioned in the final chapter of this book, but every linear map can be shown to be continuous if the topologies on both vector space are given by an inner product. Actually, every inner product induces a norm, every norm induces a metric, any metric induces a topology, but none of the reverse steps are true. Put differently, metric topologies are particularly nice topologies, metrics from norms are particularly nice metrics, norms from inner-products are particularly nice norms, and inner products are the concept studied in chapters 6 and 7. So Linear Algebra only even bothers with the the super nicely behaved stuff.

I'll now detail my biggest confusion with LA before reading this book (this is something I've even brought up explicitly, but without getting a satisfying answer): if is a linear map, then it is actually possible to determine its behavior entirely by fixing bases of and of and storing six numbers

which tell us that and . (This determines the behavior of on all vectors because of linearity of and the fact that and are bases). The matrix is dependent on the two bases chosen here (given these bases, the set of all linear maps from to is bijective to that of 3-by-2 matrices).

Linear maps aren't relative to bases. Vectors were introduced as elements of , so not relative to bases either, but then they were introduced again in "coordinate form", where we write ( is our basis) to mean " times the first basis vector plus times the second," and suddenly a dependence to our basis has been created. My confusion was then: is the in the index just a reminder that is understood to be relative to that basis, or is it a function that maps the vector to the vector ? In the former case, how should one differentiate between coordinate vectors and non-coordinate vectors? (We've used the same notation.) And how do we specify the vectors in a basis? If I write and that means "1 times the first basis vector, 0 times the second" while trying to specify my basis, then clearly that's no good. And the latter case is even worse: now matrix-with-vector-multiplication is ill-defined. Consider the standard basis on and the basis with both vectors reversed. Then we have

But, clearly

So we have a flat out contradiction. And it gets worse yet if is a matrix and one defines a linear map by , which was actually done in one of my past lectures.

So that was the problem. In contrast, here's what the book does. Vectors are elements in , not relative to bases. Maps are functions , likewise not relative to bases. Matrices are a bunch of numbers put into an array, and one defines the matrix of a linear map , denoted , in the way I've described above (so relative to two bases). Most importantly, there is no matrix-vector multiplication. There is matrix-matrix multiplication, which is of course fine because both matrices are relative to bases if they come from linear maps. To "apply" a matrix to a vector, one defines, given a vector and a basis , the matrix of a vector as the unique -by- matrix

whose entries are the scalars needed to write as a sum of basis vectors of . Then one proves that . Note the square brackets: the book uses brackets for lists (lists are the same as tupels, i.e. elements of and vectors are lists), and brackets for matrices only, which is a solid convention. Does this approach resolve this problem to full satisfaction and without introducing any new difficulties? I think so.

Chapter 4 introduces polynomials. It doesn't have much LA in it, but understanding polynomials well is important for LA, because polynomials can be applied to operators (those are linear maps of the form , i.e. from a vector space into itself). This is so because, given , we can define if and and (the identity map).

The other special thing about the book – and the justification the author gives for calling it "Linear Algebra Done Right" – is that determinants aren't introduced until the final chapter. This is very unusual: most courses, such as the lectures I took, frequently use determinants in proofs. But all of these proofs can be done without them. And they tend to be fairly simple, too; only a couple are longer than a page.

Doing this, one finds that there are striking parallels between properties about polynomials and properties about linear vector spaces, which seems a lot better for developing a useful intuition than what one gets by doing determinant-based proofs. The central theorem about operators introduced in Chapter 5, for example, states that every operator on a finite-dimensional complex vector space has an eigenvalue (that is, a scalar such that there exists a nonzero vector such that ). This can be proven by essentially applying the fundamental theorem of algebra. One chooses any nonzero vector and constructs the list , where . These vectors cannot be linearly independent because they are many, so there exist scalars such that . Now (or rather, after doing some preliminary work) the fundamental theorem of algebra kicks in, and we can reformulate this as where the are the roots of the polynomial . Since the RHS equals 0, one of the isn't injective, and therefore has the respective as an eigenvalue. This is done properly in the book in relatively little space. It also shows why the same theorem fails to hold on real vector spaces: the reformulation of the polynomial to might no longer be possible.

Chapter 6 introduces inner products. These are the generalized formalization underlying the concept of angle between vectors. There is a famous inequality, called the Cauchy-Schwartz inequality, which states that . The book gives a very nice proof for this that I find easier to remember than the proofs I've seen previously. It is based on the orthogonal decomposition of a vector, where if and are any nonzero vectors, then equals for some vector and some scalar such that and are orthogonal to each other. I hadn't heard of this before and it's quite useful. If one remembers the formula for the orthogonal decomposition, plugs it into the inner product and does the obvious things, the Cauchy-Schwartz inequality pops out at the other end.

The absolute most nicest things ever studied anywhere are orthonormal bases. These are lists of vectors such that a) they are all linearly independent, b) they span the entire space (follows from a) if there are many), c) they are all orthogonal to each other and d) they all have length 1. Because this is LA, orthonormal bases always exist and there is even an explicit algorithm to construct one from any existing non-orthonormal basis. Something really cool that the book does with this (which also requires a neat theorem about minimal distance of a vector to a subspace) is to construct the polynomial of degree at most 5 that is the most similar to the function on the interval (based on some reasonable definition of similarity involving integrals). I say this is really cool because I found it to be a surprisingly strong result, and a surprising kind of result coming out of linear algebra, given that there is really nothing linear about the sin function. On the interval , this approximation looks indistinguishable to sin to the normal eye.

Back to orthonormal bases: what their existence means is that, whenever one tries to prove anything involving inner product spaces, one can simply say something like, "let be an orthonormal basis for and let be an orthonormal basis for ", and then proceed to reduce whatever argument one needs to the basis elements. This feels very powerful, even if arbitrarily difficult problems exist regardless.

With Chapter 7, we're getting to the big classification theorems. LA wants to study linear maps, and in this chapter it classifies all of the linear maps such that there exists an orthonormal basis relative to which they have diagonal form. This is the aforementioned spectral theorem. These are the normal operators on complex vector spaces, and the self-adjoint operators (which is a strictly stronger property) on real vector spaces. This highlights another recurring pattern: there are often two versions of each result, one super nice one about complex vector spaces, and a less nice, more cumbersome one about real vector spaces. This makes a lot of sense given that is far more nicely behaved than (this is kind of why we defined complex numbers in the first place). More specifically, it usually comes down to the fact that every polynomial of degree has a root in , but not necessarily in , just like it did in the aforementioned result on the existence of eigenvalues. The book also draws an interesting parallel between self-adjoint operators and real numbers.

In Chapters 8 and 9, we return to the study of eigenvalues. Basically, the problem is as follows: given an operator , one would like to find a decomposition of into smaller subspaces such that a) each vector can be uniquely written as a weighted sum of the , i.e. with and b) the operator maps each subspace into itself, i.e. . If this is achieved, then the behavior of is determined entirely by the behavior of the (i.e. restricted to ); and this is great because the are much easier to handle than . In fact, if the are one-dimensional vector spaces (i.e. just straight lines, if interpreted geometrically) then just multiplies each vector with a scalar. In other words, . This is of course just the equation that says that is an eigenvalues of with nonzero eigenvector , and so that's the reason why eigenvalues are of interest. Conversely, given an eigenvalue with eigenvector , the corresponding one-dimensional subspace is simply .

As stated before, every operator on a complex vector spaces has some eigenvalue. But does it have different eigenvalues, so that ? (The symbol means "direct sum".) In "most" cases, the answer is yes – by which I mean that, if one generates an operator randomly (by sampling the coefficients of its matrix relative to the standard basis randomly out of some large finite set), then as one increases the size of that set (for example by switching from 32 bit to 64 bit to 128 bit floating point arithmetic and so on), the probability of generating an operator for which this doesn't work converges to 0. And I believe it is also possible to define a non-discrete probability space such as , give it the uniform distribution, and then prove that the set of operators for which this doesn't work has probability mass 0.

So for "typical" operators, this is always true, but there do exist specific operators for which it isn't. One such example is the operator whose matrix wrt the standard basis is . This operator has the eigenvalue but the only eigenvectors are of the form . Thus there exists one one-dimensional subspace that is invariant under , namely , but not a second one, so there isn't any decomposition of into two invariant subspaces.

To remedy this, Chapter 8 introduces generalized eigenvectors. Observe that the property can also be written as (where is the identity map) or as , where , i.e. is the one-dimensional vector space containing . The equation above then says that, if we take , restrict it to and subtract times the identity operator, we get the zero operator (so the 0 on the right denotes the null map ). This reformulation leads to the non-obvious definition of a generalized eigenvector: given an eigenvalue of with corresponding one-dimensional subspace , a generalized eigenvector of is a vector such that for some . Which is to say, subtracting times the identity operator from doesn't have to yield the null operator immediately, but it does have to yield the null operator if the resulting operator is applied several times.

In our example above, only vectors of the form are eigenvectors, but all vectors are generalized eigenvectors, because is the operator whose matrix wrt to the standard basis is , and this operator equals the null operator if it is applied twice, i.e. .

If all of the above is done properly, then one gets something that comes very close to the perfect decomposition for any operator on a complex vector space. This is the theory behind the famous Jordan form for matrices: it is not quite a diagonal matrix (which corresponds to the perfect decomposition), but it almost is. It just has a bunch of additional 1s to deal with the generalized eigenvectors.

Chapter 9 then does a similar thing for real vector spaces. As always, it's more cumbersome and the result is weaker, but it's still quite strong.

At last, Chapter 10 introduces determinants, and it's actually kind of boring! All of the interesting LA has already been done, so this chapter feels like a mere afterthought. Again, one doesn't seem to need determinants to do basic LA.


The next big item on my list is Understanding Machine Learning, which is also from Miri's guide. I've so far neglected trying to get familiar with actual AI research, and it's time to remedy that, or at least get a solid grasp of the basics.

15 comments

Comments sorted by top scores.

comment by crabman · 2019-07-14T18:42:50.106Z · score: 8 (3 votes) · LW · GW

I wonder, what do you think about the chapter about dual spaces, dual maps, annihilator, etc.? To me it seemed not too connected with everything else, and that's bad. If I remember correctly, the author uses duality just to prove a few results and then throws duality away and never uses it again. Also in real life (numerical linear algebra, machine learning, and stuff) I am not aware of any use for those concepts.

So for “general” operators, this is always true, but there do exist specific operators for which it isn’t.

I believe when mathematicians say that in general P(x) holds, they mean that for any x in the domain of interest P(x) holds. Perhaps you want to you typical instead of general here. E.g. there is a notion called typical tensor rank of tensors of given shape, which means a tensor rank which occurs with non-zero probability when a random tensor of given shape is sampled.

comment by habryka (habryka4) · 2019-07-14T18:46:26.405Z · score: 2 (1 votes) · LW · GW

I've used duality at least a bit in my subsequent ML classes, so I was happy that the book covered it. I do think I remember it not being used super much in the rest of the book.

comment by crabman · 2019-07-14T18:59:19.663Z · score: 1 (1 votes) · LW · GW

Do you mean solving convex optimization problems by solving their dual problems instead?

comment by habryka (habryka4) · 2019-07-14T19:16:50.940Z · score: 2 (1 votes) · LW · GW

Yeah, that was one of the applications.

comment by crabman · 2019-07-14T19:22:49.628Z · score: 3 (2 votes) · LW · GW

Ok. It's just that when I learned that, we didn't even talk about dual spaces in linear algebraic sense, we worked just fine in .

comment by sil ver (sil-ver) · 2019-07-14T21:02:41.828Z · score: 1 (1 votes) · LW · GW
I wonder, what do you think about the chapter about dual spaces, dual maps, annihilator, etc.?

Nothing, because it wasn't in the material. I worked through the second edition of the book, and the parts on duality seem to be new to the third edition.

I believe when mathematicians say that in general P(x) holds, they mean that for any x in the domain of interest P(x) holds. Perhaps you want to you typical instead of general here. E.g. there is a notion called typical tensor rank of tensors of given shape, which means a tensor rank which occurs with non-zero probability when a random tensor of given shape is sampled.

Thanks for that, I changed it.

comment by SatvikBeri · 2019-07-14T22:47:31.261Z · score: 7 (3 votes) · LW · GW

For the orthogonal decomposition, don't you need two scalars? E.g. . For example, in , let Then , and there's no way to write as

comment by sil ver (sil-ver) · 2019-07-14T22:56:36.507Z · score: 2 (2 votes) · LW · GW

Ow. Yes, you do. This wasn't a typo either, I remembered the result incorrectly. Thanks for pointing it out, and props for being attentive enough to catch it.

Or to be more precise, you only need one scalar, but the scalar is for not , because isn't given. The theorem says that, given and , there is a scalar and a vector such that and is orthogonal to .

comment by Hazard · 2019-07-13T21:28:30.181Z · score: 7 (4 votes) · LW · GW

Yay learning more math and posting about it!

comment by limerott · 2019-07-16T13:42:15.850Z · score: 2 (2 votes) · LW · GW

Your struggles with coordinate vectors sound familiar. I remember that, in the book I used, they introduced for a basis and called a coordinate vector of iff ( is left general on purpose). This explicit formulation as a simple bijective function cleared up the confusion for me. You can do neat things with it like turn an abstract map into a map from one basis to another: for linear .

But of course, from a practical standpoint, you don't think about vectors this way. As with proofs, once you heard a plausible argument and accepted it, there is no reason to go back to them. So, I am somewhat critical of your book not introducing matrix-vector multiplication, as this is an obvious centerpiece of linear algebra (especially considering that its historic roots lie in solving Ax=b).

comment by sil ver (sil-ver) · 2019-07-16T19:18:59.515Z · score: 1 (1 votes) · LW · GW

That looks like it also works. It's a different philosophy I think, where LADR says "vectors and matrices are fundamentally different objects and vectors aren't dependent on bases, ever" and your view says "each basis defines a bijective function that maps vectors from the no-basis world into the basis-world (or from the basis1 world into the basis2 world)" but it doesn't insist on them being fundamentally different objects. Like if then they're the same kind of object, and you just need to know which world you're in (i.e. relative to which basis, if any, you need to interpret your vector to).

II don't think not having matrix-vector multiplication is an issue. The LADR model still allows you to do everything you can do in normal LA. If you want to multiply a matrix with a vector , you just make into the n-by-1 matrix and then multiply two matrices. So you multiply rather than . It forces you to be explicit about which basis you want the vector to be relative to, which seems like a good thing to me. If is the standard basis, then will have the same entries as , it'll just be written as rather than .

comment by TheMajor · 2019-07-17T11:40:38.020Z · score: 1 (1 votes) · LW · GW

Just to share my two cents on the matter, the distinction between abstract vectors and maps on the one hand, and columns with numbers in them (confusingly also called vectors) and matrices on the other hand, is a central headache for Linear Algebra students across the globe (and by extension also for the lecturers). If the approach this book takes works for you then that's great to hear, but I'm wary of `hacks' like this that only supply a partial view of the distinction. In particular matrix-vector mulitplication is something that's used almost everywhere, if you need several translation steps to make use of this that could be a serious obstacle. Also the base map that limerott mentions is of central importance from a category-theoretic point of view and is essential in certain more advanced fields, for example in differential geometry. I'm therefore not too keen on leaving it out of a Linear Algebra introduction.

Unfortunately I don't really know what to do about this, like I said this topic has always caused major confusion and the trade-off between completeness and conciseness is extremely complicated. But do beware that, based on only my understanding of your post, you might still be missing important insights about the distinction between numerical linear algebra and abstract linear algebra.

comment by sil ver (sil-ver) · 2019-07-17T19:07:00.464Z · score: 2 (3 votes) · LW · GW

I honestly don't think the tradeoff is real (but please tell me if you don't find my reasons compelling). If I study category theory next and it does some cool stuff with the base map, I won't reject that on the basis of it contradicting this book. Ditto if I actually use LA and want to do calculations. The philosophical understanding that matrix-vector multiplication isn't ultimately a thing can peacefully coexist with me doing matrix-vector multiplication whenever I want to. Just like the understanding that the natural number 1 is a different object from the integer number 1 peacefully coexists with me treating them as equal in any other context.

I don't agree that this view is theoretically limiting (if you were meaning to imply that), because it allows any calculation that was possible before. It's even compatible with the base map.

comment by limerott · 2019-07-17T06:17:38.988Z · score: 1 (1 votes) · LW · GW

Ah, I see. So vectors are treated like abstract objects and representing them in a matrix-like form is an additional step. And instead of coordinate vectors, which may be confusing, you only work with matrices. I can imagine that this is a useful perspective when you work with many different bases. Thank you for sharing it.

Would you then agree to define where is the standard basis?

comment by sil ver (sil-ver) · 2019-07-17T18:59:06.996Z · score: 1 (1 votes) · LW · GW

I wouldn't be heartbroken if it was defined like that, but I wouldn't do it if I were writing a textbook myself. I think the LADR approach makes the most sense – vectors and matrices are fundamentally different – and if you want to bring a vector into the matfrix world, then why not demand that you do it explicitly?

If you actually use LA in practice, there is nothing stopping you from writing . You can be 'sloppy' in practice if you know what you're doing while thinking that drawing this distinction is a good idea in a theoretical text book.