# Linear Algebra Done Right, Axler

post by David Udell · 2023-01-02T22:54:58.724Z · LW · GW · 6 comments

## Contents

  Contents and Notes
1. Vector Spaces
2. Finite-Dimensional Vector Spaces
3. Linear Maps
The Matrix of a Linear Map
4. Polynomials
5. Eigenvalues and Eigenvectors
Polynomials Applied to Operators
Upper-Triangular Matrices
Diagonal Matrices
6. Inner-Product Spaces
7. Operators on Inner-Product Spaces
8. Operators on Complex Vector Spaces
9. Operators on Real Vector Spaces
10. Trace and Determinant
None


Epistemic status: A brisk walkthrough of (what I take to be) the highlights of this book's contents.

The big one for mathematically understanding ML!

The idea responsible for getting me excited about linear algebra is:

Linear maps are the homomorphisms between vector spaces.

Linear algebra is about the tripartite relationship between (1) homomorphisms between vector spaces, (2) sets of equations, and (3) grids of numbers.

However, grids of numbers ('matrices'), the usual star of the show in a presentation of linear algebra, aren't foregrounded in this book. Instead, this is a book chiefly treating the homomorphisms ('linear maps') themselves, directly.

# Contents and Notes

## 1. Vector Spaces

Vector spaces are fairly substantial mathematical structures, if you're pivoting out of thinking about set theory! Intuitively, a vector space is a space  for which (1) ray addition and (2) scaling rays (emanating from the origin out to points) are both nicely defined.

Precisely, a vector space is a set  defined over a field  in which

1. V is closed under vector addition, and vector addition is commutative, associative, there is an additive identity , and there is an additive inverse for every vector ;
2. V is closed under scalar multiplication, scalar multiplication is associative, and there is a multiplicative identity ;
3. and vector addition and scalar multiplication are connected by distribution such that, for all  and ,

A subspace  of a vector space  is any subset  that is still itself a vector space, under the same two operations of . Vector spaces can be decomposed into their subspaces, where you think of adding vectors drawn the different subspace via their common addition operation.

## 2. Finite-Dimensional Vector Spaces

You live at the origin of , and your tools are the vectors that emanate out from your home. Because we have both vector addition and scalar multiplication, we have two ways of extending (or shortening) any single vector out from the origin arbitrarily far. If we're interested in reaching points in , one immediate way to get to points we didn't have a vector directly to... is by extending a too-short vector pointed in the right direction! Furthermore, because we can always multiply a vector by  to reverse its direction, both the exactly right and exactly wrong directions will suffice to reach out and touch a point in .

We can also use vector addition to add two vectors pointing off in differing directions (directions which aren't exact opposites). If we have vectors , and , we have all the tools we need to produce any vector in ! The awkward lengths of all the vectors are irrelevant, because we can scale all of them arbitrarily. We use some amount of vertical, horizonal, and -dimensional displacement to get to anywhere via addition and multiplication! More formally, we say that the set  spans .

Intuitively, a minimal spanning set is called a basis for a vector space.  is a basis for the vector space , because none of the vectors are "redundant": you could not produce every vector in  without all three elements in . If you added any further vector to that spanning set, though, the set would now have a redundant vector, as  is already spanned. The set would no longer be a minimal spanning set in this sense, and so would cease to be a basis for .

Every finite-dimensional, nonzero vector space containing infinitely many vectors has infinitely many bases (pp. 29-32). Each basis for an -dimensional vector space is a set containing  vectors, where each vector is an ordered set containing  numbers drawn from  (p. 32).

## 3. Linear Maps

Intuitively, a linear map is a function that translates addition and multiplication between two vector spaces.

Formally, a linear map  is a function from a vector space  to a vector space  (taking vectors and returning vectors) such that

for all ; all ; and all . Note that both are homomorphism properties: one for addition across vector spaces and one for multiplication across vector spaces! We'll call the former relationship additivity, the latter, homogeneity.

The symbol  stands for the set of all the linear maps from  to .

Some example linear maps (pp. 38-9) include:

When the vector spaces are specifically the set of all real-valued polynomials :

translating between  and .

As linear maps are functions, they can be composed when they have matching domains and co-domains, giving us our notion of products between linear maps.

The kernel of a linear map  is the subset  containing all and only the vectors  that  maps to . Note that linear maps can only "get rid" of vectors by shrinking them down all the way, i.e., by sending them to . If a function between vector spaces simply sent everything to a nonzero vector, it would violate the linear map axioms! All kernels are subspaces of  (p. 42). A linear map is injective whenever  (p. 43).

The image  of  is the subset of  covered by some . All images are subspaces of  (p. 44). A linear map is obviously surjective whenever .

### The Matrix of a Linear Map

A matrix  is an array of numbers, with  rows and  columns:

(Matrices are a generalization of vectors into the horizontal dimension, and vectors can be thought of as skinny -by- matrices.)

Let . Suppose that  is a basis of  and  is a basis of . For each , we can write  uniquely as a linear combination of the 's:

where  for . The scalars  completely determine the linear map  because a linear map is determined by its values on a basis. The -by- matrix  formed by the 's is called the matrix of  with respect to the bases  and ; we denote it by

If you think of elements of  as columns of  numbers, then you can think of the th column of  as  applied to the th basis vector (pp. 48-9; notation converted to our own.)

The vector , with matrix multiplication on the right side of the equation (pp. 53-4).

## 5. Eigenvalues and Eigenvectors

We now begin our study of operator theory!

Any vector space  we discuss from here on out will be neither the zero vector space  nor an infinite-dimensional vector space.

Operators are linear maps from  to itself. Notationally, .

We call a subspace  invariant under  if, for all ,

Let  now specifically be a one-dimensional subspace of  such that, fixing any nonzero ,

Then  is a one-dimensional subspace of , and every one-dimensional subspace of  is of this form. If  and the subspace  defined by  is invariant under , then  must be in , and hence there must be a scalar  such that . Conversely, if  is a nonzero vector in  such that  for some , then the subspace S defined by  is a one-dimensional subspace of  invariant under  (p. 77; notation converted).

In the above equation , the scalar  is called an eigenvalue of , and the corresponding vector  is called an eigenvector of .

Because  is equivalent to , we see that the set of eigenvectors of  corresponding to  equals . In particular, the set of eigenvectors of  corresponding to  is a subspace of .

[For example,] if , then  has only one eigenvalue, namely, , and every vector is an eigenvector for this eigenvalue (p. 77-8; notation converted).

### Polynomials Applied to Operators

The main reason that a richer theory exists for operators… than for linear maps is that operators can be raised to powers (p. 80).

An operator raised to a power  is just that operator composed with itself  times.

Because we have a notion of functional products, functional sums, and now operators raised to powers, we can now construct arbitrary polynomials with operators as the variables!

### Upper-Triangular Matrices

A square matrix is an  matrix.

An upper-triangular matrix is a square matrix for which all entries under the principal diagonal equal .

Every operator on a finite-dimensional, nonzero, complex vector space has an eigenvalue (p. 81).

Suppose  is a complex vector space and . Then  has an upper-triangular matrix with respect to some basis of  (p. 84; notation converted).

Suppose  has an upper-triangular matrix with respect to some basis of . Then the eigenvalues of  consist precisely of the entries on the diagonal of that upper-triangular matrix (p. 86; notation converted).

### Diagonal Matrices

A diagonal matrix is a square matrix for which all entries off the principal diagonal equal .

If  has  distinct eigenvalues, then  has a diagonal matrix with respect to some basis of  (p. 88; notation converted).

## 6. Inner-Product Spaces

For , the dot product of  and , denoted by , is defined by

where  is the th entry in , and similarly for  and  (p. 98; notation converted).

Inner products are just a generalization of dot products to arbitrary vector spaces . (With some finagling, both dot products and inner products generally can be interpreted as linear maps.) An inner-product space is an ordered set containing a vector space  and an inner product on it.

Intuitively, the norm of a vector is the length of that vector, interpreted as a ray, from the origin to its tip. More formally, the norm of a vector  in an inner-product space is defined to be the square root of the inner product of that vector  with itself:

Note that this looks just like , the Pythagorean theorem for the sides  of a right triangle in Euclidian space. That's because other inner products on other vector spaces are meant to allow for a generalization of the Pythagorean theorem in those vector spaces!

Intuitively, two vectors are orthogonal when they're perpendicular. Formally, two vectors are called orthogonal when their inner product is . With the opposite and adjacent sides  of the right unit triangle in the vector space ,

"It's all just right triangles, dude."

## 7. Operators on Inner-Product Spaces

Complex Spectral Theorem: Suppose that  is a complex inner-product space and . Then  has an orthonormal basis consisting of eigenvectors of  if and only if  is normal (p. 133; notation converted).

Real Spectral Theorem: Suppose that  is a real inner-product space and . Then  has an orthonormal basis consisting of eigenvectors of  if and only if  is self-adjoint (p. 136; notation converted).

In other words, to multiply together two block diagonal matrices (with the same size blocks), just multiply together the corresponding entries on the diagonal, as with diagonal matrices.

A diagonal matrix is a special case of a block diagonal matrix where each block has size -by-. At the other extreme, every square matrix is a block diagonal matrix because we can take the first (and only) block to be the entire matrix... The smaller the blocks, the nicer the operator (in the vague sense that the matrix then contains more 's). The nicest situation is to have an orthonormal basis that gives a diagonal matrix (p. 143).

The singular values of  are the eigenvalues of , where each eigenvalue  is repeated  times (p. 155).

Every operator on  has a diagonal matrix with respect to some orthonormal bases of , provided that we are permitted to use two different bases rather than a single basis as customary when working with operators (p. 157).

## 9. Operators on Real Vector Spaces

We have defined eigenvalues of operators; now we need to extend that notion to square matrices. Suppose  is an -by- matrix with entries in . A number  is called an eigenvalue of  if there exists a nonzero -by- matrix  such that

For example,  is an eigenvalue of  because

(p. 194).

Suppose  and  is the matrix of  with respect to some basis of . Then the eigenvalues of  are the same as the eigenvalues of  (p. 194; notation converted).

Cayley-Hamilton Theorem: Suppose  is a real vector space and . Let  denote the characteristic polynomial of . Then  (p. 207; notation converted).

The Cayley-Hamilton theorem also holds on complex vector spaces generally (p. 173).

## 10. Trace and Determinant

The matrix of an operator  depends on a choice of basis of . Two different bases of  may give different matrices of  (p. 214; notation converted).

Intuitively, the determinant of an operator  is the change in volume  effects. The determinant is negative when the operator flips all the vectors "inverts the volume" it works on.

If  is a complex vector space, then  equals the product of the eigenvalues of , counting multiplicity... Recall that if  is a complex vector space, then there is a basis of  with respect to which  has an upper-triangular matrix... thus  equals the product of the diagonal entries of the matrix (p. 222; notation converted).

1. ^

Intuitively, a a homomorphism is a function showing how the operation of vector addition can be translated from one vector space into another and back.

More precisely, a homomorphism is a function (here, from a vector space  to a vector space ) such that

with  and .

The vector addition symbol  on the left side of the equality, inside the function, is defined in , and the addition symbol  on the right side of the equality, between the function values, is defined in .

2. ^

Vectors can be interpreted geometrically as rays from the origin out to points in a space. Vectors can also be understood algebraically as ordered sets of numbers (with each number representing a coordinate over in the ray interpretation).

As far as notation goes, we'll use variables with arrows  for vectors, lowercase variables  for numbers, and capital variables  for other larger mathematical structures, such as vector spaces.

3. ^

In this book, that field  will be either the reals  or the complexes .

4. ^

Take note of how homomorphism-ish the below distributive relationships are!

5. ^

Vectors are conventionally written vertically. But each vector  has a transpose , where the vector is written out horizontally instead.

So we'll use vector transposes to stay in line with conventional notation while not writing out those giant vertical vectors everywhere.

6. ^

One deep idea out of mathematics is that the dimensionality of a system is just the number of variables in that system that can vary independently of every other variable. You live in -dimensional space because you can vary your horizontal, vertical, and -dimensional position without necessarily changing your position in the other two spatial dimensions by doing so.

7. ^

Note that the set , where  is a vector containing only  any number  of times, satisfies the vector space axioms!

establishes closure under addition, existence of an additive identity, existence of an additive inverse for all vectors, additive commutativity, and additive associativity. Letting the field be the reals with

establishes closure under multiplication, multiplicative associativity, and the existence of a multiplicative identity. Finally,

establishes distributivity.

Any such vector space  has just one basis, . Intuitively, since you live at the origin, the origin is already spanned by no vectors at all -- i.e., the empty set of vectors. Any additional vector would be redundant, so no other sets constitute bases for .

8. ^

In math, the bigger and/or fancier the symbol, the bigger the set or class that symbol usually stands for.

9. ^

A vector  can stand for a polynomial by containing all the coefficients in the polynomial, coefficients ordered by the degree of each coefficient's monomial.

10. ^

This is addition of functions, , on the left side of the equation.  is the identity function.

11. ^

is the dimension of , formalized as the number of vectors in any basis of .

12. ^

Intuitively, orthonormal sets are nice sets of vectors like , where each vector has length one and is pointing out in a separate dimension.

More precisely, a set of vectors is called orthonormal when its elements are pairwise orthogonal and each vector has a norm of . We will especially care about orthonormal bases, like the set above with respect to .

13. ^

The adjoint of a linear map  is a linear map  such that the inner product of  and  equals the inner product of  and  for all  and .

Remember that inner products aren't generally commutative, so the order of arguments matters. Adjoints feel very anticommutative.

An operator  on an inner-product space  is called normal when

14. ^

An operator  is self-adjoint when .

15. ^

A block diagonal matrix is a square matrix of the form

where  are square matrices lying along the diagonal and all the other entries of the matrix equal  (p. 142).

16. ^

Suppose  is a complex vector space and . Let  denote the distinct eigenvalues of . Let  denote the multiplicity of  as an eigenvalue of . The polynomial

is called the characteristic polynomial of . Note that the degree of the characteristic polynomial of  equals … the roots of the characteristic polynomial of  equal the eigenvalues of  (p. 172; notation converted).

Characteristic polynomials can also be defined for real vector spaces, though the reals are a little less well behaved as vector spaces than the complexes.

Suppose  is a real vector space and . With respect to some basis of  has a block upper-triangular matrix [any entries acceptable above ] of the form

where each  is a -by- or a -by- matrix with no eigenvalues. We define the characteristic polynomial of  to be the product of the characteristic polynomials of . Explicitly, for each , define  by

Then the characteristic polynomial of  is

Clearly the characteristic polynomial of  has degree … The characteristic polynomial of  depends only on  and not on the choice of a particular basis (p. 206; notation converted).

comment by Erik Jenner (ejenner) · 2023-01-03T21:53:13.411Z · LW(p) · GW(p)

Then the eigenvectors of  consist precisely of the entries on the diagonal of that upper-triangular matrix

I think this is a typo and should be "eigenvalues" instead of "eigenvectors"?

The determinant is negative when the operator flips all the vectors it works on.

This could be misleading. E.g. the operator f(v) := -v that literally just flips all vectors has determinant (-1)^n, where n is the dimension of the space it's working on. The sign of the determinant tells you whether an operator flips the orientation of volumes, it can't tell you anything about what it does to individual vectors.

(Regarding "orientation of volumes": in the 2D case, think of R^2 as a sheet of paper, then f(v) := -v is just a 180 degree rotation, so the same side stays up, and the determinant is positive. In contrast, flipping along an axis requires turning over the paper, so negative determinant. Unfortunately this can't really be visualized the same way in 3D, so then you have to think about ordered bases.)

Replies from: David Udell
comment by David Udell · 2023-01-04T01:27:20.630Z · LW(p) · GW(p)

Thanks -- right on both counts! Post amended.

comment by Oliver Sourbut · 2023-01-05T11:17:41.199Z · LW(p) · GW(p)

This is great! I'll thread a few nits under this comment

Replies from: Oliver Sourbut, Oliver Sourbut
comment by Oliver Sourbut · 2023-01-05T11:22:21.218Z · LW(p) · GW(p)

We call a subspace  invariant under  if, for all ,

comment by Oliver Sourbut · 2023-01-05T11:19:14.631Z · LW(p) · GW(p)

Let  now specifically be a one-dimensional subspace of  such that, for all ,

I think such can not exist in most cases, and it should instead read '... for some ...'

The expression for is describing the span of the vector , so certainly if is more than one-dimensional, if some subspace has this property for all then it has this property for linearly independent vectors in , which is a contradiction.

comment by Gurkenglas · 2023-01-11T20:10:58.987Z · LW(p) · GW(p)

The definition of matrix ("the basis maps to:") ought to come after the "uniquely determines the linear map" that justifies it.

For interpreting v as a slim matrix, I would use bra-ket notation: |v> for the function of type V <- R, <v| for the function whose type is the dual R <- V. Then <v|v> has type R <- R (and corresponds to multiplication by a scalar) and |v><v| has type V <- V.

An inner product just maps |v> to <v|. (Though I don't quite see what the symmetry is for.)

Mapping a point cloud through a linear map thins it by a factor of the determinant; this generalizes to smooth maps, since they are locally linear.