Interpreting a matrix-valued word embedding with a mathematically proven characterization of all optima
post by Joseph Van Name (joseph-van-name) · 2023-09-04T16:19:24.401Z · LW · GW · 4 comments
In this post, I shall first describe a new word embedding algorithm that I came up with, called a matrix product optimized (MPO) word embedding, and I will prove a theorem that completely interprets this word embedding in the simplest case. While it is probably infeasible to characterize a word embedding with a mathematical proof when the corpus is one that we encounter in practice, this theorem is a signal that such a word embedding (or a similar word embedding) should be interpretable and mathematically well-behaved in other settings as well. The theorem also illustrates the way that MPO word embeddings should behave.
Unlike most word embedding algorithms, MPO word embeddings are matrix-valued: they map tokens to matrices instead of simply mapping tokens to vectors. In our case, the matrices are not necessarily real, as they may be complex or even quaternionic. MPO word embeddings also differ from other word embedding algorithms in that they are not constructed using neural networks, though we still use gradient ascent.
Why MPO word embeddings?
Since tokens often have many meanings depending on context, it seems better to represent a token in a form where it is easy or easier to separate the individual meanings of a token. While vectors may be good for representing individual meanings of tokens, it is better to represent a polysemantic token as a matrix instead of a vector. If someone were to give me a task of interpreting a word embedding, I would be much happier if the word embedding were a matrix-valued word embedding that neatly organized each of the meanings of a polysemantic token into a matrix than if the word embedding were a vector-valued word embedding where each of the individual meanings of the token were awkwardly smushed together in a vector.
Spaces of matrices have additional structure that is lacking in vector spaces, and one can use this additional structure to analyze or interpret our word embedding. This additional structure also means that matrix-valued word embeddings should behave more mathematically than vector-valued word embeddings.
MPO word embeddings also satisfy some interesting properties. Consider the fitness level of a random MPO word embedding obtained by training on a fixed data set, where the data set consists of the corpus that we are training our word embedding on along with a couple of hyperparameters. This fitness level, regarded as a random variable, will often have very low entropy or even zero entropy. Since the fitness level often has zero or low entropy, trained MPO word embeddings will not contain much (or any) random information that was not already present in the training data. It is easier to interpret machine learning models when the trained model depends only on the training data and hyperparameters and does not depend on random choices such as the initialization. Quaternionic and complex MPO word embeddings will often become real word embeddings after a change of basis. This means that MPO word embeddings behave quite mathematically, and it seems like machine learning models that behave in ways that mathematicians like will be more interpretable and understandable than other machine learning models.
Quaternionic matrices:
In this section, we shall go over the basics of quaternionic matrices. I assume that the reader is already familiar with ideas in linear algebra up through the singular value decomposition. Much of the basic theory of quaternionic matrices is a straightforward generalization of the theory of complex matrices. I hope you are also familiar with the quaternions, but if you are not, then I will remind you of the definition and basic facts about quaternions. We refer the reader to [1] for facts about quaternions and quaternionic matrices. You may skip this section if you only care about the real and complex cases.
If $A$ is a square real, complex, or quaternionic matrix, then the spectral radius of $A$ is
$$\rho(A) = \lim_{n \rightarrow \infty} \|A^n\|^{1/n},$$
where $\|\cdot\|$ is any matrix norm (since we are taking the limit, it does not matter which matrix norm we choose). If $A$ is a real or complex matrix, then
$$\rho(A) = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } A\}.$$
The same fact holds for quaternionic matrices, but we have to be careful about how we define the eigenvalues of quaternionic matrices due to non-commutativity.
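As a quick illustration, here is a minimal numerical sketch (Python with NumPy; the helper names are mine) checking that the limit formula and the eigenvalue characterization agree for a real or complex matrix:

```python
import numpy as np

def spectral_radius_via_eigenvalues(A):
    """Spectral radius as the largest modulus of an eigenvalue (real/complex case)."""
    return np.abs(np.linalg.eigvals(A)).max()

def spectral_radius_via_gelfand(A, n=100):
    """Approximate rho(A) = lim_n ||A^n||^(1/n); any matrix norm gives the same limit."""
    return np.linalg.norm(np.linalg.matrix_power(A, n), 2) ** (1.0 / n)

A = np.random.randn(5, 5)
print(spectral_radius_via_eigenvalues(A))
print(spectral_radius_via_gelfand(A))   # close to the value above for large n
```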
The division ring $\mathbb{H}$ of quaternions is the 4-dimensional associative but non-commutative algebra over the field of real numbers spanned by the elements $1, i, j, k$ (so $\mathbb{H} = \{a + bi + cj + dk : a, b, c, d \in \mathbb{R}\}$), where quaternionic multiplication is the unique associative (but non-commutative) bilinear operation such that $i^2 = j^2 = k^2 = ijk = -1$. We observe that if $x, y$ are two distinct elements of $\{i, j, k\}$, then $xy = -yx$ (for example, $ij = k = -ji$). It is easy to show that every non-zero quaternion has a multiplicative inverse, so $\mathbb{H}$ is indeed a division ring.
Recall that the conjugate and the absolute value of a quaternion are defined by
$$\overline{a + bi + cj + dk} = a - bi - cj - dk \quad\text{and}\quad |a + bi + cj + dk| = \sqrt{a^2 + b^2 + c^2 + d^2}$$ whenever $a, b, c, d \in \mathbb{R}$. If $x, y \in \mathbb{H}$, then $\overline{xy} = \overline{y}\,\overline{x}$ and $|xy| = |x|\,|y|$. Observe that the field of complex numbers $\mathbb{C}$ is a subring of $\mathbb{H}$.
If $A$ is a quaternionic matrix, then define the adjoint of $A$ to be $A^* = \overline{A}^T$. While the adjoint of a quaternionic matrix is well-behaved, the transpose of a quaternionic matrix is not very well behaved, since we typically have $(AB)^* = B^* A^*$ but $(AB)^T \neq B^T A^T$ for quaternionic matrices $A, B$.
We shall now associate $n \times n$ quaternionic matrices with $2n \times 2n$ complex matrices.
Suppose that $A, B$ are $n \times n$ complex matrices. Then the associated complex matrix of the quaternionic matrix $A + Bj$ is the complex matrix
$$\Gamma(A + Bj) = \begin{pmatrix} A & B \\ -\overline{B} & \overline{A} \end{pmatrix}.$$
We observe that $\Gamma(X + Y) = \Gamma(X) + \Gamma(Y)$, $\Gamma(XY) = \Gamma(X)\Gamma(Y)$, and $\Gamma(X^*) = \Gamma(X)^*$ whenever $X, Y$ are quaternionic matrices and the operations are defined.
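The following sketch (my own code, using the block-matrix convention above) stores a quaternionic matrix $A + Bj$ as a pair of complex matrices and checks that the associated complex matrix respects products:

```python
import numpy as np

def to_complex(A, B):
    """Complex 2n x 2n matrix associated with the quaternionic matrix A + B*j."""
    return np.block([[A, B], [-B.conj(), A.conj()]])

def quat_mult(X, Y):
    """Multiply quaternionic matrices given as (A, B) pairs, i.e. X = A1 + B1*j."""
    A1, B1 = X
    A2, B2 = Y
    # (A1 + B1 j)(A2 + B2 j) = (A1 A2 - B1 conj(B2)) + (A1 B2 + B1 conj(A2)) j
    return (A1 @ A2 - B1 @ B2.conj(), A1 @ B2 + B1 @ A2.conj())

def random_pair(n):
    return (np.random.randn(n, n) + 1j * np.random.randn(n, n),
            np.random.randn(n, n) + 1j * np.random.randn(n, n))

n = 3
X, Y = random_pair(n), random_pair(n)
lhs = to_complex(*quat_mult(X, Y))
rhs = to_complex(*X) @ to_complex(*Y)
print(np.allclose(lhs, rhs))  # True: the association is multiplicative
```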
An eigenvalue of a square quaternionic matrix $A$ is a quaternion $\lambda$ such that $Av = v\lambda$ for some non-zero quaternionic vector $v$ (so we are working with right eigenvalues).
Observation: If $\lambda$ is an eigenvalue of a quaternionic matrix $A$ corresponding to the eigenvector $v$ and $\mu$ is an eigenvalue of a quaternionic matrix $B$ corresponding to the same eigenvector $v$, then $ABv = A(v\mu) = (Av)\mu = v\lambda\mu$, so $v$ is an eigenvector of the quaternionic matrix $AB$ with eigenvalue $\lambda\mu$.
Observation: If $\lambda$ is an eigenvalue of a product $AB$ of quaternionic matrices $A$ and $B$ with eigenvector $v$, then $BA(Bv) = B(ABv) = (Bv)\lambda$, so if $\lambda \neq 0$, then $\lambda$ is also an eigenvalue of $BA$. In particular, $AB$ and $BA$ have the same non-zero eigenvalues.
Observation: If $A$ is a quaternionic matrix and $\lambda$ is an eigenvalue of $A$, then whenever $q$ is a non-zero quaternion, the value $q\lambda q^{-1}$ is also an eigenvalue of $A$.
Proof: If $\lambda$ is an eigenvalue of $A$, then there is some non-zero quaternionic vector $v$ with $Av = v\lambda$. Therefore, set $w = vq^{-1}$. Then $Aw = Avq^{-1} = v\lambda q^{-1} = vq^{-1}(q\lambda q^{-1}) = w(q\lambda q^{-1})$, so $q\lambda q^{-1}$ is an eigenvalue of the quaternionic matrix $A$.
Let $Sp(1)$ denote the group of all quaternions $q$ with $|q| = 1$. Let $SU(2)$ denote the group of all $2 \times 2$ complex unitary matrices with determinant $1$. Let $SO(3)$ denote the group of all $3 \times 3$ real orthogonal matrices with determinant $1$. Then the mapping $q \mapsto \Gamma(q)$ is an isomorphism from $Sp(1)$ to $SU(2)$. Let $V$ be the vector subspace of $\mathbb{H}$ consisting of all elements of the form $bi + cj + dk$ where $b, c, d$ are real numbers. Then $V$ is an inner product space where the inner product is the standard dot product operation on the coordinates $(b, c, d)$ (equivalently, $\langle x, y \rangle = \operatorname{Re}(x\overline{y})$, where $\operatorname{Re}(a + bi + cj + dk) = a$ for quaternions). Then we can identify $SO(3)$ with the orientation-preserving linear transformations of $V$ that preserve this inner product. We may now define a group homomorphism $\phi: Sp(1) \rightarrow SO(3)$ by letting $\phi(q)(v) = qvq^{-1}$. Then the homomorphism $\phi$ is a surjective homomorphism. In fact, $SO(3)$ is a connected Lie group with $\pi_1(SO(3)) \cong \mathbb{Z}/2\mathbb{Z}$, and the group $Sp(1) \cong SU(2)$ is simply connected; $SU(2)$ is the universal cover of the Lie group $SO(3)$. From these facts about quaternions, we may make a few observations:
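To make the homomorphism concrete, here is a small sketch (my own code; quaternions are stored as length-4 arrays) that sends a unit quaternion $q$ to the $3 \times 3$ rotation matrix of the map $v \mapsto qvq^{-1}$ on the span of $i, j, k$:

```python
import numpy as np

def qmul(x, y):
    """Multiply quaternions represented as arrays [a, b, c, d] = a + b i + c j + d k."""
    a1, b1, c1, d1 = x
    a2, b2, c2, d2 = y
    return np.array([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2])

def qconj(x):
    return np.array([x[0], -x[1], -x[2], -x[3]])

def rotation_from_unit_quaternion(q):
    """The 3x3 matrix of v -> q v q^(-1) restricted to the span of i, j, k."""
    cols = []
    for e in np.eye(4)[1:]:                  # the basis quaternions i, j, k
        img = qmul(qmul(q, e), qconj(q))     # q e q^(-1), since |q| = 1
        cols.append(img[1:])                 # the image lies in span(i, j, k)
    return np.column_stack(cols)

q = np.random.randn(4)
q /= np.linalg.norm(q)                       # a random unit quaternion
R = rotation_from_unit_quaternion(q)
print(np.allclose(R @ R.T, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))  # True True
```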
Observation: If $x, y$ are quaternions, then there is some non-zero quaternion $q$ with $y = qxq^{-1}$ if and only if $\operatorname{Re}(x) = \operatorname{Re}(y)$ and $|x| = |y|$.
Observation: Suppose that $A$ is a square quaternionic matrix and $\lambda, \mu$ are quaternions with $\operatorname{Re}(\lambda) = \operatorname{Re}(\mu)$ and $|\lambda| = |\mu|$. Then $\lambda$ is an eigenvalue of $A$ if and only if $\mu$ is an eigenvalue of $A$.
Observation: Let $A$ be a square quaternionic matrix. Suppose that $v$ is a quaternionic vector, and $v_1, v_2$ are complex vectors with $v = v_1 + v_2 j$. Suppose furthermore that $\lambda$ is a complex number. Then $Av = v\lambda$ if and only if $\Gamma(A)\begin{pmatrix} v_1 \\ -\overline{v_2} \end{pmatrix} = \begin{pmatrix} v_1 \\ -\overline{v_2} \end{pmatrix}\lambda$. In particular, a complex number $\lambda$ is an eigenvalue of $A$ if and only if $\lambda$ is an eigenvalue of the complex matrix $\Gamma(A)$.
Observation: Let $A$ be a square quaternionic matrix. Then the following quantities are equal:
- $\lim_{k \rightarrow \infty} \|A^k\|^{1/k}$ (for any matrix norm $\|\cdot\|$),
- $\max\{|\lambda| : \lambda \text{ is an eigenvalue of } A\}$,
- the spectral radius $\rho(\Gamma(A))$ of the complex matrix $\Gamma(A)$.
We shall therefore define the spectral radius $\rho(A)$ of a square quaternionic matrix $A$ to be $\lim_{k \rightarrow \infty} \|A^k\|^{1/k}$ (or, equivalently, any of the quantities in the above observation).
If $A$ is a quaternionic matrix, then we say that $A$ is Hermitian, normal, or unitary respectively if $\Gamma(A)$ is Hermitian, normal, or unitary. If $A$ is an $n \times n$ quaternionic matrix, then $A$ is Hermitian iff $A = A^*$, $A$ is normal iff $AA^* = A^*A$, and $A$ is unitary iff $AA^* = A^*A = I_n$.
If $u = (u_1, \dots, u_n)^T$ and $v = (v_1, \dots, v_n)^T$ are quaternionic column vectors, then define the quaternionic inner product of $u, v$ by $\langle u, v \rangle = u^* v = \sum_{k=1}^{n} \overline{u_k} v_k$. We observe that the quaternionic inner product is real bilinear and conjugate symmetric: $\langle v, u \rangle = \overline{\langle u, v \rangle}$. The quaternionic inner product preserves right scalar multiplication: $\langle u, v\alpha \rangle = \langle u, v \rangle \alpha$ for every quaternion $\alpha$. We define the real inner product of two quaternionic vectors $u, v$ by setting $\langle u, v \rangle_{\mathbb{R}} = \operatorname{Re}\langle u, v \rangle$. We may recover the quaternionic inner product from the real-valued inner product since $\langle u, v \rangle = \langle u, v \rangle_{\mathbb{R}} + \langle ui, v \rangle_{\mathbb{R}}\, i + \langle uj, v \rangle_{\mathbb{R}}\, j + \langle uk, v \rangle_{\mathbb{R}}\, k$ for quaternionic vectors $u, v$.
Observation: If $u, v$ are quaternionic vectors, then $\langle u, v \rangle_{\mathbb{R}} = \langle v, u \rangle_{\mathbb{R}}$ and $\langle ut, vt \rangle_{\mathbb{R}} = \langle u, v \rangle_{\mathbb{R}}$ for $t \in \{1, i, j, k\}$.
Observation: Suppose $u, v$ are quaternionic vectors. Then $|\langle u, v \rangle| \leq \|u\|\,\|v\|$, where $\|u\| = \langle u, u \rangle^{1/2}$.
Observation: Suppose $u, v$ are non-zero quaternionic vectors. If $|\langle u, v \rangle| = \|u\|\,\|v\|$, then $v = u\alpha$ for some quaternion $\alpha$.
Counterexample: In general, if $\alpha$ is a non-zero quaternion and $u$ is a quaternionic vector, then the left scalar multiple $\alpha u$ is not a right scalar multiple of $u$, and $|\langle u, \alpha u \rangle| \neq \|u\|\,\|\alpha u\|$ (for example, $u = (1, j)^T$ and $\alpha = i$ give $\langle u, \alpha u \rangle = 0$).
Observation: Let $A$ be an $m \times n$ quaternionic matrix, and let $B$ be an $n \times m$ quaternionic matrix. Then the following are equivalent:
- $\langle Au, v \rangle = \langle u, Bv \rangle$ for all quaternionic vectors $u, v$.
- $\langle Au, v \rangle_{\mathbb{R}} = \langle u, Bv \rangle_{\mathbb{R}}$ for all quaternionic vectors $u, v$.
Both conditions hold precisely when $B = A^*$.
Proposition: Let $A$ be an $m \times n$ quaternionic matrix. Then the following are equivalent:
1. $\|Av\| = \|v\|$ for all quaternionic vectors $v$.
2. $\langle Av, Aw \rangle = \langle v, w \rangle$ for all quaternionic vectors $v, w$.
3. $\langle Av, Aw \rangle_{\mathbb{R}} = \langle v, w \rangle_{\mathbb{R}}$ for all quaternionic vectors $v, w$.
4. $A^* A = I_n$.
5. The complex matrix $\Gamma(A)$ is an isometry.
If $A$ is a square quaternionic matrix, then the above statements are all equivalent to the following:
6. $A$ is unitary.
Theorem: (quaternionic singular value decomposition) If $A$ is an $m \times n$ quaternionic matrix, then there exist quaternionic unitary matrices $U, V$ with $A = U\Sigma V^*$, where $\Sigma$ is a diagonal matrix with non-negative, non-increasing real diagonal entries.
In the above theorem, the diagonal entries of $\Sigma$ are called the singular values of $A$, and they are denoted by $\sigma_k(A)$ where $1 \leq k \leq \min(m, n)$.
If $A$ is a real, complex, or quaternionic matrix, and $1 \leq p \leq \infty$, then define the Schatten $p$-norm of $A$ to be the $\ell^p$-norm of its sequence of singular values, where we define $\|A\|_p = \left(\sum_k \sigma_k(A)^p\right)^{1/p}$ for $1 \leq p < \infty$ and $\|A\|_\infty = \sigma_1(A)$.
Observation: If $A$ is a quaternionic matrix with singular values $\sigma_1 \geq \dots \geq \sigma_r$, then $\Gamma(A)$ has singular values $\sigma_1, \sigma_1, \sigma_2, \sigma_2, \dots, \sigma_r, \sigma_r$. In particular, $\|\Gamma(A)\|_\infty = \|A\|_\infty$ and $\|\Gamma(A)\|_p^p = 2\,\|A\|_p^p$, so $\|\Gamma(A)\|_p = 2^{1/p}\,\|A\|_p$ whenever $1 \leq p < \infty$.
Observation: If $A$ is a quaternionic matrix with entries $a_{i,j}$, then $\|A\|_2^2 = \sum_{i,j} |a_{i,j}|^2$.
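For concreteness, here is a small sketch (my own helper, complex case) of the Schatten $p$-norm computed from the singular values:

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm: the l^p norm of the vector of singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    if np.isinf(p):
        return s.max()                       # p = infinity: largest singular value
    return (s ** p).sum() ** (1.0 / p)

A = np.random.randn(4, 6) + 1j * np.random.randn(4, 6)
print(schatten_norm(A, 1))                   # trace (nuclear) norm
print(schatten_norm(A, 2))                   # Frobenius norm
print(np.isclose(schatten_norm(A, 2), np.linalg.norm(A)))  # True: equals sqrt(sum |a_ij|^2)
```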
If $A = (a_{i,j})$ is an $n \times n$ quaternionic matrix, then define the quaternionic trace of $A$ as
$$\operatorname{Tr}(A) = \sum_{k=1}^{n} a_{k,k}.$$ In general, $\operatorname{Tr}(AB) \neq \operatorname{Tr}(BA)$ for quaternionic matrices, so the notion of the quaternionic trace is deficient.
If $A$ is an $n \times n$ quaternionic matrix, then define the real trace of $A$ as $\operatorname{Tr}_{\mathbb{R}}(A) = \operatorname{Re}(\operatorname{Tr}(A))$. We observe that $\operatorname{Tr}_{\mathbb{R}}(A^*) = \operatorname{Tr}_{\mathbb{R}}(A)$ and $\operatorname{Tr}_{\mathbb{R}}(AB) = \operatorname{Tr}_{\mathbb{R}}(BA)$ whenever $A, B$ are quaternionic matrices for which these expressions are defined.
Observe that we can recover $\operatorname{Tr}$ from $\operatorname{Tr}_{\mathbb{R}}$ by using the formula $\operatorname{Tr}(A) = \operatorname{Tr}_{\mathbb{R}}(A) - \operatorname{Tr}_{\mathbb{R}}(Ai)\, i - \operatorname{Tr}_{\mathbb{R}}(Aj)\, j - \operatorname{Tr}_{\mathbb{R}}(Ak)\, k$.
If $A, B$ are $m \times n$ quaternionic matrices, then define the real-valued Frobenius inner product as $\langle A, B \rangle_{\mathbb{R}} = \operatorname{Tr}_{\mathbb{R}}(A^* B)$.
If $A = A_1 + A_2 i + A_3 j + A_4 k$ and $B = B_1 + B_2 i + B_3 j + B_4 k$ where every $A_r$ and every $B_r$ is a real matrix, then
$$\langle A, B \rangle_{\mathbb{R}} = \sum_{r=1}^{4} \operatorname{Tr}(A_r^T B_r).$$ Define $\|A\|_{\mathbb{R}} = \langle A, A \rangle_{\mathbb{R}}^{1/2}$.
We observe that if $U, V$ are unitary quaternionic matrices, then
$$\langle UAV, UBV \rangle_{\mathbb{R}} = \langle A, B \rangle_{\mathbb{R}}.$$
In particular, if $A = UDV^*$ where $U, V$ are unitary and $D$ is diagonal with diagonal entries $\sigma_1, \dots, \sigma_n$, then $\langle A, A \rangle_{\mathbb{R}} = \sigma_1^2 + \dots + \sigma_n^2 = \|A\|_2^2$.
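As a small sanity check of the unitary invariance and of the identity $\langle A, A \rangle_{\mathbb{R}} = \|A\|_2^2$, here is a sketch in the complex case (the quaternionic case is analogous; the helper names are mine):

```python
import numpy as np

def real_frobenius_inner_product(A, B):
    """<A, B>_R = Re Tr(A* B), the real-valued Frobenius inner product."""
    return np.trace(A.conj().T @ B).real

def random_unitary(n):
    # The QR decomposition of a random complex matrix yields a unitary factor.
    Q, _ = np.linalg.qr(np.random.randn(n, n) + 1j * np.random.randn(n, n))
    return Q

n = 4
A = np.random.randn(n, n) + 1j * np.random.randn(n, n)
B = np.random.randn(n, n) + 1j * np.random.randn(n, n)
U, V = random_unitary(n), random_unitary(n)

# Unitary invariance: <U A V, U B V>_R = <A, B>_R.
print(np.isclose(real_frobenius_inner_product(U @ A @ V, U @ B @ V),
                 real_frobenius_inner_product(A, B)))                 # True

# <A, A>_R equals the sum of the squared singular values of A.
print(np.isclose(real_frobenius_inner_product(A, A),
                 (np.linalg.svd(A, compute_uv=False) ** 2).sum()))    # True
```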
MPO word embeddings
Let $K$ denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $A$ be a finite set (in our case, $A$ will denote the set of all tokens). Let $S$ be the set of all functions $f: A \rightarrow M_d(K)$ that satisfy a suitable normalization constraint. Observe that $S$ is compact. Define the fitness of $f \in S$ for the string of tokens $a_1 \dots a_n$ to be $\rho(f(a_1) f(a_2) \cdots f(a_n))^{1/n}$. We say that $f$ is an MPO word pre-embedding for the string of tokens $a_1 \dots a_n$ if this fitness is locally maximized. To maximize the fitness, the function $f$ must simultaneously satisfy two properties. Because of the normalization constraint, the matrices $f(a)$ must be spread out throughout all $d$ dimensions, but we also need the matrices $f(a_s)$ to be compatible with $f(a_{s-1})$ and $f(a_{s+1})$, and more generally with $f(a_t)$ for $t$ near $s$.
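To make the optimization concrete, here is a rough sketch of one way to evaluate such a fitness numerically (real case). The normalization used below, rescaling so that the total squared Frobenius norm of the token matrices equals $d$, is only a stand-in assumption of mine for the constraint that defines $S$, and the helper names are hypothetical; the fitness itself follows the spectral-radius description above.

```python
import numpy as np

def normalize(f):
    """Rescale a dict of token matrices onto a stand-in normalization constraint.

    The constraint used here (total squared Frobenius norm equal to d) is an
    assumption made for illustration only."""
    total = sum(np.sum(M ** 2) for M in f.values())
    d = next(iter(f.values())).shape[0]
    return {a: np.sqrt(d / total) * M for a, M in f.items()}

def mpo_fitness(f, corpus):
    """Spectral-radius fitness of a candidate pre-embedding on a token string."""
    P = np.eye(next(iter(f.values())).shape[0])
    for a in corpus:
        P = P @ f[a]
    rho = np.abs(np.linalg.eigvals(P)).max()   # spectral radius of the matrix product
    return rho ** (1.0 / len(corpus))          # n-th root so corpus lengths are comparable

d = 4
corpus = list("abcabcab")                      # a toy "corpus" of tokens
f = normalize({a: np.random.randn(d, d) for a in set(corpus)})
print(mpo_fitness(f, corpus))
# A pre-embedding is an MPO word pre-embedding when this value is locally maximal
# over the constraint set; one would ascend it with (projected) gradient ascent.
```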
We say that a collection of vectors $u_1, \dots, u_n \in K^d$ is a tight frame if $\sum_{s=1}^{n} u_s u_s^* = c \cdot I_d$ for some necessarily positive constant $c$. We say that a tight frame $u_1, \dots, u_n$ is an equal norm tight frame if $\|u_1\| = \dots = \|u_n\|$.
Theorem: Suppose that $n \geq d$. Let $K$ denote either the field of real numbers or the field of complex numbers. Let $S_K$ denote the unit sphere in $K^d$. Then the local minimizers of the frame potential $\mathrm{FP}: S_K^n \rightarrow [0, \infty)$ defined by $\mathrm{FP}(u_1, \dots, u_n) = \sum_{s=1}^{n} \sum_{t=1}^{n} |\langle u_s, u_t \rangle|^2$ are global minimizers which are tight frames for $K^d$. In particular, there exists an equal norm tight frame of $n$ unit vectors in $K^d$ whenever $n \geq d$.
See [2, Thm 6.9] for a proof.
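For instance, here is a quick numerical sanity check (my own sketch) that $n$ equally spaced unit vectors on the circle form an equal norm tight frame for $\mathbb{R}^2$ and attain the value $n^2/d$ of the frame potential, which is its minimum over unit-norm frames:

```python
import numpy as np

def frame_potential(U):
    """FP(u_1, ..., u_n) = sum_{s,t} |<u_s, u_t>|^2 for the columns u_s of U."""
    G = U.T @ U                      # Gram matrix of the frame vectors
    return np.sum(np.abs(G) ** 2)

n, d = 7, 2
angles = 2 * np.pi * np.arange(n) / n
U = np.vstack([np.cos(angles), np.sin(angles)])    # n equally spaced unit vectors in R^2

print(np.allclose(U @ U.T, (n / d) * np.eye(d)))   # True: a tight frame
print(np.isclose(frame_potential(U), n ** 2 / d))  # True: attains the minimum n^2/d
```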
Theorem: Suppose that is the field of real numbers, complex numbers, or the division ring of quaternions. Let . Then and if and only if there is an equal norm tight frame with and with where for all (where addition in the subscripts is taken modulo ).
Proof: In order to not repeat ourselves, our proof shall apply to all three cases $K \in \{\mathbb{R}, \mathbb{C}, \mathbb{H}\}$, but for accessibility purposes, the proof shall be understandable to those who are only familiar with real and/or complex matrices.
Suppose that is an equal norm tight frame with , are elements with and for all . Since and , we have . Therefore,
Therefore, .
Now,
. Therefore, we have
.
Suppose now that . By the arithmetic-geometric mean inequality, we have
Suppose now that . We observe that precisely when . From the arithmetic-geometric mean inequality, we have precisely when . Therefore, each has rank at most and for , so we can set where and.
Observe that . Therefore,
, so
for all .
Therefore, there are quaternions with and where for all .
We observe that
.
Therefore, is an equal norm tight frame.
Some thoughts:
It seems easier to prove theorems about MPO word pre-embeddings than it is to prove theorems about other machine learning models simply because MPO word pre-embeddings are more mathematical in nature. Of course, neural networks with ReLU activation are mathematical too, so we can prove theorems about them, but MPO word pre-embeddings are more like the objects that mathematicians like to investigate. And it seems easier to mathematically prove theorems that interpret MPO word pre-embeddings than it is to prove theorems that interpret other machine learning models. On the other hand, we are making a tradeoff here. MPO word pre-embeddings behave more mathematically, but word embeddings are simply the first layer in natural language processing.
Why use the spectral radius?
The matrix $f(a_1) f(a_2) \cdots f(a_n)$ is in general approximately a rank-1 matrix. This means that $\|f(a_1) \cdots f(a_n)\| \approx |u^* f(a_1) \cdots f(a_n) v|$ whenever $u, v$ are suitably chosen unit vectors. One may therefore be tempted to define the fitness of $f$ as $\|f(a_1) \cdots f(a_n)\|^{1/n}$ or as $|u^* f(a_1) \cdots f(a_n) v|^{1/n}$. But mathematical objects are better behaved when taking limits. Instead of considering the string $a_1 \dots a_n$ as our corpus, we may consider the similar corpus obtained by repeating $a_1 \dots a_n$ a total of $m$ times, and in this case, the fitness of $f$ would be $\|(f(a_1) \cdots f(a_n))^m\|^{1/(nm)}$ or $|u^* (f(a_1) \cdots f(a_n))^m v|^{1/(nm)}$, which will both (for typical $u, v$) converge to $\rho(f(a_1) \cdots f(a_n))^{1/n}$ as $m \rightarrow \infty$. The use of the spectral radius therefore simplifies and improves the behavior of the machine learning model, for much the same reason that integrals often behave better than the corresponding sums.
Why complex numbers and quaternions?
While it takes more time to compute MPO word pre-embeddings over the complex numbers and over the quaternions, complex and quaternionic MPO word pre-embeddings have some advantages over real MPO word pre-embeddings. It is currently unclear whether these advantages outweigh the increased computational cost of training and using complex or quaternionic word pre-embeddings. Further research is needed on this topic.
Complex and quaternionic matrices provide new ways of testing MPO word pre-embeddings which are not available if we simply used real matrices. For example, if $f$ is a complex or quaternionic MPO word pre-embedding, then the existence of a unitary matrix $U$ for which every matrix $U f(a) U^{-1}$ is a real matrix should be considered evidence that the MPO word pre-embedding is a high quality machine learning model, while the non-existence of such a matrix $U$ should be considered evidence of a lower quality model.
A complex or quaternionic MPO word pre-embedding will typically have a higher fitness level than a corresponding real MPO word pre-embedding.
Sometimes, when one trains an MPO word pre-embedding twice to obtain two MPO word pre-embeddings $f$ and $g$, the fitness levels of $f$ and $g$ are equal, and it is desirable to obtain equal fitness levels when training the word pre-embedding multiple times. But the probability of obtaining this equality will depend on the choice of $K$. In many cases, the probability of obtaining equal fitness levels is higher for the quaternions than for the complex numbers, and higher for the complex numbers than for the real numbers. Of course, this probability depends on other factors besides the choice of division ring, such as the initialization. For example, if the matrices $f(a)$ are initialized to have random positive entries, then the probability of equal fitness levels is much greater than if the matrices $f(a)$ are initialized to have random real entries; if each $f(a)$ initially has random positive entries, then I would currently consider real-valued MPO pre-embeddings to be about as good as complex-valued MPO pre-embeddings but easier to compute.
The fitness function is not differentiable everywhere, and the gradient of the fitness has singularities. The real codimension of these singularities depends on the choice of $K$. In the case when $K = \mathbb{R}$, the singularities of the gradient of the fitness disconnect the domain $S$, and gradient ascent has difficulty crossing these singularities to reach a good local maximum.
Disadvantages of MPO word pre-embeddings:
- Projectivity: If $R$ is a ring, then the center of $R$ is the collection of all elements $z \in R$ where $zr = rz$ for all $r \in R$. It is easy to show that the center of $R$ is always a subring of $R$. We observe that the center of the matrix ring $M_d(K)$ consists of the scalar matrices $c \cdot I_d$ where $c$ belongs to the center of $K$. If $f: A \rightarrow M_d(K)$, the elements $c_a$ are suitably normalized non-zero central elements of $M_d(K)$ for $a \in A$, and $g(a) = c_a f(a)$ for all $a \in A$, then $f$ is an MPO word pre-embedding if and only if $g$ is an MPO word pre-embedding. This means that after one trains an MPO word pre-embedding $f$, one still needs to find the constants $c_a$ for which the function $g$ defined by $g(a) = c_a f(a)$ behaves optimally. I may talk about how we can find these constants in another post.
- Locality: An MPO word pre-embedding is good at relating tokens to their immediate surroundings in the following sense. Suppose that $f$ is an MPO word pre-embedding for the string of tokens $a_1 \dots a_n$. Then the value $f(a)$ will be determined mostly by the other values $f(b)$ for $b \in A$, together with the multiset of all short substrings of the corpus that contain an occurrence of the token $a$. In other words, $f(a)$ can see the immediate surroundings of the occurrences of $a$, but it will not be able to see much more than this.
- Difficulty utilizing all dimensions without noise: Sometimes MPO word pre-embeddings behave poorly because it is difficult to maximize the spectral radius while the normalization constraint that defines $S$ holds exactly. To ameliorate this problem, we can relax the constraint to a weaker, approximate version of it. But in this case, the MPO word pre-embedding will not use all of the dimensions of $K^d$ evenly.
Conclusion:
Spectral methods have already been quite valuable in machine learning for a good reason. I will continue to make more posts about using the spectral radius (and similar objects) to construct machine learning models.
Another possible advantage of quaternions (added 9/8/2023):
It seems like an increase in the value of $d$ provides diminishing returns in the performance of MPO word embeddings, since when $d$ is large, MPO word embeddings have difficulty utilizing all dimensions equally. MPO word embeddings therefore can only give us some information about a token. However, it seems like increasing the dimension of $K$ over the reals ($1$ for $\mathbb{R}$, $2$ for $\mathbb{C}$, $4$ for $\mathbb{H}$) increases the number of parameters of MPO word embeddings without the word embedding encountering substantially more difficulty in utilizing all dimensions. Therefore, MPO word embeddings should perform better simply by increasing this dimension. This means that complex MPO word embeddings should perform better than real MPO word embeddings, and quaternionic MPO word embeddings should perform better than complex MPO word embeddings. On the other hand, we still need to perform experiments to determine whether complex MPO word embeddings really are that much better than real MPO word embeddings and whether quaternionic MPO word embeddings are even better than both real and complex MPO word embeddings.
References:
1. Fuzhen Zhang. Quaternions and matrices of quaternions. Linear Algebra and its Applications, Volume 251, 15 January 1997, Pages 21-57.
2. Shayne F. D. Waldron. An Introduction to Finite Tight Frames. July 26, 2017.
4 comments
comment by the gears to ascension (lahwran) · 2024-01-05T22:57:17.304Z · LW(p) · GW(p)
this does look like it might be interesting, but I think you might need to show - possibly visually - why this works. what does an embedding of a 5-word dataset look like on a chart? how does one interpret it on that chart? why do these expressions map to that chart, etc? that would allow introducing the math you're using to anyone who doesn't know it or is rusty, while reducing effort to follow it for those who don't know it. If you're only intending to communicate to those who already get the prereqs, which is a thing people do, then, well, post more posts like this and I'm sure someone who has the particular math background (quaternions?) will run into them eventually. it looks like the math is all here, just too dense for my current level of motivation, so maybe you just need the right eyes.
I personally can't evaluate your idea within the time I would allot to reading this post because you use a lot of expressions I'm not immediately familiar with and I don't see a way to shortcut through them in order to draw the conclusions about improved interpretability you're implying. but it does seem conceivable that it could be pretty cool. I can imagine why having explicitly disambiguated word senses would be useful, if you could get your embedding to be sturdy about them.
↑ comment by Joseph Van Name (joseph-van-name) · 2024-01-07T14:58:16.780Z · LW(p) · GW(p)
I appreciate your input. I plan on making more posts like this one with a similar level of technical depth. Since I included a proof in this post, it contained a bit more mathematics than usual. With that being said, others have stated that I should be mindful of the mathematical prerequisites for posts like this, so I will keep those prerequisites in mind.
Here are some more technical thoughts about this.
- We would all agree that the problem of machine learning interpretability is a quite difficult problem; I believe that the solution to the interpretability problem requires us not only to use better interpretability tools, but also to make the machine learning models themselves more inherently interpretable. MPO word embeddings and similar constructions require a little bit (but not too much) of up-front effort since one needs to get used to different notions. For example, if we use neural networks with ReLU activation (or something like that), then one has less difficulty upfront, but when it comes time to interpret such a network, the difficulty increases since neural networks with ReLU activation do not seem to have the right interpretability properties, so I hesitate to interpret neural networks. And even if we do decide to interpret neural networks, the interpretability tools that we use may have a more complicated design than the networks themselves.
- There are some good reasons why complex numbers and quaternions have relatively little importance in machine learning. And these reasons do not apply to constructions like MPO word embeddings.
- Since equal norm tight frames are local minimizers of the frame potential, it would help to have a good understanding of the frame potential. For simplicity, it is a good idea to only look at the real case. The frame potential is a potential for a force between a collection of particles on the sphere where particles are repelled from each other (and from each other's antipodal points) and where the force tries to make all the particles orthogonal to each other. If $n \leq d$, then it is possible to make all of the particles orthogonal to each other, and in this case, when we minimize this potential, the equal norm tight frames will simply be orthonormal bases. In the case when $n > d$, we cannot make all of the particles orthogonal to each other, but we can try to get as close as possible. Observe that, unlike the Newtonian and logarithmic potentials, the frame potential does not have a singularity when two particles overlap. I will leave it to you to take the gradient (at least in the real case) of the frame potential to see exactly what this force does to the particles.
- Training an MPO word embedding with the complex numbers or quaternions is actually easier in the sense that for real MPO word embeddings, one needs to use a proper initialization, but with complex and quaternionic MPO word embeddings, an improper initialization will only result in minor deficiencies in the MPO word embedding. This means that the quaternions and complex numbers are easier to work with for MPO word embeddings than the real numbers. In hindsight, the solution to the problem with real MPO word embeddings is obvious, but at the time, I thought that I had to use complex or quaternionic matrices.
- I like the idea of making animations, but even in the real case where things are easy to visualize, the equal norm tight frames are non-unique and they may involve many dimensions. The non-uniqueness will make it impossible to interpret the equal norm tight frames; for the same reason, it is hard to interpret what is happening with neural networks since if you retrain a neural network with a different initialization or learning rate, you will end up with a different trained network, but MPO word embeddings have much more uniqueness properties that make them easier to interpret. I have made plenty of machine learning training animations and have posted these animations on YouTube and TikTok, but it seems like in most cases, the animation still needs to be accompanied by technical details; with just an animation, the viewers can see that something is happening with the machine learning model, but they need both the animation and technical details to interpret what exactly is happening. I am afraid that most viewers just stick with the animations without going into so many technical details. I therefore try to make the animations more satisfying than informative most of the time.
comment by Joseph Van Name (joseph-van-name) · 2023-09-04T18:04:07.402Z · LW(p) · GW(p)
Is the Latex compiling here?
↑ comment by Joseph Van Name (joseph-van-name) · 2023-09-05T15:08:05.382Z · LW(p) · GW(p)
I made the Latex compile by adding a space. Let me know if there are any problems.