Interpreting a matrix-valued word embedding with a mathematically proven characterization of all optima
post by Joseph Van Name (joseph-van-name) · 2023-09-04T16:19:24.401Z · LW · GW · 4 comments
In this post, I shall first describe a new word embedding algorithm that I came up with, called a matrix product optimized (MPO) word embedding, and I will prove a theorem that completely interprets this word embedding in the simplest case. While it is probably infeasible to characterize a word embedding with a mathematical proof when the corpus is one that we encounter in practice, this theorem is a signal that such a word embedding (or a similar word embedding) should be interpretable and mathematically well-behaved in other settings as well. The theorem also illustrates the way that MPO word embeddings should behave.
Unlike most word embedding algorithms, MPO word embeddings are matrix-valued: they map tokens to matrices instead of simply mapping tokens to vectors. In our case, the matrices are not necessarily real, as they may be complex or even quaternionic. MPO word embeddings also differ from other word embedding algorithms in that they are not constructed using neural networks, though we still use gradient ascent.
Why MPO word embeddings?
Since tokens often have many meanings depending on context, it seems better to represent a token in a form where it is easy or easier to separate the individual meanings of a token. While vectors may be good for representing individual meanings of tokens, it is better to represent a polysemantic token as a matrix instead of a vector. If someone were to give me a task of interpreting a word embedding, I would be much happier if the word embedding were a matrix-valued word embedding that neatly organized each of the meanings of a polysemantic token into a matrix than if the word embedding were a vector-valued word embedding where each of the individual meanings of the token were awkwardly smushed together in a vector.
Spaces of matrices have additional structure that is lacking in vector spaces, and one can use this additional structure to analyze or interpret our word embedding. This additional structure also means that matrix-valued word embeddings should behave more mathematically than vector-valued word embeddings.
MPO word embeddings also satisfy some interesting properties. Consider the fitness level of a random MPO word embedding obtained by training on a fixed data set, where the data set consists of the corpus that we are training our word embedding on along with a couple of hyperparameters. This fitness level, regarded as a random variable, will often have very low entropy or even zero entropy. Since the fitness level often has zero or low entropy, trained MPO word embeddings will not contain much (or any) random information that was not already present in the training data. It is easier to interpret machine learning models when the trained model depends only on the training data and hyperparameters and does not depend on random choices such as the initialization. Quaternionic and complex MPO word embeddings will often become real word embeddings after a change of basis. This means that MPO word embeddings behave quite mathematically, and it seems like machine learning models that behave in ways that mathematicians like will be more interpretable and understandable than other machine learning models.
Quaternionic matrices:
In this section, we shall go over the basics of quaternionic matrices. I assume that the reader is already familiar with ideas in linear algebra up through the singular value decomposition. Much of the basic theory of quaternionic matrices is a straightforward generalization of the theory of complex matrices. I hope you are also familiar with the quaternions, but if you are not, then I will remind you of the definition and basic facts about quaternions. We refer the reader to [1] for facts about quaternions and quaternionic matrices. You may skip this section if you only care about the real and complex cases.
If $A$ is a square real, complex, or quaternionic matrix, then the spectral radius of $A$ is
$$\rho(A) = \lim_{n \rightarrow \infty} \|A^n\|^{1/n},$$
where $\|\cdot\|$ is any matrix norm (since we are taking the limit, it does not matter which matrix norm we choose). If $A$ is a real or complex matrix, then
$$\rho(A) = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } A\}.$$
The same fact holds for quaternionic matrices, but we have to be careful about how we define the eigenvalues of quaternionic matrices due to non-commutativity.
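As a quick illustration, here is a minimal numerical sketch (Python with NumPy; the helper names are mine) checking that the limit formula and the eigenvalue characterization agree for a real or complex matrix:

```python
import numpy as np

def spectral_radius_via_eigenvalues(A):
    """Spectral radius as the largest modulus of an eigenvalue (real/complex case)."""
    return np.abs(np.linalg.eigvals(A)).max()

def spectral_radius_via_gelfand(A, n=100):
    """Approximate rho(A) = lim_n ||A^n||^(1/n); any matrix norm gives the same limit."""
    return np.linalg.norm(np.linalg.matrix_power(A, n), 2) ** (1.0 / n)

A = np.random.randn(5, 5)
print(spectral_radius_via_eigenvalues(A))
print(spectral_radius_via_gelfand(A))   # close to the value above for large n
```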
The division ring $\mathbb{H}$ of quaternions is the 4-dimensional associative but non-commutative algebra over the field of real numbers spanned by the elements $1, i, j, k$ (so $\mathbb{H} = \{a + bi + cj + dk : a, b, c, d \in \mathbb{R}\}$), where quaternionic multiplication is the unique associative (but non-commutative) bilinear operation such that $i^2 = j^2 = k^2 = ijk = -1$. We observe that if $x, y$ are two distinct elements of $\{i, j, k\}$, then $xy = -yx$ (for example, $ij = k = -ji$). It is easy to show that every non-zero quaternion has a multiplicative inverse, so $\mathbb{H}$ is indeed a division ring.
Recall that the conjugate and the absolute value of a quaternion are defined by
$$\overline{a + bi + cj + dk} = a - bi - cj - dk \quad\text{and}\quad |a + bi + cj + dk| = \sqrt{a^2 + b^2 + c^2 + d^2}$$ whenever $a, b, c, d \in \mathbb{R}$. If $x, y \in \mathbb{H}$, then $\overline{xy} = \overline{y}\,\overline{x}$ and $|xy| = |x|\,|y|$. Observe that the field of complex numbers $\mathbb{C}$ is a subring of $\mathbb{H}$.
If $A$ is a quaternionic matrix, then define the adjoint of $A$ to be $A^* = \overline{A}^T$. While the adjoint of a quaternionic matrix is well-behaved, the transpose of a quaternionic matrix is not very well behaved, since we typically have $(AB)^* = B^* A^*$ but $(AB)^T \neq B^T A^T$ for quaternionic matrices $A, B$.
We shall now associate $n \times n$ quaternionic matrices with $2n \times 2n$ complex matrices.
Suppose that $A, B$ are $n \times n$ complex matrices. Then the associated complex matrix of the quaternionic matrix $A + Bj$ is the complex matrix
$$\Gamma(A + Bj) = \begin{pmatrix} A & B \\ -\overline{B} & \overline{A} \end{pmatrix}.$$
We observe that $\Gamma(X + Y) = \Gamma(X) + \Gamma(Y)$, $\Gamma(XY) = \Gamma(X)\Gamma(Y)$, and $\Gamma(X^*) = \Gamma(X)^*$ whenever $X, Y$ are quaternionic matrices and the operations are defined.
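The following sketch (my own code, using the block-matrix convention above) stores a quaternionic matrix $A + Bj$ as a pair of complex matrices and checks that the associated complex matrix respects products:

```python
import numpy as np

def to_complex(A, B):
    """Complex 2n x 2n matrix associated with the quaternionic matrix A + B*j."""
    return np.block([[A, B], [-B.conj(), A.conj()]])

def quat_mult(X, Y):
    """Multiply quaternionic matrices given as (A, B) pairs, i.e. X = A1 + B1*j."""
    A1, B1 = X
    A2, B2 = Y
    # (A1 + B1 j)(A2 + B2 j) = (A1 A2 - B1 conj(B2)) + (A1 B2 + B1 conj(A2)) j
    return (A1 @ A2 - B1 @ B2.conj(), A1 @ B2 + B1 @ A2.conj())

def random_pair(n):
    return (np.random.randn(n, n) + 1j * np.random.randn(n, n),
            np.random.randn(n, n) + 1j * np.random.randn(n, n))

n = 3
X, Y = random_pair(n), random_pair(n)
lhs = to_complex(*quat_mult(X, Y))
rhs = to_complex(*X) @ to_complex(*Y)
print(np.allclose(lhs, rhs))  # True: the association is multiplicative
```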
An eigenvalue of a square quaternionic matrix $A$ is a quaternion $\lambda$ such that $Av = v\lambda$ for some non-zero quaternionic vector $v$ (so we are working with right eigenvalues).
Observation: If $\lambda$ is an eigenvalue of a quaternionic matrix $A$ corresponding to the eigenvector $v$ and $\mu$ is an eigenvalue of a quaternionic matrix $B$ corresponding to the same eigenvector $v$, then $ABv = A(v\mu) = (Av)\mu = v\lambda\mu$, so $v$ is an eigenvector of the quaternionic matrix $AB$ with eigenvalue $\lambda\mu$.
Observation: If $\lambda$ is an eigenvalue of a product $AB$ of quaternionic matrices $A$ and $B$ with eigenvector $v$, then $BA(Bv) = B(ABv) = (Bv)\lambda$, so if $\lambda \neq 0$, then $\lambda$ is also an eigenvalue of $BA$. In particular, $AB$ and $BA$ have the same non-zero eigenvalues.
Observation: If $A$ is a quaternionic matrix and $\lambda$ is an eigenvalue of $A$, then whenever $q$ is a non-zero quaternion, the value $q\lambda q^{-1}$ is also an eigenvalue of $A$.
Proof: If $\lambda$ is an eigenvalue of $A$, then there is some non-zero quaternionic vector $v$ with $Av = v\lambda$. Therefore, set $w = vq^{-1}$. Then $Aw = Avq^{-1} = v\lambda q^{-1} = vq^{-1}(q\lambda q^{-1}) = w(q\lambda q^{-1})$, so $q\lambda q^{-1}$ is an eigenvalue of the quaternionic matrix $A$.
Let $Sp(1)$ denote the group of all quaternions $q$ with $|q| = 1$. Let $SU(2)$ denote the group of all $2 \times 2$ complex unitary matrices with determinant $1$. Let $SO(3)$ denote the group of all $3 \times 3$ real orthogonal matrices with determinant $1$. Then the mapping $q \mapsto \Gamma(q)$ is an isomorphism from $Sp(1)$ to $SU(2)$. Let $V$ be the vector subspace of $\mathbb{H}$ consisting of all elements of the form $bi + cj + dk$ where $b, c, d$ are real numbers. Then $V$ is an inner product space where the inner product is the standard dot product operation on the coordinates $(b, c, d)$ (equivalently, $\langle x, y \rangle = \operatorname{Re}(x\overline{y})$, where $\operatorname{Re}(a + bi + cj + dk) = a$ for quaternions). Then we can identify $SO(3)$ with the orientation-preserving linear transformations of $V$ that preserve this inner product. We may now define a group homomorphism $\phi: Sp(1) \rightarrow SO(3)$ by letting $\phi(q)(v) = qvq^{-1}$. Then the homomorphism $\phi$ is a surjective homomorphism. In fact, $SO(3)$ is a connected Lie group with $\pi_1(SO(3)) \cong \mathbb{Z}/2\mathbb{Z}$, and the group $Sp(1) \cong SU(2)$ is simply connected; $SU(2)$ is the universal cover of the Lie group $SO(3)$. From these facts about quaternions, we may make a few observations:
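To make the homomorphism concrete, here is a small sketch (my own code; quaternions are stored as length-4 arrays) that sends a unit quaternion $q$ to the $3 \times 3$ rotation matrix of the map $v \mapsto qvq^{-1}$ on the span of $i, j, k$:

```python
import numpy as np

def qmul(x, y):
    """Multiply quaternions represented as arrays [a, b, c, d] = a + b i + c j + d k."""
    a1, b1, c1, d1 = x
    a2, b2, c2, d2 = y
    return np.array([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2])

def qconj(x):
    return np.array([x[0], -x[1], -x[2], -x[3]])

def rotation_from_unit_quaternion(q):
    """The 3x3 matrix of v -> q v q^(-1) restricted to the span of i, j, k."""
    cols = []
    for e in np.eye(4)[1:]:                  # the basis quaternions i, j, k
        img = qmul(qmul(q, e), qconj(q))     # q e q^(-1), since |q| = 1
        cols.append(img[1:])                 # the image lies in span(i, j, k)
    return np.column_stack(cols)

q = np.random.randn(4)
q /= np.linalg.norm(q)                       # a random unit quaternion
R = rotation_from_unit_quaternion(q)
print(np.allclose(R @ R.T, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))  # True True
```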
Observation: If $x, y$ are quaternions, then there is some non-zero quaternion $q$ with $y = qxq^{-1}$ if and only if $\operatorname{Re}(x) = \operatorname{Re}(y)$ and $|x| = |y|$.
Observation: Suppose that $A$ is a square quaternionic matrix and $\lambda, \mu$ are quaternions with $\operatorname{Re}(\lambda) = \operatorname{Re}(\mu)$ and $|\lambda| = |\mu|$. Then $\lambda$ is an eigenvalue of $A$ if and only if $\mu$ is an eigenvalue of $A$.
Observation: Let $A$ be a square quaternionic matrix. Suppose that $v$ is a quaternionic vector, and $v_1, v_2$ are complex vectors with $v = v_1 + v_2 j$. Suppose furthermore that $\lambda$ is a complex number. Then $Av = v\lambda$ if and only if $\Gamma(A)\begin{pmatrix} v_1 \\ -\overline{v_2} \end{pmatrix} = \begin{pmatrix} v_1 \\ -\overline{v_2} \end{pmatrix}\lambda$. In particular, a complex number $\lambda$ is an eigenvalue of $A$ if and only if $\lambda$ is an eigenvalue of the complex matrix $\Gamma(A)$.
Observation: Let $A$ be a square quaternionic matrix. Then the following quantities are equal:
- $\lim_{k \rightarrow \infty} \|A^k\|^{1/k}$ (for any matrix norm $\|\cdot\|$),
- $\max\{|\lambda| : \lambda \text{ is an eigenvalue of } A\}$,
- the spectral radius $\rho(\Gamma(A))$ of the complex matrix $\Gamma(A)$.
We shall therefore define the spectral radius $\rho(A)$ of a square quaternionic matrix $A$ to be $\lim_{k \rightarrow \infty} \|A^k\|^{1/k}$ (or, equivalently, any of the quantities in the above observation).
If $A$ is a quaternionic matrix, then we say that $A$ is Hermitian, normal, or unitary respectively if $\Gamma(A)$ is Hermitian, normal, or unitary. If $A$ is an $n \times n$ quaternionic matrix, then $A$ is Hermitian iff $A = A^*$, $A$ is normal iff $AA^* = A^*A$, and $A$ is unitary iff $AA^* = A^*A = I_n$.
If $u = (u_1, \dots, u_n)^T$ and $v = (v_1, \dots, v_n)^T$ are quaternionic column vectors, then define the quaternionic inner product of $u, v$ by $\langle u, v \rangle = u^* v = \sum_{k=1}^{n} \overline{u_k} v_k$. We observe that the quaternionic inner product is real bilinear and conjugate symmetric: $\langle v, u \rangle = \overline{\langle u, v \rangle}$. The quaternionic inner product preserves right scalar multiplication: $\langle u, v\alpha \rangle = \langle u, v \rangle \alpha$ for every quaternion $\alpha$. We define the real inner product of two quaternionic vectors $u, v$ by setting $\langle u, v \rangle_{\mathbb{R}} = \operatorname{Re}\langle u, v \rangle$. We may recover the quaternionic inner product from the real-valued inner product since $\langle u, v \rangle = \langle u, v \rangle_{\mathbb{R}} + \langle ui, v \rangle_{\mathbb{R}}\, i + \langle uj, v \rangle_{\mathbb{R}}\, j + \langle uk, v \rangle_{\mathbb{R}}\, k$ for quaternionic vectors $u, v$.
Observation: If $u, v$ are quaternionic vectors, then $\langle u, v \rangle_{\mathbb{R}} = \langle v, u \rangle_{\mathbb{R}}$ and $\langle ut, vt \rangle_{\mathbb{R}} = \langle u, v \rangle_{\mathbb{R}}$ for $t \in \{1, i, j, k\}$.
Observation: Suppose $u, v$ are quaternionic vectors. Then $|\langle u, v \rangle| \leq \|u\|\,\|v\|$, where $\|u\| = \langle u, u \rangle^{1/2}$.
Observation: Suppose $u, v$ are non-zero quaternionic vectors. If $|\langle u, v \rangle| = \|u\|\,\|v\|$, then $v = u\alpha$ for some quaternion $\alpha$.
Counterexample: In general, if $\alpha$ is a non-zero quaternion and $u$ is a quaternionic vector, then the left scalar multiple $\alpha u$ is not a right scalar multiple of $u$, and $|\langle u, \alpha u \rangle| \neq \|u\|\,\|\alpha u\|$ (for example, $u = (1, j)^T$ and $\alpha = i$ give $\langle u, \alpha u \rangle = 0$).
Observation: Let $A$ be an $m \times n$ quaternionic matrix, and let $B$ be an $n \times m$ quaternionic matrix. Then the following are equivalent:
- $\langle Au, v \rangle = \langle u, Bv \rangle$ for all quaternionic vectors $u, v$.
- $\langle Au, v \rangle_{\mathbb{R}} = \langle u, Bv \rangle_{\mathbb{R}}$ for all quaternionic vectors $u, v$.
Both conditions hold precisely when $B = A^*$.
Proposition: Let $A$ be an $m \times n$ quaternionic matrix. Then the following are equivalent:
1. $\|Av\| = \|v\|$ for all quaternionic vectors $v$.
2. $\langle Av, Aw \rangle = \langle v, w \rangle$ for all quaternionic vectors $v, w$.
3. $\langle Av, Aw \rangle_{\mathbb{R}} = \langle v, w \rangle_{\mathbb{R}}$ for all quaternionic vectors $v, w$.
4. $A^* A = I_n$.
5. The complex matrix $\Gamma(A)$ is an isometry.
If $A$ is a square quaternionic matrix, then the above statements are all equivalent to the following:
6. $A$ is unitary.
Theorem: (quaternionic singular value decomposition) If $A$ is an $m \times n$ quaternionic matrix, then there exist quaternionic unitary matrices $U, V$ with $A = U\Sigma V^*$, where $\Sigma$ is a diagonal matrix with non-negative, non-increasing real diagonal entries.
In the above theorem, the diagonal entries of $\Sigma$ are called the singular values of $A$, and they are denoted by $\sigma_k(A)$ where $1 \leq k \leq \min(m, n)$.
If $A$ is a real, complex, or quaternionic matrix, and $1 \leq p \leq \infty$, then define the Schatten $p$-norm of $A$ to be the $\ell^p$-norm of its sequence of singular values, where we define $\|A\|_p = \left(\sum_k \sigma_k(A)^p\right)^{1/p}$ for $1 \leq p < \infty$ and $\|A\|_\infty = \sigma_1(A)$.
Observation: If $A$ is a quaternionic matrix with singular values $\sigma_1 \geq \dots \geq \sigma_r$, then $\Gamma(A)$ has singular values $\sigma_1, \sigma_1, \sigma_2, \sigma_2, \dots, \sigma_r, \sigma_r$. In particular, $\|\Gamma(A)\|_\infty = \|A\|_\infty$ and $\|\Gamma(A)\|_p^p = 2\,\|A\|_p^p$, so $\|\Gamma(A)\|_p = 2^{1/p}\,\|A\|_p$ whenever $1 \leq p < \infty$.
Observation: If $A$ is a quaternionic matrix with entries $a_{i,j}$, then $\|A\|_2^2 = \sum_{i,j} |a_{i,j}|^2$.
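For concreteness, here is a small sketch (my own helper, complex case) of the Schatten $p$-norm computed from the singular values:

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm: the l^p norm of the vector of singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    if np.isinf(p):
        return s.max()                       # p = infinity: largest singular value
    return (s ** p).sum() ** (1.0 / p)

A = np.random.randn(4, 6) + 1j * np.random.randn(4, 6)
print(schatten_norm(A, 1))                   # trace (nuclear) norm
print(schatten_norm(A, 2))                   # Frobenius norm
print(np.isclose(schatten_norm(A, 2), np.linalg.norm(A)))  # True: equals sqrt(sum |a_ij|^2)
```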
If $A = (a_{i,j})$ is an $n \times n$ quaternionic matrix, then define the quaternionic trace of $A$ as
$$\operatorname{Tr}(A) = \sum_{k=1}^{n} a_{k,k}.$$ In general, $\operatorname{Tr}(AB) \neq \operatorname{Tr}(BA)$ for quaternionic matrices, so the notion of the quaternionic trace is deficient.
If $A$ is an $n \times n$ quaternionic matrix, then define the real trace of $A$ as $\operatorname{Tr}_{\mathbb{R}}(A) = \operatorname{Re}(\operatorname{Tr}(A))$. We observe that $\operatorname{Tr}_{\mathbb{R}}(A^*) = \operatorname{Tr}_{\mathbb{R}}(A)$ and $\operatorname{Tr}_{\mathbb{R}}(AB) = \operatorname{Tr}_{\mathbb{R}}(BA)$ whenever $A, B$ are quaternionic matrices for which these expressions are defined.
Observe that we can recover $\operatorname{Tr}$ from $\operatorname{Tr}_{\mathbb{R}}$ by using the formula $\operatorname{Tr}(A) = \operatorname{Tr}_{\mathbb{R}}(A) - \operatorname{Tr}_{\mathbb{R}}(Ai)\, i - \operatorname{Tr}_{\mathbb{R}}(Aj)\, j - \operatorname{Tr}_{\mathbb{R}}(Ak)\, k$.
If $A, B$ are $m \times n$ quaternionic matrices, then define the real-valued Frobenius inner product as $\langle A, B \rangle_{\mathbb{R}} = \operatorname{Tr}_{\mathbb{R}}(A^* B)$.
If $A = A_1 + A_2 i + A_3 j + A_4 k$ and $B = B_1 + B_2 i + B_3 j + B_4 k$ where every $A_r$ and every $B_r$ is a real matrix, then
$$\langle A, B \rangle_{\mathbb{R}} = \sum_{r=1}^{4} \operatorname{Tr}(A_r^T B_r).$$ Define $\|A\|_{\mathbb{R}} = \langle A, A \rangle_{\mathbb{R}}^{1/2}$.
We observe that if $U, V$ are unitary quaternionic matrices, then
$$\langle UAV, UBV \rangle_{\mathbb{R}} = \langle A, B \rangle_{\mathbb{R}}.$$
In particular, if $A = UDV^*$ where $U, V$ are unitary and $D$ is diagonal with diagonal entries $\sigma_1, \dots, \sigma_n$, then $\langle A, A \rangle_{\mathbb{R}} = \sigma_1^2 + \dots + \sigma_n^2 = \|A\|_2^2$.
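As a small sanity check of the unitary invariance and of the identity $\langle A, A \rangle_{\mathbb{R}} = \|A\|_2^2$, here is a sketch in the complex case (the quaternionic case is analogous; the helper names are mine):

```python
import numpy as np

def real_frobenius_inner_product(A, B):
    """<A, B>_R = Re Tr(A* B), the real-valued Frobenius inner product."""
    return np.trace(A.conj().T @ B).real

def random_unitary(n):
    # The QR decomposition of a random complex matrix yields a unitary factor.
    Q, _ = np.linalg.qr(np.random.randn(n, n) + 1j * np.random.randn(n, n))
    return Q

n = 4
A = np.random.randn(n, n) + 1j * np.random.randn(n, n)
B = np.random.randn(n, n) + 1j * np.random.randn(n, n)
U, V = random_unitary(n), random_unitary(n)

# Unitary invariance: <U A V, U B V>_R = <A, B>_R.
print(np.isclose(real_frobenius_inner_product(U @ A @ V, U @ B @ V),
                 real_frobenius_inner_product(A, B)))                 # True

# <A, A>_R equals the sum of the squared singular values of A.
print(np.isclose(real_frobenius_inner_product(A, A),
                 (np.linalg.svd(A, compute_uv=False) ** 2).sum()))    # True
```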
MPO word embeddings
Let $K$ denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $A$ be a finite set (in our case, $A$ will denote the set of all tokens). Let $S$ be the set of all functions $f: A \rightarrow M_d(K)$ that satisfy a suitable normalization constraint. Observe that $S$ is compact. Define the fitness of $f \in S$ for the string of tokens $a_1 \dots a_n$ to be $\rho(f(a_1) f(a_2) \cdots f(a_n))^{1/n}$. We say that $f$ is an MPO word pre-embedding for the string of tokens $a_1 \dots a_n$ if this fitness is locally maximized. To maximize the fitness, the function $f$ must simultaneously satisfy two properties. Because of the normalization constraint, the matrices $f(a)$ must be spread out throughout all $d$ dimensions, but we also need the matrices $f(a_s)$ to be compatible with $f(a_{s-1})$ and $f(a_{s+1})$, and more generally with $f(a_t)$ for $t$ near $s$.
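To make the optimization concrete, here is a rough sketch of one way to evaluate such a fitness numerically (real case). The normalization used below, rescaling so that the total squared Frobenius norm of the token matrices equals $d$, is only a stand-in assumption of mine for the constraint that defines $S$, and the helper names are hypothetical; the fitness itself follows the spectral-radius description above.

```python
import numpy as np

def normalize(f):
    """Rescale a dict of token matrices onto a stand-in normalization constraint.

    The constraint used here (total squared Frobenius norm equal to d) is an
    assumption made for illustration only."""
    total = sum(np.sum(M ** 2) for M in f.values())
    d = next(iter(f.values())).shape[0]
    return {a: np.sqrt(d / total) * M for a, M in f.items()}

def mpo_fitness(f, corpus):
    """Spectral-radius fitness of a candidate pre-embedding on a token string."""
    P = np.eye(next(iter(f.values())).shape[0])
    for a in corpus:
        P = P @ f[a]
    rho = np.abs(np.linalg.eigvals(P)).max()   # spectral radius of the matrix product
    return rho ** (1.0 / len(corpus))          # n-th root so corpus lengths are comparable

d = 4
corpus = list("abcabcab")                      # a toy "corpus" of tokens
f = normalize({a: np.random.randn(d, d) for a in set(corpus)})
print(mpo_fitness(f, corpus))
# A pre-embedding is an MPO word pre-embedding when this value is locally maximal
# over the constraint set; one would ascend it with (projected) gradient ascent.
```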
We say that a collection of vectors $u_1, \dots, u_n \in K^d$ is a tight frame if $\sum_{s=1}^{n} u_s u_s^* = c \cdot I_d$ for some necessarily positive constant $c$. We say that a tight frame $u_1, \dots, u_n$ is an equal norm tight frame if $\|u_1\| = \dots = \|u_n\|$.
Theorem: Suppose that $n \geq d$. Let $K$ denote either the field of real numbers or the field of complex numbers. Let $S_K$ denote the unit sphere in $K^d$. Then the local minimizers of the frame potential $\mathrm{FP}: S_K^n \rightarrow [0, \infty)$ defined by $\mathrm{FP}(u_1, \dots, u_n) = \sum_{s=1}^{n} \sum_{t=1}^{n} |\langle u_s, u_t \rangle|^2$ are global minimizers which are tight frames for $K^d$. In particular, there exists an equal norm tight frame of $n$ unit vectors in $K^d$ whenever $n \geq d$.
See [2, Thm 6.9] for a proof.
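For instance, here is a quick numerical sanity check (my own sketch) that $n$ equally spaced unit vectors on the circle form an equal norm tight frame for $\mathbb{R}^2$ and attain the value $n^2/d$ of the frame potential, which is its minimum over unit-norm frames:

```python
import numpy as np

def frame_potential(U):
    """FP(u_1, ..., u_n) = sum_{s,t} |<u_s, u_t>|^2 for the columns u_s of U."""
    G = U.T @ U                      # Gram matrix of the frame vectors
    return np.sum(np.abs(G) ** 2)

n, d = 7, 2
angles = 2 * np.pi * np.arange(n) / n
U = np.vstack([np.cos(angles), np.sin(angles)])    # n equally spaced unit vectors in R^2

print(np.allclose(U @ U.T, (n / d) * np.eye(d)))   # True: a tight frame
print(np.isclose(frame_potential(U), n ** 2 / d))  # True: attains the minimum n^2/d
```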
Theorem: Suppose that is the field of real numbers, complex numbers, or the division ring of quaternions. Let . Then and if and only if there is an equal norm tight frame with and with where for all (where addition in the subscripts is taken modulo ).
Proof: In order to not repeat ourselves, our proof shall apply to all three cases $K \in \{\mathbb{R}, \mathbb{C}, \mathbb{H}\}$, but for accessibility purposes, the proof shall be understandable to those who are only familiar with real and/or complex matrices.
Suppose that is an equal norm tight frame with , are elements with and for all . Since and , we have . Therefore,
Therefore, .
Now,
. Therefore, we have
.
Suppose now that . By the arithmetic-geometric mean inequality, we have
Suppose now that . We observe that precisely when . From the arithmetic-geometric mean inequality, we have precisely when . Therefore, each has rank at most and for , so we can set where and.
Observe that . Therefore,
, so
for all .
Therefore, there are quaternions with and where for all .
We observe that
.
Therefore, is an equal norm tight frame.
Some thoughts:
It seems easier to prove theorems about MPO word pre-embeddings than it is to prove theorems about other machine learning models simply because MPO word pre-embeddings are more mathematical in nature. Of course, neural networks with ReLU activation are mathematical too, so we can prove theorems about them, but MPO word pre-embeddings are more like the objects that mathematicians like to investigate. And it seems easier to mathematically prove theorems that interpret MPO word pre-embeddings than it is to prove theorems that interpret other machine learning models. On the other hand, we are making a tradeoff here. MPO word pre-embeddings behave more mathematically, but word embeddings are simply the first layer in natural language processing.
Why use the spectral radius?
The matrix $f(a_1) f(a_2) \cdots f(a_n)$ is in general approximately a rank-1 matrix. This means that $\|f(a_1) \cdots f(a_n)\| \approx |u^* f(a_1) \cdots f(a_n) v|$ whenever $u, v$ are suitably chosen unit vectors. One may therefore be tempted to define the fitness of $f$ as $\|f(a_1) \cdots f(a_n)\|^{1/n}$ or as $|u^* f(a_1) \cdots f(a_n) v|^{1/n}$. But mathematical objects are better behaved when taking limits. Instead of considering the string $a_1 \dots a_n$ as our corpus, we may consider the similar corpus obtained by repeating $a_1 \dots a_n$ a total of $m$ times, and in this case, the fitness of $f$ would be $\|(f(a_1) \cdots f(a_n))^m\|^{1/(nm)}$ or $|u^* (f(a_1) \cdots f(a_n))^m v|^{1/(nm)}$, which will both (for typical $u, v$) converge to $\rho(f(a_1) \cdots f(a_n))^{1/n}$ as $m \rightarrow \infty$. The use of the spectral radius therefore simplifies and improves the behavior of the machine learning model, for much the same reason that integrals often behave better than the corresponding sums.
Why complex numbers and quaternions?
While it takes more time to compute MPO word pre-embeddings over the complex numbers and over the quaternions, complex and quaternionic MPO word pre-embeddings have some advantages over real MPO word pre-embeddings. It is currently unclear whether these advantages outweigh the increased computational cost of training and using complex or quaternionic word pre-embeddings. Further research is needed on this topic.
Complex and quaternionic matrices provide new ways of testing MPO word pre-embeddings which are not available if we simply used real matrices. For example, if $f$ is a complex or quaternionic MPO word pre-embedding, then the existence of a unitary matrix $U$ for which every matrix $U f(a) U^{-1}$ is a real matrix should be considered evidence that the MPO word pre-embedding is a high quality machine learning model, while the non-existence of such a matrix $U$ should be considered evidence of a lower quality model.
A complex or quaternionic MPO word pre-embedding will typically have a higher fitness level than a corresponding real MPO word pre-embedding.
Sometimes, when one trains an MPO word pre-embedding twice to obtain two MPO word pre-embeddings $f$ and $g$, the fitness levels of $f$ and $g$ are equal, and it is desirable to obtain equal fitness levels when training the word pre-embedding multiple times. But the probability of obtaining this equality will depend on the choice of $K$. In many cases, the probability of obtaining equal fitness levels is higher for the quaternions than for the complex numbers, and higher for the complex numbers than for the real numbers. Of course, this probability depends on other factors besides the choice of division ring, such as the initialization. For example, if the matrices $f(a)$ are initialized to have random positive entries, then the probability of equal fitness levels is much greater than if the matrices $f(a)$ are initialized to have random real entries; if each $f(a)$ initially has random positive entries, then I would currently consider real-valued MPO pre-embeddings to be about as good as complex-valued MPO pre-embeddings but easier to compute.
The fitness function is not differentiable everywhere, and the gradient of the fitness has singularities. The real codimension of these singularities depends on the choice of $K$. In the case when $K = \mathbb{R}$, the singularities of the gradient of the fitness disconnect the domain $S$, and gradient ascent has difficulty crossing these singularities to reach a good local maximum.
Disadvantages of MPO word pre-embeddings:
- Projectivity: If $R$ is a ring, then the center of $R$ is the collection of all elements $z \in R$ where $zr = rz$ for all $r \in R$. It is easy to show that the center of $R$ is always a subring of $R$. We observe that the center of the matrix ring $M_d(K)$ consists of the scalar matrices $c \cdot I_d$ where $c$ belongs to the center of $K$. If $f: A \rightarrow M_d(K)$, the elements $c_a$ are suitably normalized non-zero central elements of $M_d(K)$ for $a \in A$, and $g(a) = c_a f(a)$ for all $a \in A$, then $f$ is an MPO word pre-embedding if and only if $g$ is an MPO word pre-embedding. This means that after one trains an MPO word pre-embedding $f$, one still needs to find the constants $c_a$ for which the function $g$ defined by $g(a) = c_a f(a)$ behaves optimally. I may talk about how we can find these constants in another post.
- Locality: An MPO word pre-embedding is good at relating tokens to their immediate surroundings in the following sense. Suppose that $f$ is an MPO word pre-embedding for the string of tokens $a_1 \dots a_n$. Then the value $f(a)$ will be determined mostly by the other values $f(b)$ for $b \in A$, together with the multiset of all short substrings of the corpus that contain an occurrence of the token $a$. In other words, $f(a)$ can see the immediate surroundings of the occurrences of $a$, but it will not be able to see much more than this.
- Difficulty utilizing all dimensions without noise: Sometimes MPO word pre-embeddings behave poorly because it is difficult to maximize the spectral radius while the normalization constraint that defines $S$ holds exactly. To ameliorate this problem, we can relax the constraint to a weaker, approximate version of it. But in this case, the MPO word pre-embedding will not use all of the dimensions of $K^d$ evenly.
Conclusion:
Spectral methods have already been quite valuable in machine learning for a good reason. I will continue to make more posts about using the spectral radius (and similar objects) to construct machine learning models.
Another possible advantage of quaternions (added 9/8/2023):
It seems like an increase in the value of $d$ provides diminishing returns in the performance of MPO word embeddings, since when $d$ is large, MPO word embeddings have difficulty utilizing all dimensions equally. MPO word embeddings therefore can only give us some information about a token. However, it seems like increasing the dimension of $K$ over the reals ($1$ for $\mathbb{R}$, $2$ for $\mathbb{C}$, $4$ for $\mathbb{H}$) increases the number of parameters of MPO word embeddings without the word embedding encountering substantially more difficulty in utilizing all dimensions. Therefore, MPO word embeddings should perform better simply by increasing this dimension. This means that complex MPO word embeddings should perform better than real MPO word embeddings, and quaternionic MPO word embeddings should perform better than complex MPO word embeddings. On the other hand, we still need to perform experiments to determine whether complex MPO word embeddings really are that much better than real MPO word embeddings and whether quaternionic MPO word embeddings are even better than both real and complex MPO word embeddings.
References:
1. Fuzhen Zhang. Quaternions and matrices of quaternions. Linear Algebra and its Applications, Volume 251, 15 January 1997, Pages 21-57.
2. Shayne F. D. Waldron. An Introduction to Finite Tight Frames. July 26, 2017.
4 comments
comment by the gears to ascension (lahwran) · 2024-01-05T22:57:17.304Z · LW(p) · GW(p)
this does look like it might be interesting, but I think you might need to show - possibly visually - why this works. what does an embedding of a 5-word dataset look like on a chart? how does one interpret it on that chart? why do these expressions map to that chart, etc? that would allow introducing the math you're using to anyone who doesn't know it or is rusty, while reducing effort to follow it for those who don't know it. If you're only intending to communicate to those who already get the prereqs, which is a thing people do, then, well, post more posts like this and I'm sure someone who has the particular math background (quaternions?) will run into them eventually. it looks like the math is all here, just too dense for my current level of motivation, so maybe you just need the right eyes.
I personally can't evaluate your idea within the time I would allot to reading this post because you use a lot of expressions I'm not immediately familiar with and I don't see a way to shortcut through them in order to draw the conclusions about improved interpretability you're implying. but it does seem conceivable that it could be pretty cool. I can imagine why having explicitly disambiguated word senses would be useful, if you could get your embedding to be sturdy about them.
↑ comment by Joseph Van Name (joseph-van-name) · 2024-01-07T14:58:16.780Z · LW(p) · GW(p)
I appreciate your input. I plan on making more posts like this one with a similar level of technical depth. Since I included a proof in this post, it contained a bit more mathematics than usual. With that being said, others have stated that I should be mindful of the mathematical prerequisites for posts like this, so I will keep those prerequisites in mind.
Here are some more technical thoughts about this.
- We would all agree that the problem of machine learning interpretability is a quite difficult problem; I believe that the solution to the interpretability problem requires us not only to use better interpretability tools, but also to make the machine learning models themselves more inherently interpretable. MPO word embeddings and similar constructions require a little bit (but not too much) of up-front effort since one needs to get used to different notions. For example, if we use neural networks with ReLU activation (or something like that), then one has less difficulty upfront, but when it comes time to interpret such a network, the difficulty increases since neural networks with ReLU activation do not seem to have the right interpretability properties, so I hesitate to interpret neural networks. And even if we do decide to interpret neural networks, the interpretability tools that we use may have a more complicated design than the networks themselves.
- There are some good reasons why complex numbers and quaternions have relatively little importance in machine learning. And these reasons do not apply to constructions like MPO word embeddings.
- Since equal norm tight frames are local minimizers of the frame potential, it would help to have a good understanding of the frame potential. For simplicity, it is a good idea to only look at the real case. The frame potential is a potential for a force between a collection of particles on the sphere where particles are repelled from each other (and from each other's antipodal points) and where the force tries to make all the particles orthogonal to each other. If $n \leq d$, then it is possible to make all of the particles orthogonal to each other, and in this case, when we minimize this potential, the equal norm tight frames will simply be orthonormal bases. In the case when $n > d$, we cannot make all of the particles orthogonal to each other, but we can try to get as close as possible. Observe that, unlike the Newtonian and logarithmic potentials, the frame potential does not have a singularity when two particles overlap. I will leave it to you to take the gradient (at least in the real case) of the frame potential to see exactly what this force does to the particles.
- Training an MPO word embedding with the complex numbers or quaternions is actually easier in the sense that for real MPO word embeddings, one needs to use a proper initialization, but with complex and quaternionic MPO word embeddings, an improper initialization will only result in minor deficiencies in the MPO word embedding. This means that the quaternions and complex numbers are easier to work with for MPO word embeddings than the real numbers. In hindsight, the solution to the problem with real MPO word embeddings is obvious, but at the time, I thought that I had to use complex or quaternionic matrices.
- I like the idea of making animations, but even in the real case where things are easy to visualize, the equal norm tight frames are non-unique and they may involve many dimensions. The non-uniqueness will make it impossible to interpret the equal norm tight frames; for the same reason, it is hard to interpret what is happening with neural networks since if you retrain a neural network with a different initialization or learning rate, you will end up with a different trained network, but MPO word embeddings have much more uniqueness properties that make them easier to interpret. I have made plenty of machine learning training animations and have posted these animations on YouTube and TikTok, but it seems like in most cases, the animation still needs to be accompanied by technical details; with just an animation, the viewers can see that something is happening with the machine learning model, but they need both the animation and technical details to interpret what exactly is happening. I am afraid that most viewers just stick with the animations without going into so many technical details. I therefore try to make the animations more satisfying than informative most of the time.
comment by Joseph Van Name (joseph-van-name) · 2023-09-04T18:04:07.402Z · LW(p) · GW(p)
Is the Latex compiling here?
↑ comment by Joseph Van Name (joseph-van-name) · 2023-09-05T15:08:05.382Z · LW(p) · GW(p)
I made the Latex compile by adding a space. Let me know if there are any problems.