The Additive Summary Equation
post by johnswentworth · 2021-07-13
This post contains some theorems and proofs needed for a hopefully-upcoming post on some powerful generalizations of the Koopman-Pitman-Darmois (KPD) Theorem. Unless you find functional equations interesting in their own right, and want to read some pretty dense math, you should probably skip this post. The theorems are pretty self-contained, and will be summarized in any future posts which need them.
The Summary Equation
We can represent the idea of a $D$-dimensional "summary" of $x$ for a function $f$ via a functional equation:

$$f(x) = \bar{f}(g(x))$$
Given the function $f$, we try to find some $D$-dimensional "summary" $g$ such that $f(x)$ can be computed from $g(x)$ - i.e. we want some $\bar{f}$ such that $f(x) = \bar{f}(g(x))$ for all $x$.
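To make the setup concrete, here's a minimal numerical sketch (the specific functions are mine, purely for illustration): a vector-valued $f$ which only "sees" $x$ through the one-dimensional summary $g(x) = \sum_i x_i$.

```python
import numpy as np

# Toy instance of the summary equation f(x) = f_bar(g(x)).
# f: R^10 -> R^3 depends on x only through the 1-dimensional
# summary g(x) = sum(x). All function names here are illustrative.

def g(x):
    return np.sum(x)                 # 1-dimensional summary

def f_bar(s):
    return np.array([np.sin(s), s**2, np.exp(-s**2)])

def f(x):
    return f_bar(g(x))               # f factors through g by construction

x = np.random.randn(10)
assert np.allclose(f(x), f_bar(g(x)))
```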
In order for this to be meaningful, we need some mild assumptions on $f$, $\bar{f}$, and $g$; at the very least, we certainly need to exclude space-filling curves, which would defeat the point of a "$D$-dimensional summary". Throughout this post, we'll assume differentiability, although this should be easy to relax somewhat by taking limits of differentiable functions.
Easy theorem: The $D$-dimensional summary equation for $f$ is solvable only if the rank of the matrix $\partial_x f$ is at most $D$ for all values of $x$. I'll call this the "Summarizability Theorem". (If you want a more official-sounding name, it's the global converse of the constant-rank theorem.)
Proof: differentiate both sides of the equation to get $\partial_x f = \partial_g \bar{f} \cdot \partial_x g$. Since $g$ is $D$-dimensional, this is itself a rank-at-most-$D$ decomposition of $\partial_x f$.
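As a numerical sanity check on the Summarizability Theorem (a rough sketch, reusing the toy $f$ from the code above, with tolerances chosen ad hoc): estimate the Jacobian by finite differences and confirm its rank never exceeds $1$.

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x; column j corresponds to x_j."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - fx) / eps
    return J

# The toy f above has a 1-dimensional summary, so rank(df/dx) <= 1 everywhere.
for _ in range(5):
    x = np.random.randn(10)
    assert np.linalg.matrix_rank(jacobian(f, x), tol=1e-4) <= 1
```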
In practice, the converse will also usually hold: if the rank of $\partial_x f$ is at most $D$ for all values of $x$, then we can usually find a $D$-dimensional summary $g$. Indeed, if the rank is constant near some point $x^*$, then we can always find a local $D$-dimensional summary near $x^*$; that's what the constant rank theorem says. However, Weird Stuff can sometimes prevent stitching these local summaries together into a global summary. (Thank you to Vanessa [LW · GW] for pointing me to an example of such "Weird Stuff", as well as the name of the constant rank theorem.)
Minor notation point: each variable $x_i$ corresponds to a column of $\partial_x f$. This convention will be used throughout the post. We will also assume that each $x_i$ is one-dimensional; higher-dimensional variables are represented by their components.
The Additive Summary Equation
The heart of the generalized KPD theorems is a family of special cases of the Summary Equation in which $f$ is a sum of terms, each of which depends on only a few variables. I'll call this the Additive Summary Equation. The most general version looks like this:

$$\bar{f}(g(x)) = \sum_i f^i(x_{N_i})$$
… where the $f^i$ are (known) smooth functions of output dimension $m$, and the $N_i$ specify (known) indices of $x$. Notation example: if we have a term $f^3(x_2, x_5)$, then $N_3 = \{2, 5\}$ and $x_{N_3} = (x_2, x_5)$.
The notation $N_i$ here stands for a "neighborhood" induced by $f^i$, specifying the indices of the $x$-variables on which $f^i$ depends. In the following sections, we'll talk about the neighborhood of a variable $x_j$, denoted $N(j)$. This consists of all the variables which are neighbors of $x_j$ in any of the $f^i$-induced neighborhoods, i.e. $N(j) := \bigcup_{i : j \in N_i} N_i$. In other words, $N(j)$ contains the indices of variables $x_{j'}$ for which some $f^i$ depends on both $x_j$ and $x_{j'}$.
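In code, $N(j)$ is just a union over the term neighborhoods containing $j$ - a small sketch, with a made-up sparsity pattern:

```python
# Hypothetical sparsity pattern: f = f^1(x_0, x_1) + f^2(x_1, x_2) + f^3(x_3).
term_neighborhoods = [{0, 1}, {1, 2}, {3}]

def variable_neighborhood(j, term_neighborhoods):
    """N(j): union of all term neighborhoods N_i that contain j."""
    return set().union(*[N_i for N_i in term_neighborhoods if j in N_i])

assert variable_neighborhood(1, term_neighborhoods) == {0, 1, 2}
assert variable_neighborhood(3, term_neighborhoods) == {3}
```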
In the generalized KPD theorems, the neighborhoods reflect the graphical structure of the distribution. If $P[X]$ factors according to a DAG [LW · GW]:

$$P[X = x] = \prod_i P[X_i = x_i \mid X_{pa(i)} = x_{pa(i)}]$$
… then the corresponding functional equation looks like

$$\bar{f}(g(x)) = \sum_i \ln P[X_i = x_i \mid X_{pa(i)} = x_{pa(i)}]$$
… i.e. $f^i(x_{N_i}) = \ln P[X_i = x_i \mid X_{pa(i)} = x_{pa(i)}]$, with $N_i = \{i\} \cup pa(i)$ derived from the DAG structure. (Here the "parents" $pa(i)$ are nodes with arrows into node $i$ in the DAG.) For instance, if the $X_i$ are all conditionally independent (as in the original KPD), then the equation is simply

$$\bar{f}(g(x)) = \sum_i \ln P[X_i = x_i]$$
… i.e. $f^i(x_{N_i}) = \ln P[X_i = x_i]$ and $N_i = \{i\}$. Another example: if the variables form a Markov Chain with $X_1 \to X_2 \to \cdots \to X_n$, then the corresponding equation is

$$\bar{f}(g(x)) = \sum_i \ln P[X_i = x_i \mid X_{i-1} = x_{i-1}]$$

(with the convention that the $i = 1$ term is just $\ln P[X_1 = x_1]$)
… i.e. $N_i = \{i-1, i\}$.
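So for factored distributions the term neighborhoods come straight from the parent structure: $N_i = \{i\} \cup pa(i)$. A tiny sketch for a hypothetical 5-node chain (0-indexed):

```python
# Markov chain X_0 -> X_1 -> X_2 -> X_3 -> X_4: each node's only parent
# is its predecessor, so each term touches at most two variables.
parents = {0: set(), 1: {0}, 2: {1}, 3: {2}, 4: {3}}
term_neighborhoods = [{i} | parents[i] for i in range(5)]
assert term_neighborhoods == [{0}, {0, 1}, {1, 2}, {2, 3}, {3, 4}]
```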
Main Theorem
Let $f(x) := \sum_i f^i(x_{N_i})$. Then the additive summary equation is solvable for $\bar{f}$ and $D$-dimensional $g$ only if $f$ can be expressed as

$$f(x) = \tilde{f}\left( x_{E \cup N(E)},\ c + \sum_{i :\, N_i \cap (E \cup N(E)) = \emptyset} M f^i(x_{N_i}) \right)$$
… for some function $\tilde{f}$ whose second argument is at-most $D$-dimensional, constant matrix $M$ of column-dimension at-most $D$, constant vector $c$, and a set $E$ of at-most $D$ $x$-indices. The notation $N(E)$ denotes "neighbors" of $E$, meaning $x$-indices $j$ for which some $f^i$ depends on both $x_j$ and a variable in $x_E$. (In particular, this means that all $f^i$ which depend on $x_E$ are constant when $x_{E \cup N(E)}$ is held constant.) Furthermore, the sparsity structure of each term $M f^i(x_{N_i})$ in the sum (i.e. the set of variables on which it depends) matches one of the $f^i$. (See the end of the "Rest of the Proof" section for the exact correspondence.)
The theorem is interesting mainly when:
- The number of variables $n$ is much larger than the summary dimension $D$, i.e. $n \gg D$, and...
- The number of variables on which each $f^i$ depends is small, so each variable has few neighbors
When these conditions hold, $N_i \cap (E \cup N(E))$ will be nonempty only for a very small fraction of terms, so the impact of the vast majority of terms/variables on $f$ is mediated by the at-most $D$-dimensional $c + \sum_i M f^i(x_{N_i})$; this sum serves as a summary for the $x_{(E \cup N(E))^c}$ (i.e. the variables which are not in $E$ and are not neighbors of the variables in $E$).
Intuitively, the simple result we'd "really like" is $f(x) = \tilde{f}\left( c + \sum_i M f^i(x_{N_i}) \right)$, with $c + \sum_i M f^i(x_{N_i})$ at-most $D$-dimensional. This is not true in general for functions with $D$-dimensional summaries, but it is "almost true": it holds for all but a few "exceptions", i.e. a few extra terms/variables which can influence $f$ in more general ways. The number of exceptional variables is $|E \cup N(E)|$ - i.e. the variables $x_E$ plus their neighbors.
Note that the theorem claims "only if", but not "if". In the other direction, we can make a slightly weaker statement: any $f$ satisfying the above form has a summary-function $g$, with dimension at-most $D + |E \cup N(E)|$. The summary is just the at-most $D$-dimensional summary of the non-exception terms, i.e. $c + \sum_{i :\, N_i \cap (E \cup N(E)) = \emptyset} M f^i(x_{N_i})$, plus the "exception" variables $x_{E \cup N(E)}$.
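Here's what that weaker converse direction looks like as code - a sketch under the theorem's assumptions, with all names mine; `exception_idxs` stands for $E \cup N(E)$:

```python
import numpy as np

def summary(x, exception_idxs, terms, M, c):
    """Summary for an f of the theorem's form: the exception variables
    x_{E u N(E)}, concatenated with the at-most-D-dimensional sum over
    terms which touch no exception variable.

    terms: list of (N_i, f_i) pairs, where N_i is a set of indices and
    f_i maps the sub-vector x[sorted(N_i)] to an m-dimensional output."""
    s = c.copy()
    for N_i, f_i in terms:
        if not (N_i & exception_idxs):   # term touches no exception variable
            s = s + M @ f_i(x[sorted(N_i)])
    return np.concatenate([x[sorted(exception_idxs)], s])
```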
Main Trick of the Proof
Pick some point $x^*$ at which the rank of $\partial_x f$ takes its maximum value (which is at most $D$ by the Summarizability Theorem). Then we can pick a set $E$ of $x$-indices, of size at most $D$ (i.e. $|E| \leq D$), such that the columns of $\partial_{x_E} f|_{x^*}$ form a basis for the (at-most $D$-dimensional) column span of $\partial_x f|_{x^*}$. If the system is very sparse, then $\partial_{x_E} f$ will only depend on a few of the $x$-variables, namely $x_{E \cup N(E)}$.
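One concrete way to pick such a set $E$ (a sketch; greedy selection is just one option, and all names here are mine) is to scan the columns of the Jacobian at $x^*$ and keep those which increase the rank:

```python
import numpy as np

def pick_exception_indices(J, D, tol=1e-8):
    """Greedily pick column indices E (|E| <= D) of the Jacobian J at x*
    whose columns form a basis for J's column span."""
    E, chosen = [], np.zeros((J.shape[0], 0))
    for j in range(J.shape[1]):
        candidate = np.column_stack([chosen, J[:, j]])
        if np.linalg.matrix_rank(candidate, tol=tol) > chosen.shape[1]:
            E.append(j)              # column j adds a new dimension; keep it
            chosen = candidate
        if len(E) == D:
            break
    return E
```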
Since $\partial_{x_E} f|_{x^*}$ spans the maximum number of dimensions, all columns of $\partial_x f|_{x^*}$ must fall within that span - otherwise the rank of $\partial_x f$ would exceed its maximum. And since $\partial_{x_E} f$ depends only on $x_{E \cup N(E)}$, this must hold for any values of the other variables $x_{(E \cup N(E))^c}$. So, we can change the other variables any way we please, holding $x_{E \cup N(E)}$ constant, and $\partial_x f$ will remain in the span of $\partial_{x_E} f|_{x^*}$.
Let $U$ be any basis for the span of $\partial_{x_E} f|_{x^*}$. (Of course we could choose $\partial_{x_E} f|_{x^*}$ itself, but often there's some cleaner basis, depending on the application.) Then $U U^\dagger$ is a projection matrix, projecting into the span (here $\dagger$ denotes the pseudoinverse). For any $x$ with $x_{E \cup N(E)} = x^*_{E \cup N(E)}$, $\partial_x f$ must fall within the span, which is equivalent to

$$\partial_x f = U U^\dagger \partial_x f$$
(for $x_{E \cup N(E)} = x^*_{E \cup N(E)}$)
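(As a quick numerical check - my example, not the post's - the pseudoinverse-based $U U^\dagger$ really is a projection onto $\mathrm{span}(U)$ even for a non-orthonormal basis; this also comes up in the comments below.)

```python
import numpy as np

U = np.random.randn(6, 2)       # a (generically) rank-2, non-orthonormal basis
P = U @ np.linalg.pinv(U)       # U U^dagger, where dagger = pseudoinverse

assert np.allclose(P @ P, P)    # idempotent: P is a projection
v = U @ np.random.randn(2)      # arbitrary vector in span(U)
assert np.allclose(P @ v, v)    # P fixes vectors in the span
```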
Rest of the Proof
Next, we integrate. We'll start at $x^*$, then take any path from $x^*$ to $\hat{x} := (x^*_{E \cup N(E)},\ x_{(E \cup N(E))^c})$ holding $x_{E \cup N(E)}$ constant. Then, we'll go from $\hat{x}$ to $x$ holding $x_{(E \cup N(E))^c}$ constant. So:

$$f(x) = f(x^*) + \int_{x^*}^{\hat{x}} \partial_x f \, dx' + \int_{\hat{x}}^{x} \partial_x f \, dx'$$
For the first integral, $x_{E \cup N(E)}$ is held constant at $x^*_{E \cup N(E)}$, so by the previous section $\partial_x f = U U^\dagger \partial_x f$:

$$f(x) = f(x^*) + U \int_{x^*}^{\hat{x}} U^\dagger \partial_x f \, dx' + \int_{\hat{x}}^{x} \partial_x f \, dx'$$
… and we'll expand $f(x) = \sum_i f^i(x_{N_i})$:

$$f(x) = f(x^*) + U \sum_i \int_{x^*}^{\hat{x}} U^\dagger \partial_x f^i \, dx' + \sum_i \int_{\hat{x}}^{x} \partial_x f^i \, dx'$$
Now, we break the sum up into terms which do not depend on $x_{E \cup N(E)}$, i.e. for which $N_i \cap (E \cup N(E)) = \emptyset$ (for which the second integral contributes zero), and terms which do depend on $x_{E \cup N(E)}$, i.e. for which $N_i \cap (E \cup N(E)) \neq \emptyset$ (for which we can't say anything nontrivial):

$$f(x) = f(x^*) + U \sum_{i :\, N_i \cap (E \cup N(E)) = \emptyset} U^\dagger \left( f^i(x_{N_i}) - f^i(x^*_{N_i}) \right) + \sum_{i :\, N_i \cap (E \cup N(E)) \neq \emptyset} \left( U \int_{x^*}^{\hat{x}} U^\dagger \partial_x f^i \, dx' + \int_{\hat{x}}^{x} \partial_x f^i \, dx' \right)$$
… and simplify a bit: the leftover terms depend only on the exception variables $x_{E \cup N(E)}$, so we can collect them (together with $f(x^*)$) into a single function $h$, and collect the constants $-U^\dagger f^i(x^*_{N_i})$ into a constant vector $c$:

$$f(x) = U \left( c + \sum_{i :\, N_i \cap (E \cup N(E)) = \emptyset} U^\dagger f^i(x_{N_i}) \right) + h(x_{E \cup N(E)})$$
Since $c + \sum_i U^\dagger f^i(x_{N_i})$ is at-most $D$-dimensional ($U^\dagger$ has at most $D$ rows), that proves the theorem; we can choose $\tilde{f}(x_{E \cup N(E)}, s) := U s + h(x_{E \cup N(E)})$ with $s$ ranging over values of the summary sum, and $M = U^\dagger$. Note that each term $U^\dagger f^i(x_{N_i})$ in the summary sum comes from exactly one term $f^i(x_{N_i})$ in the original sum - that is the sparsity correspondence promised in the theorem statement.
Loose Threads
This theorem is strong enough for my immediate needs, but still a little weaker than I’d ideally like.
First, there's the converse of the Summarizability Theorem. In practice, when $\partial_x f$ has rank at-most $D$ everywhere, I generally expect there to be a $D$-dimensional summary. But there are exceptions, and I haven't found a simple, convenient condition which is sufficient to ensure the existence of the summary and easily applies to most of our day-to-day functions. On the other hand, I haven't spent that much effort looking for such a condition, so maybe someone can point me to it. It's definitely the sort of thing I'd expect somebody else to have already solved to death.
Second, there's probably room to reduce the freedom available to $\tilde{f}$ and the $x_{E \cup N(E)}$-dependent terms. In particular, I believe we can impose $d + |E \cup N(E)| \leq D$, where $d$ is the column-dimension of $M$ (i.e. the dimension of the summary sum), rather than just $d \leq D$ and $|E| \leq D$ separately. This requires first reducing $U$, so that it only spans the dimensions actually needed to summarize $x_{(E \cup N(E))^c}$, rather than all the dimensions spanned by $\partial_x f$. Given that reduction of $U$, the basic trick is to first go through the process from the above proof, but after that choose a new basis $U'$ which includes variables from $E$, and go through the whole construction again with $U'$ to get a new exception set $E'$. Terms dependent only on $x_E$ or only on $x_{E'}$ can be summarized via $U'$ or $U$ respectively, so the only variables which can't be summarized this way are those dependent on variables in both $E$ and $E'$. We should be able to iterate this process until no further reduction of the "exception" terms occurs, which should happen when $d + |E \cup N(E)|$ is equal to the maximum rank of $\partial_x f$.
In the special case where each $f^i$ depends only on $x_i$, this process of iteratively reducing the number of exception terms is relatively straightforward, and we can indeed impose $d + |E| \leq D$. (I'm not going to go through the proof here; consider it an exercise for the reader.) (In case anyone isn't familiar with what "exercise for the reader" means in math: don't actually do that exercise, it's a pain in the ass.)
Some Special Cases
There are two main classes of special cases: special “neighborhood” structure, and symmetry.
Structure
The simplest example of special neighborhood structure is when each $f^i$ depends only on $x_i$ (corresponding to conditionally independent variables in the generalized KPD theorem). As alluded to above, we can then strengthen the theorem so that $d + |E| \leq D$. Furthermore, "neighbors" are trivial: $N(j) = \{j\}$, so $E \cup N(E) = E$. That means the summary $\left( x_E,\ c + \sum_{i \notin E} M f^i(x_i) \right)$ is at-most $D$-dimensional. Thus we have the converse of the theorem; it becomes an if-and-only-if.
Another useful structural constraint is when each variable has at most $k$ neighbors (including itself), i.e. $|N(j)| \leq k$ for all $j$. In that case, $|E \cup N(E)| \leq Dk$. If the number of variables $n$ is much larger than $Dk$, i.e. $n \gg Dk$, then this guarantees that the large majority of variables influence $f$ only via the at-most $D$-dimensional summary $c + \sum_i M f^i(x_{N_i})$. (E.g. with $D = 2$ and $k = 4$, at most $8$ variables can be exceptional, no matter how large $n$ is.)
Symmetry
By "symmetry", I mean that $f$ is invariant under swapping some variables, e.g. swapping $x_1$ with $x_2$. This is interesting mainly when we can swap a variable in $E \cup N(E)$ with a variable not in $E \cup N(E)$. When that happens, both variables must be summarizable by the sum $\sum_i M f^i(x_{N_i})$. In particular, if every variable potentially in $E \cup N(E)$ can always be swapped with a variable not in $E \cup N(E)$, then we can eliminate the exception terms altogether.
For instance, the original KPD assumed conditionally IID variables, corresponding to a summary equation with $f^i(x_{N_i}) = h(x_i)$ for a single function $h$ - i.e. each term is the same function acting on a different variable. In this case, any variable can be swapped with any other, so we can eliminate the exception terms; we must have $f(x) = \tilde{f}\left( c + \sum_i M h(x_i) \right)$ for at-most $D$-dimensional $c + \sum_i M h(x_i)$. In fact, this is somewhat stronger than the corresponding result in the original KPD: it applies even when the number of variables is finite, whereas the original KPD only requires that the summary have a finite dimension as the number of variables increases to infinity.
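As a familiar sanity check (my example, not from the post): for IID Gaussians, the log-likelihood is additive with identical terms, and

$$\ln P[x \mid \mu, \sigma] = \sum_i \ln P[x_i \mid \mu, \sigma] = -\frac{1}{2\sigma^2} \sum_i x_i^2 + \frac{\mu}{\sigma^2} \sum_i x_i - n \left( \frac{\mu^2}{2\sigma^2} + \ln\left( \sqrt{2\pi}\, \sigma \right) \right)$$

… so the log-likelihood depends on $x$ only through the two-dimensional summary $\left( \sum_i x_i,\ \sum_i x_i^2 \right)$, for every value of $(\mu, \sigma)$ - the classic sufficient statistic, with no exception terms.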
Comments
comment by Leon Lang · 2023-02-09
> Then $U U^\dagger$ is a projection matrix, projecting into the span.
To clarify: for this, you probably need the basis to be orthonormal?
reply by johnswentworth · 2023-02-09
The "dagger" indicates a pseudoinverse, not a transpose, which is why this works even with non-orthonormal U. But an orthonormal basis would probably be most convenient; in that case the pseudoinverse is just the transpose.
reply by Leon Lang · 2023-02-09
Thanks!