Generalizing Koopman-Pitman-Darmois
post by johnswentworth · 2021-07-15T22:33:03.772Z
Suppose we want to estimate the mean and variance of a normal distribution from some samples from the distribution. It turns out that we don’t need to keep around all of the data: only the sample mean and sample variance are needed for the estimation problem. From those, we can calculate the entire posterior distribution of $(\mu, \sigma^2)$. The pair $(\hat{\mu}, \hat{\sigma}^2)$ is a “sufficient statistic” for $(\mu, \sigma^2)$: a statistic which summarizes all the estimation-relevant information from the data.
In principle, every estimation problem has a sufficient statistic: just let the summary be all the data. But this isn’t very useful. We really want to compress the data, pull out the actually-useful parts and throw away the rest. In the normal distribution example, our $n$ samples of data are each 1-dimensional, for a total of $n$ dimensions of data, but the summary is only 2-dimensional, no matter how large $n$ grows.
When will an estimation problem have a fixed-dimensional sufficient statistic as the number of data points increases, like the normal-distribution estimation problem? The Koopman-Pitman-Darmois (KPD) Theorem says that, when estimating distribution parameters $\theta$ from IID samples $x_1, \ldots, x_n$, there is a fixed-dimensional sufficient statistic if-and-only-if the distribution can be expressed in exponential family form:

$$P[X_i = x_i \mid \Theta = \theta] = \frac{1}{Z(\theta)}\, h(x_i)\, e^{g(\theta) \cdot T(x_i)}$$

… for some vector-valued functions $T$ and $g$ and positive-real-valued function $h$. The sufficient statistic is then $\sum_i T(x_i)$. For instance, for a normal distribution, the parameters are $\theta = (\mu, \sigma^2)$, and we can use $T(x_i) = (x_i, x_i^2)$ and $g(\theta) = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)$.
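As a quick check on the normal-distribution example, here is the standard algebra putting the normal density into this form (a routine derivation, not quoted from the original post; $Z(\theta)$ collects all the $x$-independent factors):

$$\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \underbrace{1}_{h(x)} \cdot \exp\left(\frac{\mu}{\sigma^2}\, x - \frac{1}{2\sigma^2}\, x^2\right) \cdot \underbrace{\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\mu^2/(2\sigma^2)}}_{1/Z(\theta)}$$

Reading off the exponent gives $T(x) = (x, x^2)$ and $g(\theta) = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)$, and summing $T$ over samples gives the sufficient statistic $\left(\sum_i x_i, \sum_i x_i^2\right)$, which is equivalent to the sample mean and variance.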
Intuitively, this says the “coupling” between $x_i$ and $\theta$ is only via the fixed-dimensional “summary” vectors $T(x_i)$ and $g(\theta)$.
That’s the original KPD Theorem. We’ll see in this post that the theorem generalizes a lot. It’s not just estimation of distribution parameters from IID samples - KPD can also generalize to estimation of parameters for Markov chains, or causal models with symmetry [LW · GW] - aka stochastic programs. The theorem also extends to causal models without any symmetry, albeit slightly weaker: a small number of variables can violate the exponential form, but (roughly speaking) the number of variables which can violate the exponential form is proportional to the summary dimension. So if we have a low-dimensional summary of lots of variables, the vast majority of variables must still satisfy the exponential form. The condition which does most of the work is sparsity: we’ll assume the conditional distribution factors according to a causal model with a sparse graph. Or, in more physical terms: direct interactions are local; long-range interactions are always mediated by short-range interactions. Conditionally independent variables are just the most extreme version of this - their conditional distribution factors according to a graph with no edges at all (i.e. there are no interactions at all).
In particular, I now expect most abstractions [? · GW] to mostly satisfy the exponential family form, though that’s a topic for future posts.
The Theorems
We’ll cover three generalized KPD theorems, each with different assumptions on the structure of $P[X|\Theta]$:
- Independent But Non-Identical
- Causal Model/Bayes Net
- IID
The first is intended as a relatively-simple case to illustrate the core ideas, before we introduce more complexity. The second is the most general theorem. The third is intended to illustrate how we can leverage symmetry to rule out the annoying “exception” variables.
This is definitely not a comprehensive set of KPD generalizations; these theorems and proofs are mainly intended to illustrate the general tools and techniques by which KPD generalizations can be derived.
Independent But Non-Identical KPD
The simplest starting point for these theorems is parameter estimation from independent-but-not-identically-distributed variables. (If you want a concrete image in mind, picture scientists trying to estimate the values of some physical constants using a bunch of different experiments.)
Theorem
Let $X_1, \ldots, X_n$ be continuous random variables, conditionally independent given $\Theta$ (i.e. $P[X|\Theta] = \prod_i P[X_i|\Theta]$). There exists a D-dimensional sufficient statistic summarizing all information in $X$ relevant to $\Theta$ only if the distribution has the form

$$P[X = x \mid \Theta = \theta] = \frac{1}{Z(\theta)} \left( \prod_{i \notin \mathcal{E}} h_i(x_i)\, e^{g(\theta) \cdot T_i(x_i)} \right) f(x_{\mathcal{E}}, \theta)$$

… where:
- $T_i$ and $g$ are some at-most D-dimensional vector-valued summary functions
- $\mathcal{E}$ is a set of “exception” variables, not satisfying the exponential form
- The number of exception variables $|\mathcal{E}|$ is at-most D
Intuitively, this is like the original KPD, but some variables can be “exceptions” and not follow the exponential form - their coupling with $\theta$ is not mediated by the summary vectors $T_i(x_i)$ and $g(\theta)$. However, there can only be at-most D exceptions. When the number of variables is much greater than the summary dimension (i.e. D), this means that the vast majority of variables must satisfy the exponential form. The summary vectors mediate the interactions with $\theta$ of all but at-most D variables.
Note that this is an “only if” theorem, not an “if-and-only-if” theorem. It does establish that a fixed-dimension sufficient statistic exists if-and-only-if the distribution takes the above form, since any distribution with the above form can use $\left(\sum_{i \notin \mathcal{E}} T_i(x_i),\ x_{\mathcal{E}}\right)$ as a sufficient statistic. However, that sufficient statistic has dimension at-most 2D, whereas the “only if” theorem above assumes the existence of a statistic of dimension D. So there’s a “gap” in the if-and-only-if characterization: any conditionally-independent distribution with a D-dimensional sufficient statistic satisfies the form, but given a distribution which satisfies the form, we may only be able to find a 2D-dimensional sufficient statistic rather than a D-dimensional sufficient statistic.
Of course, for applications with $n \gg D$, the difference between D and 2D won't matter much anyway.
Proof
The first key idea is the Minimal Map Theorem [LW · GW] (also proved more elegantly here [LW · GW]): if we want to summarize all the information in which is relevant to , then the posterior distribution is a sufficient statistic. It’s also “minimal” in the sense that the posterior distribution can be calculated from any other sufficient statistic, therefore any other sufficient statistic must contain at least as much information (this can be proven via the Data Processing Inequality).
So, given our hypothetical fixed-dimensional summary statistic $S(x)$ (where $x$ is a vector of all the data points), we can write

$$P[\Theta \mid X = x] = f(S(x))$$

… for some function $f$. The right-hand side is a data structure containing the values of $P[\Theta = \theta \mid X = x]$ for EVERY $\theta$-value; we can think of it as a vector indexed by $\theta$. (In the case where $\theta$ has $k$ possible values $\theta^{(1)}, \ldots, \theta^{(k)}$, any map $\theta \mapsto v_\theta$ with real-valued $v_\theta$ can be represented as a length-$k$ vector whose $j$th component is $v_{\theta^{(j)}}$.)
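To make the “vector indexed by $\theta$” picture concrete, here is a minimal sketch (my own illustration, not from the post), with a discretized $\theta$ and a made-up coin-flip likelihood; the posterior $P[\Theta \mid X = x]$ is literally stored as a length-$k$ vector:

```python
import numpy as np

# Discretize theta into k candidate values, so P[Theta | X = x] is a length-k vector.
thetas = np.linspace(0.1, 0.9, 9)              # hypothetical parameter grid
prior = np.full(len(thetas), 1 / len(thetas))  # uniform prior over the grid

def posterior_vector(x, likelihood):
    """Return the posterior P[Theta | X = x] as a vector indexed by theta."""
    lik = np.array([likelihood(x, th) for th in thetas])
    unnormalized = prior * lik
    return unnormalized / unnormalized.sum()

# Example: x is a vector of coin flips, theta is the bias. Here the number of heads
# is a 1-dimensional sufficient statistic, so D = 1 while len(x) can grow.
def bernoulli_likelihood(x, th):
    heads = x.sum()
    return th ** heads * (1 - th) ** (len(x) - heads)

x = np.array([1, 0, 1, 1, 0, 1])
print(posterior_vector(x, bernoulli_likelihood))  # length-9 vector indexed by theta
```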
Now, the key property underlying (this version of) the KPD is conditional independence of $X$ given $\Theta$, i.e. $P[X = x \mid \Theta = \theta] = \prod_i P[X_i = x_i \mid \Theta = \theta]$. We want to leverage that factorization, so we’ll transform the distribution into a representation we can factor. First, transform the distribution to a log-odds representation:

$$\ln \frac{P[\Theta = \theta \mid X = x]}{P[\Theta = \theta_0 \mid X = x]}$$

… for some arbitrary reference parameters $\theta_0$. Next, we apply Bayes’ Rule:

$$\ln \frac{P[\Theta = \theta \mid X = x]}{P[\Theta = \theta_0 \mid X = x]} = \ln \frac{P[X = x \mid \Theta = \theta]}{P[X = x \mid \Theta = \theta_0]} + \ln \frac{P[\Theta = \theta]}{P[\Theta = \theta_0]}$$

… then subtract off the prior term from both sides, since it’s not a function of $x$:

$$\ln \frac{P[\Theta = \theta \mid X = x]}{P[\Theta = \theta_0 \mid X = x]} - \ln \frac{P[\Theta = \theta]}{P[\Theta = \theta_0]} = \ln \frac{P[X = x \mid \Theta = \theta]}{P[X = x \mid \Theta = \theta_0]}$$

To clean up the notation, we’ll define $f_i(x_i)_\theta := \ln \frac{P[X_i = x_i \mid \Theta = \theta]}{P[X_i = x_i \mid \Theta = \theta_0]}$, and write the left-hand side (which is a function of $S(x)$ alone) as $F(S(x))_\theta$. Finally, we can apply the factorization (remember, that was the point of all these transformations) and get

$$F(S(x)) = \sum_i f_i(x_i)$$
The right-hand side of this expression is a sum of vectors indexed by $\theta$. Each vector $f_i(x_i)$ is a function of just one of the $x_i$, and the left-hand side says that we can express the sum in terms of a D-dimensional summary $S(x)$. So, this is an Additive Summary Equation (see appendix below for details). In fact, it’s a particularly simple Additive Summary Equation: each $f_i$ depends only on $x_i$, and each $x_i$ has no neighbors besides itself.
The Additive Summary Equation Theorem then tells us that the equation is solvable (i.e. a D-dimensional summary statistic exists) only if

$$\sum_i f_i(x_i) = A \sum_{i \notin \mathcal{E}} T_i(x_i) + f_{\mathcal{E}}(x_{\mathcal{E}})$$

… for some $\mathcal{E}$ consisting of at-most D $i$-indices, and some $T_i$ of dimension at-most D (with $A$ a constant matrix with D columns).

Exponentiating both sides and simplifying a bit, this condition becomes

$$P[X = x \mid \Theta = \theta] = \frac{1}{Z(\theta)} \left( \prod_{i \notin \mathcal{E}} h_i(x_i)\, e^{g(\theta) \cdot T_i(x_i)} \right) f(x_{\mathcal{E}}, \theta)$$

… which is the desired form (here $g(\theta)$ is the $\theta$-row of $A$, and the reference likelihoods $P[X_i = x_i \mid \Theta = \theta_0]$ get absorbed into the $h_i$ and $f$).
Key Takeaways
The main takeaway from the theorem is the idea of “exception” variables, which circumvent the exponential family form, and the idea that the number of exception variables is limited by the dimensionality of the sufficient statistic.
The main takeaways from the proof are:
- Start with the Minimal Map Theorem to get the “most general” sufficient statistic
- Transform to apply the factorization of the distribution
- Use the Additive Summary Equation Theorem
We’ll reuse this general structure in the next proof.
Causal Model/Bayes Net KPD
Now we address a much more general case: factorization according to a Bayes Net/Causal Model. We still imagine estimating some parameters $\Theta$, but our data is no longer conditionally independent. Conditional on the parameters, interactions between data points are no longer non-existent. However, the direct interactions are sparse - i.e. most data points interact directly with only a few other data points (though they may indirectly interact with everything). This would be the case if, for example, we are trying to estimate parameters of some complicated physical system in real-time from a single instance of the system.
Theorem
Let $X_1, \ldots, X_n$ be continuous random variables, whose distribution conditional on $\Theta$ forms a Bayes Net on a DAG (i.e. $P[X = x \mid \Theta = \theta] = \prod_i P[X_i = x_i \mid X_{pa(i)} = x_{pa(i)}, \Theta = \theta]$, where $pa(i)$ denotes the nodes with edges directly into $X_i$ in the DAG). There exists a D-dimensional sufficient statistic summarizing all information in $X$ relevant to $\Theta$ only if the distribution has the form

$$P[X = x \mid \Theta = \theta] = \frac{1}{Z(\theta)} \left( \prod_{i \notin \mathcal{E}} h_i(x_i, x_{pa(i)})\, e^{g(\theta) \cdot T_i(x_i, x_{pa(i)})} \right) f(x_{\mathcal{E}}, \theta)$$

… where:
- $T_i$ and $g$ are some at-most D-dimensional summary functions
- $\mathcal{E}$ is a set of “exception” variables, not satisfying the exponential form
- The set of exception variables includes at-most D variables, plus variables in the Markov blanket (i.e. parents, children and parents-of-children) of those D variables, plus children of variables in the Markov blanket.
This mostly generalizes the previous theorem in the obvious way, with one major exception: we can now have a lot more exception variables. Roughly speaking, exceptions happen in “chunks”: neighborhoods each consisting of a variable, its “neighbors” in the Markov blanket, and its neighbors' children. So, in order for the theorem to say anything nontrivial, the DAG has to be sparse: each variable has to have only a few neighbors. If a single variable directly touches most of the DAG, then an exception-neighborhood around that variable could include most of the variables in the model.
As long as the DAG is sparse and $n \gg D$, the vast majority of variables must still satisfy the exponential form. If each variable has at-most $k$ variables in its Markov blanket, then the number of exception variables is less than $(k+1)^2 D$. For instance, in a minimal computationally-complete circuit model (e.g. NAND gates), $k = 5$ (two inputs per node, two children per node to allow copying, and one other parent for each child), so the number of exceptions would be less than 36D. If the number of variables exceeds the summary dimension by several orders of magnitude, this would be a drop in the bucket - the vast majority of variables would still need to satisfy the exponential form.
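To get a feel for how the exception count scales in a sparse DAG, here is a small sketch (my own construction; the randomly wired two-parent graph and all numbers are illustrative, not the post's circuit model):

```python
import random
from collections import defaultdict

random.seed(0)
n = 10_000                   # number of variables
parents = defaultdict(list)  # DAG: each node i >= 2 gets 2 random earlier parents
children = defaultdict(list)
for i in range(2, n):
    for p in random.sample(range(i), 2):
        parents[i].append(p)
        children[p].append(i)

def markov_blanket(i):
    """Parents, children, and parents-of-children of node i."""
    blanket = set(parents[i]) | set(children[i])
    for c in children[i]:
        blanket |= set(parents[c])
    blanket.discard(i)
    return blanket

# Exception neighborhood around D "core" variables: the variables themselves,
# their Markov blankets, and the children of everything in those blankets.
D = 10
core = random.sample(range(n), D)
exceptions = set(core)
for i in core:
    blanket = markov_blanket(i)
    exceptions |= blanket
    for j in blanket:
        exceptions |= set(children[j])

print(f"{len(exceptions)} potential exception variables out of {n}")
```

With a sparse graph like this, the exception neighborhood stays a small fraction of the $n$ variables, so almost all variables are still forced into the exponential form.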
Proof
This follows broadly the same structure as the proof for the Independent But Non-Identical KPD, so we’ll speed through the parts which are the same and just focus on differences.
As in that proof, we apply the Minimal Map Theorem, then apply a log odds transformation, and factor the distribution. The factorization is now $P[X = x \mid \Theta = \theta] = \prod_i P[X_i = x_i \mid X_{pa(i)} = x_{pa(i)}, \Theta = \theta]$, so the summary equation is

$$F(S(x)) = \sum_i f_i(x_i, x_{pa(i)})$$

… where now $f_i(x_i, x_{pa(i)})_\theta := \ln \frac{P[X_i = x_i \mid X_{pa(i)} = x_{pa(i)}, \Theta = \theta]}{P[X_i = x_i \mid X_{pa(i)} = x_{pa(i)}, \Theta = \theta_0]}$.
The neighbors of a variable $X_i$ are no longer trivial; they are exactly the Markov blanket of $X_i$ (i.e. parents, children, and parents of children of $X_i$ in the Bayes net). The terms which depend on $x_{N(\mathcal{E}_0)}$, for a chosen set $\mathcal{E}_0$ of variables, are the factors corresponding to neighbors plus children-of-neighbors, i.e. the $f_j$ with $j \in N(\mathcal{E}_0) \cup ch(N(\mathcal{E}_0))$. The Additive Summary Equation Theorem then says that a D-dimensional summary exists only if

$$\sum_i f_i(x_i, x_{pa(i)}) = A \sum_{i \notin \mathcal{E}} T_i(x_i, x_{pa(i)}) + f_{\mathcal{E}}(x_{\mathcal{E}})$$

… where $\mathcal{E} := N(\mathcal{E}_0) \cup ch(N(\mathcal{E}_0))$, and $|\mathcal{E}_0|$ is at-most D.
Exponentiating and simplifying a bit yields:

$$P[X = x \mid \Theta = \theta] = \frac{1}{Z(\theta)} \left( \prod_{i \notin \mathcal{E}} h_i(x_i, x_{pa(i)})\, e^{g(\theta) \cdot T_i(x_i, x_{pa(i)})} \right) f(x_{\mathcal{E}}, \theta)$$

… which is the desired form.
Key Takeaways
The main takeaway here is the idea that exceptional variables occur in neighborhoods, so sparsity is a necessary condition for the theorem to say anything nontrivial. On the other hand, causal models are extremely general [? · GW], sparsity is ubiquitous in the physical sciences [LW · GW], and a sparse causal model is all we need to apply the theorem. We should thus expect exponential family distributions to show up in most places where low-dimensional summaries exist, in practice. In particular, we should expect exponential family forms for most abstractions [? · GW], in practice.
Another important point: the methods used in the proof are “smooth”, in the sense that small violations of the assumptions should produce only small violations of the conclusions. In other words, we should also expect distributions with approximate low-dimensional summaries, which approximately factor as a sparse Bayes Net, to approximately fit the exponential form.
IID KPD
When our variables have some symmetry, we can sometimes leverage it to eliminate the annoying “exception” variables.
The basic requirement is that our distribution is invariant under permuting some of the variables. For instance, if we have a time-symmetric Markov Chain, then we can replace every $X_t$ with $X_{-t}$ (i.e. run the chain backwards), and the distribution remains the same. In the case of IID variables, we can swap any two variables $X_i$ and $X_j$, and the distribution remains the same, i.e.

$$P[\ldots, X_i = a, \ldots, X_j = b, \ldots \mid \Theta] = P[\ldots, X_i = b, \ldots, X_j = a, \ldots \mid \Theta]$$
If we can swap exception variables with non-exception variables, then we can sometimes eliminate the exceptions altogether.
Theorem
Let $X_1, \ldots, X_n$ be continuous random variables, conditionally independent and identically distributed given $\Theta$ (i.e. $P[X = x \mid \Theta = \theta] = \prod_i P[X_i = x_i \mid \Theta = \theta]$, with every $X_i$ having the same conditional distribution). There exists a D-dimensional sufficient statistic summarizing all information in $X$ relevant to $\Theta$, with $D < n$, if-and-only-if the distribution has the form

$$P[X = x \mid \Theta = \theta] = \frac{1}{Z(\theta)} \prod_i h(x_i)\, e^{g(\theta) \cdot T(x_i)}$$

… where $T$ and $g$ are some at-most D-dimensional summary functions.
Other than eliminating the exception terms, there are two differences from the independent-but-not-identically-distributed theorem: the $T_i$’s are all the same function $T$ (and likewise the $h_i$’s are all the same $h$), and we explicitly require $D < n$. When $D \geq n$ in the earlier theorem, all variables can be exceptions, so the theorem says nothing. For this theorem, we need at least one non-exceptional variable to swap the other variables with, so we require $D < n$.
Note that this theorem is strictly stronger than the original KPD as typically stated, which also applies to the IID case. The original KPD talks about existence of a finite-dimensional summary for an infinite stream of IID variables, whereas this version applies even to a finite set of variables, so long as the dimension of the sufficient statistic is smaller than the number of variables.
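As a concrete sanity check on the IID case (my own illustration, not from the post): for IID normal samples, the fixed-dimensional summary $\left(\sum_i x_i, \sum_i x_i^2\right)$ (together with the known sample count $n$) reproduces the full-data log-likelihood exactly, so it carries all the information about $\theta = (\mu, \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=2.0, size=1000)  # hypothetical IID samples
summary = (len(x), x.sum(), (x ** 2).sum())    # fixed-dimensional sufficient statistic

def loglik_full(mu, sigma2, x):
    """Log-likelihood computed from all n data points."""
    return -0.5 * len(x) * np.log(2 * np.pi * sigma2) - ((x - mu) ** 2).sum() / (2 * sigma2)

def loglik_from_summary(mu, sigma2, summary):
    """Same log-likelihood computed from the summary alone."""
    n, s1, s2 = summary
    # Expand sum (x - mu)^2 = sum x^2 - 2 mu sum x + n mu^2
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (s2 - 2 * mu * s1 + n * mu ** 2) / (2 * sigma2)

for mu, sigma2 in [(0.0, 1.0), (1.5, 4.0), (2.0, 0.5)]:
    assert np.isclose(loglik_full(mu, sigma2, x), loglik_from_summary(mu, sigma2, summary))
print("The summary reproduces the full-data log-likelihood for every tested theta.")
```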
Proof
Since this is a special case of independent variables, we can start from our independent but non-identical theorem:

$$P[X = x \mid \Theta = \theta] = \frac{1}{Z(\theta)} \left( \prod_{i \notin \mathcal{E}} h_i(x_i)\, e^{g(\theta) \cdot T_i(x_i)} \right) f(x_{\mathcal{E}}, \theta)$$
Since we assume $D < n$, there must be at least one non-exception variable. So, without loss of generality, assume $1 \notin \mathcal{E}$.
For any other variable $X_j$ with $j \in \mathcal{E}$, we can swap $X_j$ with $X_1$ and integrate out all the other variables to find

$$P[X_1 = x_1, X_j = x_j \mid \Theta = \theta] = \frac{1}{Z(\theta)}\, h_1(x_j)\, e^{g(\theta) \cdot T_1(x_j)}\, f'(x_1, \theta)$$

… then integrate out $x_1$:

$$P[X_j = x_j \mid \Theta = \theta] = \frac{1}{Z(\theta)}\, h_1(x_j)\, e^{g(\theta) \cdot T_1(x_j)}\, f''(\theta)$$

(Minor note: we’re absorbing constants into $f$ from each integral, which is why it keeps gaining primes.)
For variables $i \notin \mathcal{E}$ other than $X_1$, we can also swap $X_i$ with $X_1$ in order to replace $T_i$ with $T_1$. Swapping and integrating as before yields

$$P[X_i = x_i \mid \Theta = \theta] = \frac{1}{Z(\theta)}\, h_1(x_i)\, e^{g(\theta) \cdot T_1(x_i)}\, f'''(\theta)$$

Substituting both of those into the original expression from the independent but non-identical theorem (using conditional independence, $P[X = x \mid \Theta = \theta] = \prod_i P[X_i = x_i \mid \Theta = \theta]$) yields

$$P[X = x \mid \Theta = \theta] = \frac{1}{Z'(\theta)} \prod_i h_1(x_i)\, e^{g(\theta) \cdot T_1(x_i)}$$
… which is the desired form.
Key Takeaways
While this is a minor improvement over the original KPD, the real point is to show how symmetry can remove the exception terms.
I expect this to be an especially powerful tool in conjunction with the techniques in Writing Causal Models Like We Write Programs [? · GW]. That post illustrates how to turn recursively-defined functions into recursively-defined causal models. The recursive calls become exactly the sort of symmetries which could potentially be used to remove exception terms in the Causal Model/Bayes Net KPD.
Takeaways Summary
The original Koopman-Pitman-Darmois theorem says that a fixed-dimensional sufficient statistic exists for a stream of IID variables if-and-only-if the distribution is of exponential form.
This generalizes a lot. Neither independence nor identical distribution is necessary; we need only a sparse causal model. However, this generalization comes with a cost: some “exception” variables may not follow the exponential form. As long as the system is sparse, and the number of variables is much larger than the dimension of the sufficient statistic, the number of exception variables will be relatively small, and the vast majority of variables in the system will satisfy the exponential form.
The proofs should also generalize to approximations - systems with approximate sufficient statistics should be approximately exponential family, except for a few variables.
We can also sometimes leverage symmetry to eliminate the exception variables. If the distribution is invariant under some permutation of the variables, then we can try to permute exception variables to non-exception variables, in which case they must follow the exponential form.
Appendix: Additive Summary Equation Theorem
An Additive Summary Equation is a functional equation of the form

$$F(S(x)) = \sum_i f_i(x_{I_i})$$

… with $F$ and $S$ unknown, $S$ of dimension D, $D < \dim(x)$.
In order to say anything nontrivial about the equation, we need each $f_i$-term to only depend on a few $x$-components $x_{I_i}$, and each $x$-component to only influence a few $f_i$-terms. Given some $x_j$, we can pick out the (indices of) $f_i$-terms which it influences via the expression $\{i : j \in I_i\}$.
We’re also interested in the “neighbors” of $x_j$, i.e. $x$-components $x_{j'}$ for which some $f_i$ depends on both $x_j$ and $x_{j'}$. We’ll write the (indices of) $x_j$’s neighbors as $N(j)$; note that we do consider $x_j$ a neighbor of itself.
The post [LW · GW] on the Additive Summary Equation shows that the equation is solvable only if we can choose a set of at-most D components of $x$, called $x_{\mathcal{E}}$, for which $\sum_i f_i(x_{I_i})$ can be expressed as

$$\sum_i f_i(x_{I_i}) = A \sum_{i\,:\,I_i \cap N(\mathcal{E}) = \emptyset} T_i(x_{I_i}) + \sum_{i\,:\,I_i \cap N(\mathcal{E}) \neq \emptyset} f_i(x_{I_i}) + b$$

… where the $T_i$’s have output dimension at-most D, $A$ is a constant $\dim(F)$-by-D matrix, and $b$ is a constant vector.
In particular, the Additive Summary Equation post gives an explicit choice of the $T_i$’s and $A$, with the sum ranging over terms not dependent on $x_{N(\mathcal{E})}$, i.e. $\{i : I_i \cap N(\mathcal{E}) = \emptyset\}$.
Intuitively, this says that we can’t really say anything about terms dependent on $x_{N(\mathcal{E})}$. However, $N(\mathcal{E})$ consists of at-most D variables plus their neighbors, so in a large sparse system, hopefully most terms will not depend on $x_{N(\mathcal{E})}$. And for all the terms not dependent on $x_{N(\mathcal{E})}$, the information content can be aggregated by summation - i.e. the sum $\sum_{i\,:\,I_i \cap N(\mathcal{E}) = \emptyset} T_i(x_{I_i})$.
Note that we may sometimes want to think about functions whose output dimension is uncountably infinite - i.e. which return a distribution over a continuous variable $\theta$. In that case, vectors indexed by $\theta$ become functions of $\theta$, and sums and dot products over the $\theta$-index become integrals over $\theta$; everything else remains the same. For simplicity, this post’s notation implicitly assumes that $\theta$ has finitely many values, but the generalization is always straightforward.
Comments
comment by Radford Neal · 2021-07-15T23:30:09.532Z
You've left out the condition for the theorem to be true that the region with non-zero probability density can't vary with the parameter.
That's how, for example, the uniform distribution over $(0, \theta)$ can have a scalar sufficient statistic (the maximum data point) despite it not being an exponential family model.
comment by johnswentworth · 2021-07-15T23:40:14.499Z
Good catch, thanks.
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-06-14T12:02:23.910Z
arbitrary reference parameters $\theta_0$.
what is an 'arbitrary reference parameter'? This is not in my vocabulary.
(and why do we need it? can't we just take the log here).
comment by johnswentworth · 2024-06-14T16:30:11.537Z
The nice thing about using odds (or log odds) is that the normalizer cancels out when using Bayes' rule. For a boolean query $Q$ and data $d$, it looks like this:
- Bayes' rule, usual form: $P[Q|d] = \frac{1}{Z} P[d|Q] P[Q]$, where $Z = P[d]$ is the normalizer
- Bayes' rule, odds form: $\frac{P[Q|d]}{P[\bar{Q}|d]} = \frac{P[d|Q]}{P[d|\bar{Q}]} \frac{P[Q]}{P[\bar{Q}]}$, where $\bar{Q}$ is not-$Q$.
That's the usual presentation. But note that it assumes $Q$ is boolean. How do we get the same benefit - i.e. a form in which the normalizer cancels out in Bayes' rule - for non-boolean $Q$?
The trick is to choose a reference value of $Q$, and compute probabilities relative to the probability of that reference value. For instance, if $Q$ is a six-sided die roll, I could choose $Q = 1$ as my reference value, and then I'd represent the distribution as $\left(\frac{P[Q=1]}{P[Q=1]}, \frac{P[Q=2]}{P[Q=1]}, \ldots, \frac{P[Q=6]}{P[Q=1]}\right)$. You can check that, when I update this distribution on data $d$ and represent the updated distribution as $\left(\frac{P[Q=1|d]}{P[Q=1|d]}, \ldots, \frac{P[Q=6|d]}{P[Q=1|d]}\right)$, the normalizer cancels out just like it does for the odds form on a boolean variable.
That's the trick used for $\Theta$ in the post. Applying the trick requires picking an arbitrary value of $\Theta$ to use as the reference value (like $Q = 1$ above), and that's $\theta_0$.
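A minimal numerical sketch of this trick (my own illustration; the prior and likelihood numbers are made up):

```python
import numpy as np

# Distribution over a six-sided die roll Q, stored relative to the reference value Q = 1,
# i.e. as the vector P[Q = q] / P[Q = 1].
prior = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1])
prior_rel = prior / prior[0]                     # reference-value representation

# Hypothetical likelihoods P[d | Q = q] for some observed data d.
likelihood = np.array([0.5, 0.1, 0.4, 0.2, 0.3, 0.6])

# Bayes update in the relative representation: no normalizer P[d] needed.
posterior_rel = prior_rel * likelihood / likelihood[0]

# Check against the usual normalized update.
posterior = prior * likelihood / (prior * likelihood).sum()
assert np.allclose(posterior_rel, posterior / posterior[0])
print(posterior_rel)
```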
comment by David Johnston (david-johnston) · 2021-10-11T03:42:11.829Z
I'm a bit curious about what job "dimension" is doing here. Given that I can map an arbitrary vector in $\mathbb{R}^n$ to some point in $\mathbb{R}$ via a bijective measurable map (https://en.wikipedia.org/wiki/Standard_Borel_space#Kuratowski's_theorem), it would seem that the KPD theorem is false. Is there some other notion of "sufficient statistic complexity" hiding behind the idea of dimensionality, or am I missing something?
comment by johnswentworth · 2021-10-11T15:56:32.383Z
There's a smoothness assumption. I assumed differentiability, although that could be weakened somewhat. (The assumption is hidden in the sister post to this one, The Additive Summary Equation [LW · GW].)
comment by rotatingpaguro · 2023-12-24T20:29:53.374Z
In the "Independent But Non-Identical KPD" version, the term $\sum_{i \notin \mathcal{E}} T_i(x_i)$ is not a sufficient statistic in general, right?
(I could probably figure it out getting into the weeds of the proof, but I did not get it by reading once.)
comment by johnswentworth · 2023-12-24T20:47:07.383Z
Correct.
The term $\sum_{i \notin \mathcal{E}} T_i(x_i)$ along with $x_j$ for all $j \in \mathcal{E}$ is a sufficient statistic, of dimension at most 2D. So if we assume that the $x_i$'s are 1-dimensional, then there's a "gap" in the theorem between D and 2D: there exists a summary statistic of dimension D only if the distribution has that form, and if the distribution has that form then $\sum_{i \notin \mathcal{E}} T_i(x_i)$ and $x_j$ for all $j \in \mathcal{E}$ together form a summary statistic of dimension at-most 2D.
comment by rotatingpaguro · 2023-12-24T22:05:23.015Z
I already understood that because you explain it in the text; the further doubt I have is: concerning only the "only if" part, given a D-dimensional sufficient statistic exists by assumption, whether $\sum_{i \notin \mathcal{E}} T_i(x_i)$ is also a D-dimensional sufficient statistic or not.
I think not, because it should not be able to capture what goes on with the exception variables; that's hidden in the completely arbitrary $f(x_{\mathcal{E}}, \theta)$ term.
This annoys me because I can't see the form of the sufficient statistic like in the i.i.d. case.
comment by Leon Lang (leon-lang) · 2023-01-24T01:51:31.526Z
Summary
An abstraction of a high-dimensional random variable X is a low-dimensional summary G(X) that can be used to make predictions about X. In the case that X is sampled from some parameterized distribution P(X | theta), G(X) may take the form of a sufficient statistic, i.e., a function of X such that P(theta | X) = P(theta | G(X)). To make predictions about X, one may then determine theta from P(theta | G(X)), and predict a new data point X from theta.
In this post, John shows that if you have a very low-dimensional sufficient statistic G(X), then in many cases, X will “almost follow” the exponential family form and thus be fairly easy to deal with.
Further Thoughts
I don’t yet quite understand what John wants to use the theorem for, but I will hopefully learn this in his project update post. My current guess is that he would like to identify abstractions as sufficient statistics of exponential families and that this might be a data structure that is “easier to identify” in the world and in trained machine learning models than our initially “broad guess” for what abstractions could be.
Note that I’m only writing this down to have a prediction to update once I’m reading John’s update post. This seems strictly more useful than not having a prediction at all, even though I don’t place a high chance on my prediction actually being fully correct/nuanced enough.
Another thought is that I feel slightly uneasy about the viewpoint that abstractions are the thing we use to make “predictions about X”. In reality, if a person is in a particular state (meaning that the person is represented by an extremely high-dimensional sample vector X), then to make predictions, I definitely only use a very low-dimensional summary based on my sense of the person’s body part positions and coarse brain state. However, these predictions are not about X: I don’t use the summary to sample vectors that are consistent with the summary; instead, I use the summary to make predictions about summaries themselves. I.e., what will be the person’s mental state and body positions a moment from now, and how will this impact abstractions of other objects in the vicinity of that person? There should then be a “commutative diagram” relating objects in reality to their low-dimensional abstractions, and real-world state transitions to predictions.
I hope to eventually learn more about how this abstraction work feeds into such questions.
comment by harsimony · 2021-07-17T17:36:47.480Z
Is there a generalization of "sufficient statistic" that applies to summaries which grow as the log (or polynomial of the log) in the size of the data?
Despite the fact that your summary keeps growing, this seems like it might also be a useful definition. In the limit of infinite data, a log(n) summary is vanishingly small relative to the size of the data you have.
comment by johnswentworth · 2021-07-17T20:25:44.555Z
Good question.
The theorems here should apply directly to that situation. The summary will eventually be lower-dimensional than the data, and that's all we need for these theorems to apply. At any data size n sufficiently large that the O(log n) summary dimension is smaller than the data dimension, the distribution of those n data points must be exponential family.
comment by harsimony · 2021-07-17T22:12:13.473Z
I see! To reiterate: for a fixed n, having a smaller summary dimension means that the distribution of those n points is part of the exponential family.
It seems like there are three cases here:
- The size of the summary remains constant as the number n of samples grows. The distribution of these samples remains constant. Once enough data is collected, you can estimate the summary statistics and infer the overall distribution.
- The size of the summary grows as O(log(n)). The distribution of these samples changes for each n as the size of the summary grows. You converge on the correct distribution in the limit of infinite n. (This seems weird/incorrect, I might be missing something here. I am trying to think of a concrete example)
- degenerate case: the size of the required "summary" grows faster than O(n) so you are better off just using the data itself as the summary.
comment by johnswentworth · 2021-07-18T14:31:13.526Z
For the second case, each data point can measure something different, possibly correlated with each other, and related in different ways to the parameters we're trying to estimate. For instance, maybe we're trying to estimate some parameters of a car, so we measure the wheel sizes, axle length, number of gears, engine cylinder volume, etc, etc. Every now and then we measure something which gives us a totally different "kind" of information from the other things we measured - something which forces a non-exponential-family update. When that happens, we have to add a new summary component. Eventually other data points may measure the same "kind" of information and also contribute to that component of the summary. But over time, it becomes more and more rare to measure something which no other data point has captured before, so we add summary dimensions more and more slowly.
(Side note: there's no reason I know why O(log n) growth would be special here; the qualitative story would be similar for any sub-linear summary growth.)