Minimal Motivation of Natural Latents

post by johnswentworth, David Lorell · 2024-10-14T22:51:58.125Z · LW · GW · 1 comment

Contents

  The Main Argument
  Approximation
  Why Is This Interesting?

Suppose two Bayesian agents are presented with the same spreadsheet - IID samples of data in each row, a feature in each column. Each agent develops a generative model of the data distribution. We'll assume the two converge to the same predictive distribution, but may have different generative models containing different latent variables. We'll also assume that the two agents develop their models independently, i.e. their models and latents don't have anything to do with each other informationally except via the data. Under what conditions can a latent variable in one agent's model be faithfully expressed in terms of the other agent's latents?

Let’s put some math on that question.

The $n$ “features” in the data are random variables $X_1, \dots, X_n$. By assumption the two agents converge to the same predictive distribution (i.e. distribution of a data point), which we’ll call $P[X]$. Agent $i$’s generative model $M_i$ must account for all the interactions between the features, i.e. the features must be independent given the latent variables in model $M_i$. So, bundling all of each model’s latents together into one latent $\Lambda^i$, we get the high-level graphical structure $\Lambda^i \rightarrow (X_1, \dots, X_n)$ for $i \in \{1, 2\}$, which says that all features are independent given the latents, under each agent’s model.
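In equation form (writing the latent as discrete for concreteness; this just restates the graph in our notation), each agent’s model factors the predictive distribution through its latent:

$$P[X_1, \dots, X_n] = \sum_{\lambda} P[\Lambda^i = \lambda] \prod_{j=1}^{n} P[X_j \mid \Lambda^i = \lambda], \qquad i \in \{1, 2\}$$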

Now for the question: under what conditions on agent 1’s latent(s) $\Lambda^1$ can we guarantee that $\Lambda^1$ is expressible in terms of $\Lambda^2$, no matter what generative model agent 2 uses (so long as the agents agree on the predictive distribution $P[X]$)? In particular, let’s require that $\Lambda^1$ be a function of $\Lambda^2$. (Note that we’ll weaken this later.) So, when is $\Lambda^1$ guaranteed to be a function of $\Lambda^2$, for any generative model $M_2$ which agrees on the predictive distribution $P[X]$? Or, worded in terms of latents: when is $\Lambda^1$ guaranteed to be a function of $\Lambda^2$, for any latent(s) $\Lambda^2$ which account for all interactions between features in the predictive distribution $P[X]$?
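Spelled out (our phrasing of the requirement above), the guarantee we’re asking for is:

$$\text{for all valid } \Lambda^2: \;\; \exists f \text{ such that } \Lambda^1 = f(\Lambda^2) \text{ with probability } 1$$

where “valid” means $\Lambda^2$ is a latent of some generative model $M_2$ which matches $P[X]$ and renders the features independent.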

The Main Argument

$\Lambda^1$ must be a function of $\Lambda^2$ for any generative model $M_2$ which agrees on the predictive distribution. So, here’s one simple model $M_2$ which agrees on the predictive distribution: take $\Lambda^2$ to be $X_{\bar{i}}$, i.e. all but the $i^{\text{th}}$ feature. Since the features are always independent given all but one of them (because any random variables are independent given all but one of them), $X_{\bar{i}}$ is a valid choice of latent $\Lambda^2$. And since $\Lambda^1$ must be a function of $\Lambda^2$ for any valid choice of $\Lambda^2$, we conclude that $\Lambda^1$ must be a function of $X_{\bar{i}}$. Graphically, this gives the deterministic structure $X_{\bar{i}} \rightarrow \Lambda^1$, i.e. $\Lambda^1 = f_i(X_{\bar{i}})$ for some function $f_i$.
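To spell out why $X_{\bar{i}}$ qualifies as a latent: conditional on $X_{\bar{i}}$, every feature other than $X_i$ is fixed, and a constant is independent of everything, so the features trivially factor:

$$P[X = x \mid X_{\bar{i}}] = P[X_i = x_i \mid X_{\bar{i}}] \prod_{j \neq i} \mathbb{1}[X_j = x_j]$$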
By repeating the argument, we conclude that the same must apply for all $i$: $\Lambda^1 = f_i(X_{\bar{i}})$ for every $i \in \{1, \dots, n\}$.
Now we’ve shown that, in order to guarantee that $\Lambda^1$ is a function of $\Lambda^2$ for any valid choice of $\Lambda^2$, and for $\Lambda^1$ to account for all interactions between the features in the first place, $\Lambda^1$ must satisfy at least the conditions:

- Mediation: $X_1, \dots, X_n$ are independent given $\Lambda^1$.
- Redundancy: $\Lambda^1$ is a function of $X_{\bar{i}}$, for each $i \in \{1, \dots, n\}$.
… which are exactly the (weak) natural latent conditions [LW · GW], i.e. $\Lambda^1$ mediates between all $X_i$’s and all information about $\Lambda^1$ is redundantly represented across the $X_i$’s. From the standard Fundamental Theorem of Natural Latents [LW · GW], we also know that the natural latent conditions are almost sufficient[1]: they don’t quite guarantee that $\Lambda^1$ is a function of $\Lambda^2$, but they guarantee that $\Lambda^1$ is a stochastic function of $\Lambda^2$, i.e. $\Lambda^1$ can be computed from $\Lambda^2$ plus some noise which is independent of everything else (and in particular the noise is independent of $X$).
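In symbols (with $N$ as our name for the noise variable; this is just the standard unpacking of “stochastic function”):

$$\Lambda^1 = f(\Lambda^2, N), \qquad N \perp (X, \Lambda^2)$$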

… so if we go back up top and allow $\Lambda^1$ to be a stochastic function of $\Lambda^2$, rather than just a function, then the natural latent conditions provide necessary and sufficient conditions for the guarantee which we want.

Approximation

Since we’re basically just invoking the Fundamental Theorem of Natural Latents, we might as well check how the argument behaves under approximation.

The standard approximation results allow us to relax both the mediation and redundancy conditions. So, we can weaken the requirement that the latents exactly mediate between features under each model to allow for approximate mediation, and we can weaken the requirement that information about $\Lambda^1$ be exactly redundantly represented to allow for approximately redundant representation. In both cases, we use the KL-divergences associated with the relevant graphs in the previous sections to quantify the approximation. The standard results then say that $\Lambda^1$ is approximately a stochastic function of $\Lambda^2$, i.e. $\Lambda^2$ contains all the information in $X$ relevant to $\Lambda^1$, to within the approximation bound (measured in bits).
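As a concrete toy check (our own example, not from the post; the specific numbers and helper names are ours): take a latent coin bias $\Lambda$ uniform on $\{0.3, 0.7\}$ and $n$ i.i.d. flips $X_1, \dots, X_n$. Mediation is exact by construction, and the redundancy error can be measured as the conditional mutual information $I(\Lambda; X_i \mid X_{\bar{i}})$ in bits, i.e. the KL-divergence associated with the redundancy graph. A minimal sketch by brute-force enumeration:

```python
import itertools
import math

# Toy model: latent coin bias Lambda uniform on {0.3, 0.7}, with n i.i.d.
# flips X_1..X_n given Lambda. Mediation (flips independent given Lambda)
# holds exactly by construction; redundancy holds only approximately.
# The redundancy error for feature i is eps_i = I(Lambda; X_i | X_{-i}), in bits.

biases = [0.3, 0.7]
n = 8

def p_flips(x, lam):
    """P[X = x | Lambda = lam] for a tuple of 0/1 flips."""
    return math.prod(lam if xi else 1.0 - lam for xi in x)

def p_joint(x, lam):
    """P[X = x, Lambda = lam], with a uniform prior over the biases."""
    return p_flips(x, lam) / len(biases)

# By symmetry every i gives the same error, so drop the last flip.
eps = 0.0
for x in itertools.product([0, 1], repeat=n):
    rest = x[:-1]
    p_x = sum(p_joint(x, lam) for lam in biases)
    p_rest = sum(p_joint(rest, lam) for lam in biases)  # flips are i.i.d.
    for lam in biases:
        p_xl = p_joint(x, lam)
        p_rest_l = p_joint(rest, lam)
        eps += p_xl * math.log2((p_xl * p_rest) / (p_rest_l * p_x))

print(f"redundancy error I(Lambda; X_i | X_rest) ~ {eps:.4f} bits")
```

Increasing $n$ drives the error toward zero: with many flips, any $n-1$ of them already pin down the posterior on the bias, so the remaining flip adds almost no information about $\Lambda$.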

The main remaining loophole is the tiny mixtures problem: arguably-small differences in the two agents’ predictive distributions can sometimes allow large failures in the theorems. On the other hand, our two hypothetical agents could in principle resolve such differences via experiment, since they involve different predictions.

Why Is This Interesting?

This argument was one of our earliest motivators for natural latents. It’s still the main argument we have which singles out natural latents in particular - i.e. the conclusion says that the natural latent conditions are not only sufficient for the property we want, but necessary. Natural latents are the only way to achieve the guarantee we want, that our latent can be expressed in terms of any other latents which explain all interactions between features in the predictive distribution.

  1. ^

    Note that, in invoking the Fundamental Theorem, we also implicitly put weight on the assumption that the two agents' latents have nothing to do with each other except via the data. That particular assumption can be circumvented or replaced in multiple ways - e.g. we could instead construct a new latent via resampling, or we could add an assumption that either $\Lambda^1$ or $\Lambda^2$ has low entropy given $X$.

1 comment


comment by ryan_greenblatt · 2024-10-15T00:53:55.478Z · LW(p) · GW(p)

The setup here implies an empirical (but conceptually tricky) research direction: take two different AIs trained to do the same prediction task (e.g. predict next tokens of webtext) and try to correspond their internal structure in some way.

It's a bit unclear to me what the desiderata for this research should be. I think we ideally want something like a "mechanistic correspondence", something like a heuristic argument [LW · GW] that the two models produce the same output distribution when given the same input.

Back when Redwood was working on model internals and interp, we were somewhat excited about trying to do something along these lines. Probably something trying to use automated methods to do a correspondence that seems accurate based on causal scrubbing [LW · GW].

(I haven't engaged much with this post overall, I just thought this connection might be interesting.)