Proof idea: SLT to AIT

lblack

Proof idea: SLT to AIT

post by Lucius Bushnaq (Lblack) · 2025-02-10T23:14:24.538Z · LW · GW · 15 comments

  Background:
  Proof idea outline:
    Setup: Predicting a stochastic process
    Claims I want to prove:
  Comments:
None
15 comments

I think we may be able to prove that Bayesian learning on transformers^[1] or recurrent neural networks with a uniform^[2] prior over parameters is equivalent to a form of Solomonoff induction [? · GW] over a set of computationally-bounded programs. This bounded Solomonoff induction would still be 'approximately optimal' in a sense, being able to predict the data about as well any other bounded prediction procedure included in the set of programs it runs over. This proof would link Singular Learning Theory [? · GW] (SLT) back to basic Algorithmic Information Theory (AIT).

This post is my current early-stage sketch of the proof idea. Don't take it too seriously yet. I’m writing this out mostly to organise my own thoughts. I'd originally planned for it to be a shortform, but I think it ended up a bit too long for that.

Background:

I recently held a small talk presenting an idea for how and why deep learning generalises. Slides for the talk here, slide discussion here [LW · GW].

In the talk, I tried to reduce concepts from SLT^[3] back to AIT^[4]. I sketched a story about deep learning, or perhaps even learning more generally, that goes like this:

Bayesian learning on (recurrent) neural networks is equivalent to a form of Solomonoff induction running over a set of programs bounded in length, runtime and memory usage.
Using SGD/genetic algorithms/your-fancy-update-method-of-choice to train a neural network is then a cheap bargain bin^[5] approximation of Bayesian learning on the neural network. Training steps are biased to make simple updates rather than complex updates because exponentially more parameter configurations in the architecture correspond to simpler programs.

Now, I want to actually prove this story. Specifically, I want to prove the first part: That Bayesian learning on transformers or RNNs is equivalent to a computationally bounded form of Solomonoff Induction (SI), in a sense I want to make precise. I also want to show that this bounded SI is a sensible approximation of actual unbounded SI, given an assumption that the data is 'efficiently predictable'. That is, we assume that it is possible at all to predict the data to good-enough accuracy with a limited computational budget. See e.g. these posts [1 [LW · GW],2 [LW · GW],3 [LW · GW]] if you want some idea of why we might think that this is a reasonable assumption to make for a lot of real-world data.^[6]

In the following, I’ll sketch a skeleton of the proof as I envision it right now, dividing it up into individual claims that seem tractable to prove to me. I’ve only just started on this, so some of the details here are probably wrong. Other details of the setup might just turn out to be hard to work with and get changed out for something better.

Proof idea outline:

Setup: Predicting a stochastic process

We want to learn to predict some stochastic process from inputs to outputs $y$ . We'll model the process as some function $f$ that takes two inputs: One input $x$ that we get to see, and one input $h$ we don't get to see, and returns one output $y$ . $h$ is included to represent potential randomness in the process. It is drawn from some distribution with entropy $H (h)$ .

Specifically, we define $f : (x, h) \to y$ , where $f$ is a function, $x \in {0, 1}^{n}$ and $h \in {0, 1}^{n^{'}}$ are both tuples of $n, n^{'}$ booleans respectively, and $y \in {0, 1}$ is a single boolean. We get to observe many pairs $(x, y)$ .

Our prediction for $y$ will be probabilistic. Without loss of generality, we will make our prediction by naming the probability $p (y | x)$ that $y$ is a $1$ rather than a $0$ . We’re about to think about this with programs, so we’ll restrict our $p$ to be floating point numbers of some finite precision $b$ .

Claims I want to prove:

Claim 1: The ‘best’ way to make this prediction is to run a version of Solomonoff induction. By best, I mean we get a guarantee about the total error of our prediction ala Solomonoff completeness. Specifically, we take a Universal Turing Machine(UTM) $M_{1}$ , which accepts the input $x$ and some program $P$ that outputs $b$ bits, representing the probability $p (y | x)$ . We then run Solomonoff induction (SI) over the set of all such programs $P$ to make our prediction. The total error of our prediction in terms of KL-divergence measured in bits, across $D$ data points, should then be bounded below $\leq C (f, M_{1}) + D H (h)$ , where $C (f, M_{1})$ is the length of the shortest program that implements $f$ on the UTM $M_{1}$ and $H (h)$ is the entropy of our random variable $h$ .

Proof status: This is just a slight variation of the textbook SI setup I am familiar with. I hope a proof of this already exists.

Claim 2: Solomonoff induction over length-bounded UTMs can still make approximately optimal predictions if $f$ is simple. Say the function $f$ has a description length of $K$ bits in our UTM $M_{1}$ . If we know or heavily suspect that $K << N$ , we can run a version of SI that excludes programs above $N$ bits in length, and the result will still be very close to optimal in the sense described in claim 1. This length-bounded version would use a UTM $M_{2}$ that accepts plain program codes, not prefix-free program codes. It’d runs over all possible programs of length $N$ , and assign a uniform prior $2^{- N}$ to every program.

Proof status: I’d guess this proof might also exist somewhere already. If not, I think we can show it? See the discussion here [LW(p) · GW(p)] and here [LW(p) · GW(p)]. Also, I’ve since been pointed to the central lemma I think you’d want for this in this book on pages 145-146.

Claim 3: If we further limit our Solomonoff induction to only include programs of runtime $T$ and space $s$ , our prediction can still be approximately optimal in a sense. For this, we will assume that $f (x, h) = y$ is ‘efficiently predictable’. By ‘efficiently predictable’, we mean that there is some program $P^{*}$ of length $| P^{*} | < N$ , that could be instantiated on $M_{2}$ , requiring $s$ space and $T$ runtime, which makes predictions $p (y | x)$ that we consider ‘good-enough’ for our purposes. That is, the expected KL-divergence of its predictions from the actual distribution of $y$ is small enough that we would be happy with it. The time and space bounded SI is then approximately optimal in the sense that its total error compared to this efficient predictor, as measured by KL-divergence from the predictions made by $P^{*}$ , will be $\leq | P^{*} |$ bits summed across all data points.^[7]

Proof status: Seems doable? We can just walk through the same style of argument used to show Solmonoff completeness for unbounded SI: Our SI will catch up to any such $P^{*}$ about as fast as programs simpler than $P^{*}$ are excluded by the training data.

Claim 4: Our bounded Solomonoff Induction is not very sensitive to the UTM we run it on. Just as in regular Solomonoff induction. If we use another UTM $M_{3}$ that can be implemented in the previous UTM $M_{2}$ using $N^{'} << N$ bits, bounded SI on $M_{3}$ restricted to programs $N + N^{'}$ bits long or more, with $O (s T)$ runtime and $O (s)$ space, will still be approximately-optimal in the sense specified in claim 3, though our total error across all data compared to the efficient predictor will now be bounded to $\leq | P^{*} | + N^{'}$ bits instead of $\leq | P^{*} |$ bits.

Proof status: I think we can just use existing results on efficient simulation to show this?

Claim 5: Transformers and RNNs are equivalent to a bounded UTM for our purposes. Consider e.g. a transformer in inference mode, with context window $T$ and $O (N)$ parameters. At the end of each forward pass, its output is fed back into its input layer. This happens for a total of $T$ forward passes. On the first forward pass, it is passed the input data $x$ , padding the rest of the input with zeroes. On the final forward pass, its output is interpreted as the probability $p (y | x)$ . This transformer is equivalent to a bounded UTM accepting programs of length $O (N)$ , with a maximum runtime of $O (T)$ , with a space bound determined by $T$ and the residual stream width.

Proof status: Some proofs along these lines already exist [1,2], though they seem to have minor problems. Mostly, I'm peeved that they use hardmax instead of softmax in the attention. This seems like a choice made for purely technical reasons though, they just didn't want a function that doesn't return rational numbers because that's annoying for their proof.

Claim 6: Therefore, Bayesian learning on a transformer or RNN is a form of bounded Solomonoff induction, with the approximate optimality properties of the bounded induction outlined in claims 3 and 4. Since SLT^[3] is the theory describing Bayesian learning on neural networks, we have now shown SLT to be describing a form of Solomonoff induction.

Proof status: Immediately follows if we prove the other five claims.

If anyone points out errors or confusions in any of these claims or proof ideas, I'd appreciate it a lot.

Comments:

On the meaning of the learning coefficient: Since the SLT^[3] posterior would now be proven equivalent to the posterior of a bounded Solomonoff induction, we can read off how the (empirical) learning coefficient $λ$ in SLT relates to the posterior in the induction, up to a conversion factor equal to the floating point precision of the network parameters.^[8] This factor is there because SLT works with real numbers whereas AIT^[4] works with bits. Also, note that for non-recursive neural networks like MLPs, this proof sketch would suggest that the learning coefficient is related to something more like circuit complexity than program complexity. So, the meaning of $λ$ from an AIT perspective depends on the networks architecture. It's (sort of)^[9] K-complexity related for something like an RNN or a transformer run in inference mode, and more circuit complexity related for something like an MLP.

On real world efficiency: This would be an AIT^[4]-style proof. None of this says very much about efficiency in the real world. It is not telling us whether we ought to use transformers or RNNs or Mamba for maximum efficiency, ReLUs or sigmoids, or anything of the kind. It's just showing, starting from an AIT perspective, why anything in a superset that includes all of these options would be able to do learning at all.

On wider applicability: While the proof will be for deep learning on recurrent neural networks, I think the implications here could be much broader than that. I think this story of how learning is possible at all might also apply to in-context learning, and perhaps even many alternate setups outside of ‘deep learning’ that have not been invented yet.

Thanks to Stefan Heimersheim for proofreading and comments. Thanks to Dmitry Vaintrob, Kaarel Hänni and Linda Linsefors for discussion. Thanks again to all of these [LW(p) · GW(p)] people. Thanks to everyone else at Apollo who endured my rants about this.

^{^}
Transformers run in inference mode. So, imagine something like RL training, but with Bayesian updating instead of gradient descent.
^{^}
I don't think using a Gaussian or some other typical NN initialisation scheme instead would change much, but I plan to prove it for a uniform distribution first because that seems more straightforward.
^{^}
Singular Learning Theory
^{^}
Algorithmic Information Theory
^{^}
Some things Bayesian learning can do SGD just can't do. See e.g. the parity learning problem [LW · GW].
^{^}
Also, observe how a lot of the real world does not seem like completely unpredictable tv static to your computationally limited brain.
^{^}
Credit to the logical induction paper for inspiring this idea. Seeing their proof that LIA isn’t exploitable by other traders than run in polynomial time is what got me thinking about this.
^{^}
Ok, really $λ$ should probably be related to the posterior of the induction on a UTM by first dividing it by the floating point precision of the neural network parameters, then letting that floating point precision tend to infinity.
^{^}
Only sort-of related because our prior only has support on programs bounded in runtime, space and description length, not on all programs. Also, K-complexity and (semi)-measure in the Solomonoff prior aren't identical [LW(p) · GW(p)] even in the conventional unbounded version of Solomonoff induction.

15 comments

Comments sorted by top scores.

comment by Daniel Murfet (dmurfet) · 2025-03-05T05:54:57.326Z · LW(p) · GW(p)

I don't believe Claim 6 is straightforward. Or to be more precise, the closest detailed thing I can see to what you are saying does not obviously lead to such a relation.

I don't see any problem with the discussion about Transformers or NNs or whatever as a universal class of models. However then I believe you are discussing the situation in Section 3.7.2 of Hutter's "Universal artificial intelligence", also covered in his paper "On the foundations of universal sequence prediction" (henceforth Hutter's book and Hutter's paper). The other paper I'm going to refer to below is Sterkenburg's "Solomonoff prediction and Occam's razor". I know a lot of what I write below will be familiar to you, but for the sake of saying clearly what I'm trying to say, and for other readers, I will provide some background.

Background

A reminder on Hutter's notation: we have a class of semi-measures over sequences $x \in X^{*}$ which I'll assume satisfy $ρ (x) = \sum_{a \in X} ρ (x a)$ so that we interpret $ρ (x)$ as the probability that a sequence starts with $x$ . Then $μ = μ_{θ_{0}}$ denotes the true generating distribution (we assume this is in the class, e.g. in the case where $Θ$ parametrises Turing machine codes, that the environment is computable). There are data sequences $x_{1 : n} = x_{1} \dots x_{n}$ and the task is to predict the next token $x_{n + 1}$ , we view $ρ (a | x) := ρ (x a) / ρ (x)$ as the probability according to $ρ$ that the next symbol is $a$ given $x$ .

If $M$ is countable and we have weights $w_{ρ}$ for each $ρ \in M$ satisfying $\sum_{ρ \in M} w_{ρ} \leq 1, w_{ρ} > 0$ then we can form the Bayes mixture $ξ (x) = \sum_{ρ \in M} w_{ρ} ρ (x)$ . The gap between the predictive distribution associated to this mixture and the true distribution is measured by

$\sum_{t = 1}^{n} E [s_{t}] \leq D_{n} (μ | | ξ) = E [\frac{μ (x_{1 : n})}{ξ (x_{1 : n})}]$

where $s_{t}$ is for example the KL divergence between $μ (- | x_{< t})$ and $ξ (- | x_{< t})$ . It can be shown that $D_{n} (μ | | ξ) \leq log (w_{μ}^{- 1})$ and hence this also upper bounds $\sum_{t = 1}^{\infty} E [s_{t}]$ which is the basis for the claim that the mixture converges rapidly (in $n$ ) to the true predictive distribution provided of course the weight is nonzero for $μ$ . This is (4) in Hutter's paper and Theorem 3.19 in his book.

It is worth noting that

$ξ (a | x_{< t}) = \frac{ξ (x_{< t} a)}{ξ (x_{< t})} = \sum_{ρ \in M} \frac{w_{ρ} ρ (x_{< t} a)}{\sum_{τ \in M} w_{τ} τ (x_{< t})}$

which can be written as

$ξ (a | x_{< t}) = \sum_{ρ \in M} w_{ρ | σ} ρ (a | x_{< t})$

where $w_{ρ | x_{< t}} = \frac{w_{ρ} ρ (x_{< t})}{\sum_{τ \in M} w_{τ} τ (x_{< t})}$ is the posterior distribution. In this sense $ξ$ is the Bayesian posterior predictive distribution given $x_{< t}$ .

Note this doesn't really depend on the choice of weights (i.e. the prior), provided the environment is in the hypothesis class and is given nonzero weight. This is an additional choice, and one can motivate on various grounds the choice of weights $w_{ρ} = 2^{- K (ρ)}$ with Kolmogorov complexity $K$ of the hypothesis $ρ$ . When one makes this choice, the bound becomes

$\sum_{t = 1}^{\infty} E [s_{t}] \leq log (2^{K (μ)}) = K (μ) log 2$ .

As Hutter puts it on p.7 of his paper "the number of times $ξ$ deviates from $μ$ by more than $ε > 0$ is bounded by $O (K (μ))$ , i.e. is proportional to the complexity of the environment". The above gives the formal basis for (part of) why Solomonoff induction is "good". When you write

The total error of our prediction in terms of KL-divergence measured in bits, across $D$ data points, should then be bounded below $\leq C (f, M_{1}) + D H (h)$ , where $C (f, M_{1})$ is the length of the shortest program that implements $f$ on the UTM

I believe you are referring to some variant of this result, at least that is what Solomonoff completeness means to me.

It is important to note that these two steps (bounding $D_{n}$ above for any weights, and choosing the weights ala Solomonoff to get a relation to K-complexity) are separable, and the first step is just a general fact about Bayesian statistics. This is covered well by Sterkenburg.

Continuous model classes

Ok, well and good. Now comes the tricky part: we replace $Θ$ by an uncountable set and try to replace sums by integrals. This is addressed in Section 3.7.2 of Hutter's book and p.5 of his paper. This treatment is only valid insofar as Laplace approximations are valid, and so is invalid when $Θ$ is a class of neural network weights and $μ_{θ}$ involves predicting sequences based on those networks in such a way that degeneracy is involved. This is the usual setting in which classical theory fails and SLT is required. Let us look at the details.

Under a nondegeneracy hypothesis (Fisher matrix $_{n}$ invertible at the true parameter) one can prove (this is in Clarke and Barron's "Information theoretic asymptotics of Bayes methods" from 1990, also see Balasubramanian's "Statistical Inference, Occam's razor, and Statistical Mechanics on the Space of Probability Distributions" from 1997 and Watanabe's book "Mathematical theory of Bayesian statistics", aka the green book) that

$D_{n} (μ | | ξ) \leq log (w (μ)^{- 1}) + \frac{d}{2} log (\frac{n}{2 π}) + \frac{1}{2} log det (_{n}) + o (1)$

where now $ξ (x) = \int_{Θ} d θ w (θ) μ_{θ} (x)$ and $Θ \subseteq R^{d}$ has nonempty interior, i.e. is actually $d$ dimensional. Here we see that in addition to the $log (w (μ)^{- 1})$ terms from before there are terms that depend on $n$ . The log determinant term is treated a bit inelegantly in Clarke and Barron and hence Hutter, one can do better, see Section 4.2 of Watanabe's green book, and in any case I am going to ignore this as a source of $n$ dependence.^[1]

Note that there is now no question of bounding $\sum_{t = 1}^{\infty} E [s_{t}]$ since the right hand side (the bound on $D_{n} (μ | | ξ)$ ) increases with $n$ . Hutter argues this "still grows very slowly" (p.4 of his paper) but this seems somewhat in tension with the idea that in Solomonoff induction we sometimes like to think of all of human science as the context when we predict the next token ( $n$ here being large). This presents a conceptual problem, because in the countable case we like to think of K-complexity as an important part of the story, but it "only" enters through the choice of weight and thus the $log (w_{μ}^{- 1})$ term in the bound on $D_{n} (μ | | ξ)$ , whereas in the continuous case this constant order term in $n$ may be very small in comparison to the $log (n)$ term.

In the regular case (meaning, where the nondegeneracy of the Fisher information at the truth holds) it's somewhat reasonable to say that at least the $log (n)$ term doesn't know anything about the environment (i.e. the coefficient $\frac{d}{2}$ depends only on the parametrisation) and so the only environment dependence is still something that involves $K (μ)$ , supposing we (with some normalisation) took $w (μ)$ to have something to do with $K (μ)$ . However in the singular case as we know from SLT, the appropriate replacement^[2] for this bound has a coefficient $λ (μ)$ of $log (n)$ which also depends on the environment. I don't see a clear reason why we should care primarily about a subleading term (the K-complexity, a constant order term in the asymptotic expansion in $n$ ) over the leading term (we are assuming the truth is realisable, so there is no order $n$ term).

That is, as I understand the structure of the theory, the centrality of K-complexity to Solomonoff induction is an artifact of the use of a countable hypothesis class. There are various paragraphs in e.g. Hutter's book Section 3.7.2 and p.12 of his paper which attempt to dispel the "critique from continuity" but I don't really buy them (I didn't think too hard about them, and only discussed them briefly with Hutter once, so I could be missing something).

Of course it is true that there is an $n$ large enough that the behaviour of the posterior distribution over neural networks "knows" that it is dealing with a discrete set, and can distinguish between the closest real numbers that you can represent in floating point. For values of $n$ well below this, the posterior can behave as if it is defined over the mathematically idealised continuous space. I find this no more controversial than the idea that sound waves travelling in solids can be well-described by differential equations. I agree that if you want to talk about "training" very low precision neural networks maybe AIT applies more directly, because the Bayesian statistics that is relevant is that for a discrete hypothesis class (this is quite different to producing quantised models after the fact). This seems somewhat but not entirely tangential to what you want to say, so this could be a place where I'm missing your point. In any case, if you're taking this position, then SLT is connected only in a very trivial way, since there are no learning coefficients if one isn't talking about continuous model classes.

To summarise: K-complexity usually enters the theory via a choice of prior, and in continuous model classes priors show up in the constant order terms of asymptotic expansions in $n$ .

From AIT to SLT

You write

On the meaning of the learning coefficient: Since the SLT^[3] posterior would now be proven equivalent to the posterior of a bounded Solomonoff induction, we can read off how the (empirical) learning coefficient $λ$ in SLT relates to the posterior in the induction, up to a conversion factor equal to the floating point precision of the network parameters.^[8] This factor is there because SLT works with real numbers whereas AIT^[4] works with bits. Also, note that for non-recursive neural networks like MLPs, this proof sketch would suggest that the learning coefficient is related to something more like circuit complexity than program complexity. So, the meaning of $λ$ from an AIT perspective depends on the networks architecture. It's (sort of)^[9] K-complexity related for something like an RNN or a transformer run in inference mode, and more circuit complexity related for something like an MLP.

I don't claim to know precisely what you mean by the first sentence, but I guess what you mean is that if you use the continuous class $M$ of predictors with $Θ$ parametrising neural network weights, running the network in some recurrent mode (I don't really care about the details) to make the predictions about sequence probabilities, then you can "think about this" both in an SLT way and an AIT way, and thus relate the posterior in both cases. But as far as I understand it, there is only one way: the Bayesian way.

You either think of the NN weights as a countable set (by e.g. truncating precision "as in real life") in which case you get something like $\sum_{t = 1}^{\infty} E [s_{t}] \leq log (w_{μ}^{- 1})$ but this is sort of weak sauce: you get this for any prior you want to put over your discrete set of NN weights, no implied connection to K-complexity unless you put one in by hand by taking $w_{μ} = 2^{- K (μ)}$ . This is legitimately talking about NNs in an AIT context, but only insofar the existing literature already talks about general classes of computable semi-measures and you have described a way of predicting with NNs that satisfies these conditions. No relation to SLT that I can see.

Or you think of the NN weights as a continuous set in which case the sums in your Bayes mixture become integrals, the bound on $D_{n} (μ | | ξ)$ becomes more involved and must require an integral (which of course has the conceptual content of "bound $ξ (x_{1 : n})$ below by contributions from a neighbourhood of $μ (x_{1 : n})$ and that will bound $D_{n}$ above by something to do with $μ$ " either by Laplace or more refined techniques ala Watanabe) and you are in the situation I describe above where the prior (which you can choose to be related to $2^{- K}$ if you wish) ends up in the constant term and isn't the main determinant of what the posterior distribution, and thus the distance between the mixture and the truth, does.

That is, in the continuous case this is just the usual SLT story (because both SLT and AIT are just the standard Bayesian story, for different kinds of models with a special choice of prior in the latter case) where the learning coefficient dominates how the Bayesian posterior behaves with $n$ .

So for there to be a relation between the K-complexity and learning coefficient, it has to occur in some other way and isn't "automatic" from formulating a set of NNs as like codes for a UTM. So this is my concern about Claim 6. Maybe you have a more sophisticated argument in mind.

Free parameters and learning coefficients

In a different setting I do believe there is such a relation, Theorem 4.1 of Clift-Murfet-Wallbridge (2021) as well as Tom Waring's thesis make the point that unused bits in a TM code are a form of degeneracy, when you embed TM codes into a continuous space of noisy codes. Then the local learning coefficient at a TM code is upper bounded by something to with its length (and if you take a function and turn it into a synthesis problem, the global learning coefficient will therefore be upper bounded by the Kolmogorov complexity of the function). However the learning coefficient contains more information [LW · GW] in this case.

Learning coefficient vs K

In the NN case the only relation between the learning coefficient of a network parameter $θ$ and the Kolmogorov complexity $K (θ)$ that I know follows pretty much immediately from Theorem 7.1 (4) of Watanabe's book (see also p.5 of our paper https://arxiv.org/abs/2308.12108). You can think of the following as one starting point of the SLT perspective on MDL, but my PhD student Edmund Lau has a more developed story in his PhD thesis.

Let $θ^{*}$ be a local minimum of a population loss $L$ (I'm just going to use the notation from our paper) and define for some $ε > 0$ the quantity $V (ε)$ to be the volume of the set ${θ : | L (θ) - L (θ^{*}) | < ε}$ , regularised to be in some ball and with some appropriately decaying measure if necessary, none of this matters much for what I'm going to say. Suppose we somehow produce a parameter $θ_{0} \in V (ε)$ , the bit cost of this is ignored in what follows, and that we want to refine this to a parameter $θ^{near}$ within an error tolerance $η$ (say set by our floating point precision) that is, $θ^{near} = θ_{n} \in V (η)$ by taking a sequence of increasingly good approximations

$θ_{0}, θ_{1}, \dots, θ_{i}, θ_{i + 1}, \dots, θ_{n}$

such that at each stage, $θ_{i} \in V (\frac{ε}{2^{i}})$ and $\frac{ε}{2^{n}} = η$ , so $n = {log}_{2} (\frac{ε}{η})$ . The aforementioned results say that (ignoring the multiplicity) the bit cost of each of these refinements is approximately the local learning coefficient $λ (θ^{*})$ . So the overall length of the description of $θ_{n}$ given $θ_{0}$ done in this manner is $λ (θ^{*}) {log}_{2} (\frac{ε}{η})$ . This suggests

$K (θ^{near} | θ_{0}) \leq λ (θ^{*}) {log}_{2} (\frac{ε}{η})$

We can think of $M = {log}_{2} (\frac{ε}{η})$ as just the measure of the number of orders of magnitude covered by our floating point representation for losses.

^{^}
There is arguably another gap in the literature here. Besides this regularity assumption, there is also the fact that the main reference Hutter is relying on (Clarke and Barron) works in the iid setting whereas Hutter works in the non-iid setting. He sketches in his book how this doesn't matter and after thinking about it briefly I'm inclined to agree in the regular case, I didn't think about it generally. Anyway I'll ignore this here since I think it's not what you care about.
^{^}
Note that SLT contains asymptotic expansions for the free energy, whereas $D_{n} (μ | | ξ)$ looks more like the KL divergence between the truth and the Bayesian posterior predictive distribution, so what I'm referring to here is a treatment of the Clarke-Barron setting using Watanabe's methods. Bin Yu asked Susan Wei (a former colleague of mine at the University of Melbourne, now at Monash University) and I if such a treatment could be given and we're working on it (not very actively, tbh).

Replies from: Lblack

↑ comment by Lucius Bushnaq (Lblack) · 2025-03-05T07:35:52.364Z · LW(p) · GW(p)

You either think of the NN weights as a countable set (by e.g. truncating precision "as in real life") in which case you get something like but this is sort of weak sauce: you get this for any prior you want to put over your discrete set of NN weights, no implied connection to K-complexity unless you put one in by hand by taking $w_{μ} = 2^{- K (μ)}$ .

No, you don't need to put in $w_{μ} = 2^{- K (μ)}$ by hand. A uniform prior over NN weights does the job.^[1]

The trick is that a transformer run in recurrent mode can

Simulate a (time and space bounded) UTM in a few transformer blocks
Use the other transformer blocks to store program code to feed that UTM as input.

A uniform prior over neural network parameters then effectively implies a uniform prior over programs to run on the simulated UTM, modulo the bit specification cost of the UTM simulator and the storage setup. Because for every bit of program code less we need to store, we get degrees of freedom in the weights.
Since induction with a uniform prior on the input strings to a plain monotone UTM effectively gets us a weighting of hypotheses that’s exponential in K-complexity, we'll get an error bound with a term proportional to $K (μ)$ , plus an offset term for specifying the UTM and storage in the transformer weights.

For the sake of concreteness: If I partially adapted your notation, and went to the special case where the data-generating process $μ$ is exactly realisable in the weights of the transformer^[2], I'd currrently seem to get a bound $D (μ | | ξ) \leq f log 2 K (M, F) + f log 2 (\frac{9}{2} + \frac{5}{2 d}) K (μ, M) + O (1)$ .^[3]

Here, $f$ is the number of bits per neural network parameter^[4], $K (M, F)$ is the number of parameters needed to implement the UTM $M$ on the recurrent transformer architecture $F$ , $K (μ, M)$ is the K-complexity of data-generating program $μ$ on the UTM $M$ in bits, and $d$ is the width of the residual stream.

The prefactor $(\frac{9}{2} + \frac{5}{2 d})$ is there because my current construction is stupidly inefficient at storing program code in the weights. I think it ought to be possible to do better, and get this down to a 1. Don't quote me on that though, I don't have a proof yet.

If we don't assume realisability, we can instead take any 'efficient predictor' program $p^{*}$ that is realisable on the transformer, and get the bound

$D (μ | | ξ) \leq D (μ | | p^{*}) + f log 2 K (M, F) + f log 2 (\frac{9}{2} + \frac{5}{2 d}) K (p^{*}, M) + O (1)$ .

So to summarise

K-complexity usually enters the theory via a choice of prior, and in continuous model classes priors show up in the constant order terms of asymptotic expansions in $n$ .

The result here is exactly that we don't need to put in the K-complexity^[5] via choice of prior. If we're using a recurrent neural network, the K-complexity is in the prior already, just as it is on a plain monotone UTM. The architecture itself is what implements the bias toward simplicity.

Note also that in the case of continuous parameters, so bits per float $f$ going to infinity, the K-complexity terms in the bound do not become constant order terms, because they have $f$ as a prefactor. This is one way to start seeing that the K-complexity and the learning coefficient are pretty directly related quantities in the setting of recurrent neural networks.

^{^}
I expect a Gaussian prior or anything else of the sort probably works as well, and yields a nigh-identical bound. But I haven’t shown that yet.^[6]
^{^}
My actual bound doesn't need that assumption. Getting rid of the realisability assumption is what the effective predictor stuff is all about.
^{^}
The $O (1)$ terms become increasingly irrelevant as float precision $f$ gets larger. Basically, I'm using large negative biases to zero out storage neurons that are not needed. In the continuum limit, this would make the weights connecting to those neurons degenerate, and we could integrate them out of the measure. But since we're in the discrete setting, we have to keep track of the fact that very large magnitudes of the weights that overwhelm the negative biases and switch the neuron on again aren't allowed. This makes our volume of allowed parameter configurations just a little bit smaller.
^{^}
So, $f = 8$ for 8-bit floats, $f = 16$ for 16-bit floats, etc. .
^{^}
Defined relative to a time and space bounded universal Turing machine.
^{^}
EDIT: As in I haven't shown it in the case of finite float precision NN parameters yet. It of course straightforwardly follows in the SLT setting where NN parameters are real numbers and we consider the limit of number of datapoints going to infinity. The shape of the prior can't matter much there, as you say.

Replies from: dmurfet

↑ comment by Daniel Murfet (dmurfet) · 2025-03-05T11:29:26.356Z · LW(p) · GW(p)

Where is here?

Replies from: Lblack

↑ comment by Lucius Bushnaq (Lblack) · 2025-03-05T11:43:29.617Z · LW(p) · GW(p)

Doesn't exist.^[1] If is finite, you can insert AIT-style inequalities into the posterior to get bounds like the one I wrote above. This is neat if you e.g. have more than $2^{f}$ datapoints.

If f is infinite, you probably want to expand in $n$ instead. I haven't done that yet, but I expect to get a bound that looks a lot like the standard free energy formula, with the K-complexity terms in the bound I wrote above showing up where the learning coefficient would usually be. The $f$ prefactor probably gets swapped out for a $\frac{log n}{2}$ .

It'd still be an upper bound, not an equality, just as in AIT. The learning coefficient can still be smaller than this. This makes sense to me. There might be less complicated ways for the transformer to make an efficient prediction than simulating a UTM and running some program on it.

^{^}
Except for the implicit dependence in $D (μ | | ξ)$ and $D (μ | | p^{*})$ , since those are the KL-divergences summed over $n$ datapoints.

Replies from: dmurfet

↑ comment by Daniel Murfet (dmurfet) · 2025-03-10T06:39:09.865Z · LW(p) · GW(p)

Ok makes sense to me, thanks for explaining. Based on my understanding of what you are doing, the statement in the OP that in your setting is "sort of" K-complexity is a bit misleading? It seems like you will end up with bounds on $D (μ | | ξ)$ that involve the actual learning coefficient, which you then bound above by noting that un-used bits in the code give rise to degeneracy. So there is something like $λ \leq K$ going on ultimately.

If I understand correctly you are probably doing something like:

Identified a continuous space $W$ (parameters of your NN run in recurrent mode)
Embedded a set of Turing machine codes $W^{c o d e}$ into $W$ (by encoding the execution of a UTM into the weights of your transformer)
Used $p (y | x, w)$ parametrised by the transformer, where $w \in W$ to provide what I would call a "smooth relaxation" of the execution of the UTM for some number of steps
Use this as the model in the usual SLT setting, and then noted that because of the way you encoded the UTM and its step function, if you vary $w$ away from the configuration corresponding to a TM code $[M]$ in a bit of the description that corresponds to unused states or symbols, it can't affect the execution and so there is degeneracy in the KL divergence $K$
Hence, $λ ([M]) \leq len ([M])$ and if then repeating this over all TMs $M$ which perfectly fit the given data distribution, we get a bound on the global $λ \leq K$ .

Proving Theorem 4.1 was the purpose of Clift-Wallbridge-Murfet, just with a different smooth relaxation. The particular smooth relaxation we prefer for theoretical purposes is one coming from encoding a UTM in linear logic, but the overall story works just as well if you are encoding the step function of a TM in a neural network and I think the same proof might apply in your case.

Anyway, I believe you are doing at least several things differently: you are treating the iid case, you are introducing $D (μ | | ξ)$ and the bound on that (which is not something I have considered) and obviously the Transformer running in recurrent mode as a smooth relaxation of the UTM execution is different to the one we consider.

From your message it seems like you think the global learning coefficient might be lower than $K$ , but that locally at a code the local learning coefficient might be somehow still to do with description length? So that the LLC in your case is close to something from AIT. That would be surprising to me, and somewhat in contradiction with e.g. the idea from simple versus short [LW · GW] that the LLC can be lower than "the number of bits used" when error-correction is involved (and this being a special case of a much broader set of ways the LLC could be lowered).

Replies from: Lblack

↑ comment by Lucius Bushnaq (Lblack) · 2025-03-10T09:04:07.877Z · LW(p) · GW(p)

Based on my understanding of what you are doing, the statement in the OP that in your setting is "sort of" K-complexity is a bit misleading?

Yes, I guess it is. In my (weak) defence, I did put a '(sort of)' in front of that.

In my head, the relationship between the learning coefficient and the K-complexity here seems very similar-ish to the relationship between the K-complexities of a hypothesis $h$ expressed on two different UTMs.

If we have a UTM $M_{1}$ and a different UTM $M_{2}$ , we know that $K (h, M_{2}) \leq K (h, M_{1}) + K (M_{1}, M_{2})$ , because if nothing else we can simulate UTM $M_{1}$ on UTM $M_{2}$ and compute $h$ on the simulated $M_{1}$ . But in real life, we'd usually expect the actual shortest program that implements $h$ on $M_{1}$ to not involve jumping through hoops like this.

In the case of translating between a UTM and a different sort of Turing-complete model of computation, namely a recurrent neural network^[1], I was expecting a similar sort of dynamic: If nothing else, we can always implement $h$ on the NN by simulating a UTM, and running $h$ on that simulated UTM. So the lowest LLC parameter configuration that implements $h$ on the NN has to have an LLC that is as small or smaller as the LLC of a parameter configuration that implements $h$ through this simulation route. Or that was the intuition I had starting out anyway.

If I understand correctly you are probably doing something like:

Seems broadly right to me except:

Third bullet point: I don't know what you mean by a "smooth relaxation" precisely. So while this sounds broadly correct to me as a description of what I do, I can't say for sure.
Sixth bullet point: You forgot the offset term for simulating the UTM on the transformer. Also, I think I'd get a constant prefactor before $len ([M])$ . Even if I'm right that the prefactor I have right now could be improved, I'd still expect at least a $\frac{1}{2}$ here.

I'd caution that the exact relation to the learning coefficient and the LLC is the part of this story I'm still the least confident about at the moment. As the intro said

This post is my current early-stage sketch of the proof idea. Don't take it too seriously yet. I’m writing this out mostly to organise my own thoughts.

I've since gotten proof sketches for most of the parts here, including the upper bound on the LLC, so I am a bit more confident now. But they're still hasty scrawlings.

you are treating the iid case

I am not sure whether I am? I'm a bit unclear on what you mean by iid in this context exactly. The setup does not seem to me to require different inputs $x$ to be independent of each other. It does assume that each label $y$ is a function of its corresponding input $x$ rather than some other input. So, label $y_{1}$ can depend on input $x_{1}$ , but it can only depend on $x_{2}$ in a manner mediated by $x_{1}$ . In other words, the joint probability distribution over inputs can be anything, but the labels must be iid conditioned on their inputs. I think. Is that what you meant?

From your message it seems like you think the global learning coefficient might be lower than $K$ , but that locally at a code the local learning coefficient might be somehow still to do with description length? So that the LLC in your case is close to something from AIT. That would be surprising to me, and somewhat in contradiction with e.g. the idea from simple versus short [LW · GW] that the LLC can be lower than "the number of bits used" when error-correction is involved (and this being a special case of a much broader set of ways the LLC could be lowered).

I have been brooding over schemes to lower the bound I sketched above using activation error-correction blocks. Still unclear to me at this stage whether this will work or not. I'd say this and the workability of other schemes to get rid of the $f$ prefactactor to $K (μ, M)$ in the bound are probably the biggest source of uncertainty about this at the moment.

If schemes like this work, the story here probably ends up as something more like ' $K (μ, M)$ is related to the number of bits in the parameters we need to fix to implement $μ$ on the transformer.'

In that case, you'd be right, and the LLC would be lower, because in the continuum limit we can store an arbitrary number of bits in a single parameter.

I think I went into this kind of expecting that to be true. Then I got surprised when using less than one effective parameter per bit of storage in the construction turned out to be less straightforward than I'd thought once I actually engaged with the details. Now, I don't know what I'll end up finding.

^{^}
Well, transformers are not actually Turing complete in real life where parameters aren't real numbers, because if you want an unbounded context window to simulate unbounded tape, you eventually run out of space for positional encodings. But the amount of bits they can hold in memory does grow exponentially with the residual stream width, which seems good enough to me. Real computers don't have infinite memory either.

comment by Linda Linsefors · 2025-02-13T14:37:41.022Z · LW(p) · GW(p)

The time and space bounded SI is then approximately optimal in the sense that its total error compared to this efficient predictor, as measured by KL-divergence from the predictions made by , will be $\leq | P^{*} |$ bits summed across all data points.^[7]

I think that classical Solomonoff induction gives zero posterior to any program with less than perfect prediction record? I can see why this works for Solomonoff with unbounded description length, this is solved by the DH(h) term you mention above.

But for bounded Solomonoff you have to allow some non-perfect programs to stay in the posterior, or you might be left with no candidates at all?

Is there an easy way to deal with this?

This is not a problem if the programs are giving probabilistic outputs, but that's a different class of programs than used in classical Solomonoff induction.

Replies from: Lblack

↑ comment by Lucius Bushnaq (Lblack) · 2025-02-13T14:44:24.601Z · LW(p) · GW(p)

I do use programs that give probabilistic outputs here. See claim 1 and the setup section.

I am fairly sure that there is a version of Solomonoff induction where the programs themselves output probabilities, and it's equivalent in the limit to the version where programs output binary answers. I think it's called 'stochastic' Solomonoff induction, or something like that.

I hadn't actually realised how load-bearing this might be for the proof until you pointed it out though. Thank you.

comment by Matthias Dellago (matthias-dellago) · 2025-02-11T15:43:20.574Z · LW(p) · GW(p)

Very nice! Alexander and I were thinking about this after our talk as well. We thought of this in terms of the kolmogorov structure function and I struggled with what you call Claim 3, since the time requirements are only bounded by the busybeaver number. I think if you accept some small divergence it could work, I would be very interested to see.

Replies from: Lblack

↑ comment by Lucius Bushnaq (Lblack) · 2025-02-11T16:03:56.530Z · LW(p) · GW(p)

For claim 3, I think we just want to assume that the process we are trying to predict doesn’t have time requirements that are too large for us to make a prediction we are happy with. I think this has to be an assumption about the data we make because it is just genuinely not true of many processes we can conceive of, and I don’t think deep learning would work to predict those processes. Many parts of the real world we care about just turn out to be the efficiently predictable.

Replies from: matthias-dellago

↑ comment by Matthias Dellago (matthias-dellago) · 2025-02-11T21:18:49.285Z · LW(p) · GW(p)

"Many parts of the real world we care about just turn out to be the efficiently predictable."

I had a dicussion about exactly these 'pockets of computational reducibility' today. Whether they are the same as the more vague 'natural abstractions', and if there is some observation selection effect going on here.

comment by Lucas Teixeira · 2025-02-24T20:02:52.894Z · LW(p) · GW(p)

Some proofs along these lines already exist [1,2], though they seem to have minor problems.

This is a good critical review of the literature.

Replies from: Lblack

↑ comment by Lucius Bushnaq (Lblack) · 2025-02-24T22:20:02.911Z · LW(p) · GW(p)

Yeah, the difference between what those papers show and what I need turned out to be a lot bigger than I thought. I ended up making my own construction instead.

This actually turned out to be the most time consuming part of the whole proof. The other steps were about as straightforward as they looked.

Replies from: Lucas Teixeira

↑ comment by Lucas Teixeira · 2025-02-25T01:47:18.481Z · LW(p) · GW(p)

Interesting. Curious to know what your construction ended up looking like and I'm looking forward to reading the resulting proof!

comment by Max Hennick (max-hennick-1) · 2025-02-11T20:17:41.857Z · LW(p) · GW(p)

So I have been playing around with some similar ideas but from a symbolic dynamics perspective. In that setting one can define a free energy over strings. There also exists a fractal dimension related to compressibility. The free energy and this fractal dimension are seemingly related through SI, and I suspect this fractal dimension is related to the LC. Happy to share if you think this might be useful for you.

Proof idea: SLT to AIT

Contents

Background:

Proof idea outline:

Setup: Predicting a stochastic process

Claims I want to prove:

Comments:

15 comments

Background

Continuous model classes

From AIT to SLT

Free parameters and learning coefficients

Learning coefficient vs K