Proof idea: SLT to AIT
post by Lucius Bushnaq (Lblack) · 2025-02-10T23:14:24.538Z
I think we may be able to prove that Bayesian learning on transformers[1] or recurrent neural networks with a uniform[2] prior over parameters is equivalent to a form of Solomonoff induction over a set of computationally-bounded programs. This bounded Solomonoff induction would still be 'approximately optimal' in a sense, being able to predict the data about as well as any other bounded prediction procedure included in the set of programs it runs over. This proof would link Singular Learning Theory (SLT) back to basic Algorithmic Information Theory (AIT).
This post is my current early-stage sketch of the proof idea. Don't take it too seriously yet. I’m writing this out mostly to organise my own thoughts. I'd originally planned for it to be a shortform, but I think it ended up a bit too long for that.
Background:
I recently held a small talk presenting an idea for how and why deep learning generalises. Slides for the talk here, slide discussion here.
In the talk, I tried to reduce concepts from SLT[3] back to AIT[4]. I sketched a story about deep learning, or perhaps even learning more generally, that goes like this:
- Bayesian learning on (recurrent) neural networks is equivalent to a form of Solomonoff induction running over a set of programs bounded in length, runtime and memory usage.
- Using SGD/genetic algorithms/your-fancy-update-method-of-choice to train a neural network is then a cheap bargain-bin[5] approximation of Bayesian learning on the neural network. Training steps are biased to make simple updates rather than complex updates because exponentially more parameter configurations in the architecture correspond to simpler programs (a toy illustration follows this list).
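As a quick sanity check on the degeneracy intuition in the second point (this toy experiment is mine, not from the talk, and all constants are arbitrary), one can sample discretised weights for a tiny network and count how many parameter settings realise each boolean function of two inputs:

```python
import itertools
from collections import Counter

import numpy as np

# Toy check of the degeneracy story (my example, not from the talk): sample
# discretised weights for a tiny 2-2-1 tanh network and count how many
# parameter settings realise each boolean function of two inputs.

GRID = np.linspace(-2.0, 2.0, 7)                  # coarse grid per weight/bias
INPUTS = list(itertools.product([0, 1], repeat=2))
rng = np.random.default_rng(0)

def net_output(params, x):
    w1 = np.array(params[0:4]).reshape(2, 2)      # hidden-layer weights
    b1 = np.array(params[4:6])                    # hidden-layer biases
    w2 = np.array(params[6:8])                    # output weights
    b2 = params[8]                                # output bias
    return float(w2 @ np.tanh(w1 @ np.array(x) + b1) + b2)

counts = Counter()
for _ in range(50_000):                           # sample rather than enumerate all 7^9 grid points
    params = rng.choice(GRID, size=9)
    truth_table = tuple(int(net_output(params, x) > 0) for x in INPUTS)
    counts[truth_table] += 1

for table, n in counts.most_common(5):
    print(table, n)                               # the most frequent truth tables tend to be the simple ones
```

On runs like this, constant and single-input truth tables typically account for most of the samples while functions like XOR are rare, which is the kind of volume asymmetry the argument above relies on.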
Now, I want to actually prove this story. Specifically, I want to prove the first part: That Bayesian learning on transformers or RNNs is equivalent to a computationally bounded form of Solomonoff Induction (SI), in a sense I want to make precise. I also want to show that this bounded SI is a sensible approximation of actual unbounded SI, given some assumption of tractable predictability for the data. That is, we assume that it is possible at all to predict the data to good-enough accuracy with a limited computational budget. See e.g. these posts [1, 2, 3] if you want some idea of why we might think that this is a reasonable assumption to make for a lot of real-world data.[6]
In the following, I’ll sketch a skeleton of the proof as I envision it right now, dividing it up into individual claims that seem tractable to prove to me. I’ve only just started on this, so some of the details here are probably wrong. Other details of the setup might just turn out to be hard to work with and get changed out for something better.
Proof idea outline:
Setup: Predicting a stochastic process
We want to learn to predict some stochastic process from inputs $x$ to outputs $y$. We'll model the process as some function $f$ that takes two inputs: one input $x$ that we get to see, and one input $h$ that we don't get to see, and returns one output $y$. $h$ is included to represent potential randomness in the process. It is drawn from some distribution with entropy $H(h)$.
Specifically, we define $y = f(x, h)$, where $f$ is a function, $x \in \{0,1\}^{n}$ and $h \in \{0,1\}^{m}$ are tuples of $n$ and $m$ booleans respectively, and $y \in \{0,1\}$ is a single boolean. We get to observe many pairs $(x_i, y_i)$, $i = 1, \dots, D$.
Our prediction for $y$ will be probabilistic. Without loss of generality, we will make our prediction by naming the probability $p$ that $y$ is a $1$ rather than a $0$. We’re about to think about this with programs, so we’ll restrict our $p$ to be a floating point number of some finite precision.
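As a concrete toy instance of this setup (the particular $f$, bit-widths and names below are placeholders I picked, not anything from the post):

```python
import numpy as np

# Minimal sketch of the setup in my reconstructed notation: y = f(x, h), with
# x an observed tuple of booleans, h an unobserved tuple of booleans whose
# entropy H(h) is the irreducible noise, and y a single boolean. The learner
# only ever sees the (x, y) pairs.

rng = np.random.default_rng(0)

def f(x: np.ndarray, h: np.ndarray) -> int:
    """Placeholder process: parity of x, flipped when all hidden bits are 1."""
    return int((x.sum() + int(h.all())) % 2)

def sample_dataset(num_points: int, n_obs: int = 4, n_hidden: int = 2):
    """Draw D pairs (x_i, y_i); h_i is sampled but never shown to the learner."""
    data = []
    for _ in range(num_points):
        x = rng.integers(0, 2, size=n_obs)
        h = rng.integers(0, 2, size=n_hidden)
        data.append((x, f(x, h)))
    return data

dataset = sample_dataset(num_points=10)
print(dataset[:3])
```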
Claims I want to prove:
Claim 1: The ‘best’ way to make this prediction is to run a version of Solomonoff induction. By best, I mean we get a guarantee about the total error of our prediction à la Solomonoff completeness. Specifically, we take a Universal Turing Machine (UTM) $U$, which accepts the input $x$ and some program that outputs bits representing the probability $p$. We then run Solomonoff induction (SI) over the set of all such programs to make our prediction. The total error of our prediction in terms of KL-divergence, measured in bits and summed across all $D$ data points, should then be bounded by $K_U(f) + D \cdot H(h)$, where $K_U(f)$ is the length of the shortest program that implements $f$ on the UTM $U$ and $H(h)$ is the entropy of our random variable $h$.
Proof status: This is just a slight variation of the textbook SI setup I am familiar with. I hope a proof of this already exists.
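As a computable stand-in for the claim 1 mixture (actual Solomonoff induction is uncomputable, and the program set and lengths below are made up for illustration), here is the bookkeeping I have in mind: programs output probabilities, get prior weight $2^{-\text{length}}$, and the Bayesian mixture's total log loss exceeds that of any program in the set by at most that program's length in bits.

```python
import numpy as np

# Finite stand-in for the Claim 1 mixture: each "program" maps the observed x
# to a probability that y = 1, and gets prior weight 2^(-length). The mixture
# then obeys the usual dominance bound: its total log loss exceeds that of any
# program q in the set by at most l_q bits. Program lengths here are made up.

def make_programs():
    # (name, length_in_bits, predictor: x -> P(y = 1))
    return [
        ("always-half", 2, lambda x: 0.5),
        ("parity-ish",  9, lambda x: 0.75 if x.sum() % 2 else 0.25),
        ("first-bit",   5, lambda x: 0.9 if x[0] else 0.1),
    ]

def mixture_log_losses(dataset, programs):
    """Return (mixture log loss, per-program log losses), all in bits."""
    weights = np.array([2.0 ** -length for _, length, _ in programs])
    weights /= weights.sum()
    mix_loss = 0.0
    prog_loss = np.zeros(len(programs))
    for x, y in dataset:
        preds = np.array([predict(x) for _, _, predict in programs])
        probs = preds if y == 1 else 1.0 - preds   # probability each program gave the observed y
        mix_loss += -np.log2(float(weights @ probs))
        prog_loss += -np.log2(probs)
        weights = weights * probs                  # Bayesian posterior update
        weights /= weights.sum()
    return mix_loss, prog_loss

rng = np.random.default_rng(1)
dataset = [(rng.integers(0, 2, size=4), int(rng.integers(0, 2))) for _ in range(50)]
mix, per_prog = mixture_log_losses(dataset, make_programs())
print(f"regret vs best program: {mix - per_prog.min():.2f} bits")
```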
Claim 2: Solomonoff induction over length-bounded UTMs can still make approximately optimal predictions if $f$ is simple. Say the function $f$ has a description length of $K_U(f)$ bits on our UTM $U$. If we know or heavily suspect that $K_U(f) \le l$, we can run a version of SI that excludes programs above $l$ bits in length, and the result will still be very close to optimal in the sense described in claim 1. This length-bounded version would use a UTM that accepts plain program codes, not prefix-free program codes. It’d run over all $2^{l}$ possible programs of length $l$, and assign a uniform prior of $2^{-l}$ to every program.
Proof status: I’d guess this proof might also exist somewhere already. If not, I think we can show it? See the discussion here and here. Also, I’ve since been pointed to the central lemma I think you’d want for this in this book on pages 145-146.
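To make the intended bound explicit, here is the dominance inequality I'd expect claim 2 to reduce to, written in my reconstructed notation (the symbols are my guesses, not the post's originals): with a uniform prior $2^{-l}$ over all plain codes of length $l$, the mixture's cumulative log loss exceeds that of any program $q$ of length at most $l$ (padded out to $l$ bits) by at most $l$ bits.

```latex
% Dominance bound for the length-bounded, uniform-prior mixture (sketch).
% P_p(y_{1:D} | x_{1:D}) is the probability program p assigns to the observed
% outputs; q is any program of length at most l, padded to exactly l bits.
\[
  -\log_2 \sum_{p \in \{0,1\}^{l}} 2^{-l}\, P_p(y_{1:D}\mid x_{1:D})
  \;\le\; -\log_2 \left( 2^{-l}\, P_q(y_{1:D}\mid x_{1:D}) \right)
  \;=\; l \;-\; \log_2 P_q(y_{1:D}\mid x_{1:D}).
\]
```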
Claim 3: If we further limit our Solomonoff induction to only include programs of runtime at most $t$ and space at most $s$, our prediction can still be approximately optimal in a sense. For this, we will assume that $f$ is ‘efficiently predictable’. By ‘efficiently predictable’, we mean that there is some program $q$ of length $l_q \le l$ that could be instantiated on $U$, requiring at most space $s$ and runtime $t$, which makes predictions that we consider ‘good-enough’ for our purposes. That is, the expected KL-divergence of its predictions from the actual distribution of $y$ is small enough that we would be happy with it. The time and space bounded SI is then approximately optimal in the sense that its total error compared to this efficient predictor, as measured by KL-divergence from the predictions made by $q$, will be at most $l$ bits summed across all data points.[7]
Proof status: Seems doable? We can just walk through the same style of argument used to show Solomonoff completeness for unbounded SI: Our SI will catch up to any such $q$ about as fast as programs simpler than $q$ are excluded by the training data.
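Continuing the toy from claim 1 (the names and budgets below are hypothetical, mine rather than the post's), the time/space restriction just filters the program set before running the same mixture:

```python
# Hedged sketch of Claim 3's bookkeeping: keep only programs within the time
# budget, then reuse the Claim 1 dominance bound relative to whichever
# "efficient predictor" q survives the filter.

TIME_BUDGET_STEPS = 1_000

candidate_programs = [
    # (name, length_in_bits, steps_needed, predictor: x -> P(y = 1))
    ("efficient-predictor", 12,      800, lambda x: 0.8 if x[0] else 0.2),
    ("slow-oracle",         20, 10 ** 12, lambda x: 1.0),   # dropped: too slow
    ("always-half",          2,        1, lambda x: 0.5),
]

bounded_set = [
    (name, length, predict)
    for name, length, steps, predict in candidate_programs
    if steps <= TIME_BUDGET_STEPS
]
# Feeding bounded_set into mixture_log_losses from the Claim 1 sketch gives a
# total regret against "efficient-predictor" of at most ~12 bits, however long
# the data stream gets, provided that predictor really is in the bounded set.
```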
Claim 4: Our bounded Solomonoff induction is not very sensitive to the UTM we run it on, just as in regular Solomonoff induction. If we use another UTM $U'$ such that $U$ and $U'$ can each simulate the other at a cost of at most $c$ bits, bounded SI on $U'$ restricted to programs $l + c$ bits long or less, with runtime $t$ and space $s$ (up to the simulation overhead), will still be approximately optimal in the sense specified in claim 3, though our total error across all data compared to the efficient predictor $q$ will now be bounded by $l + c$ bits instead of $l$ bits.
Proof status: I think we can just use existing results on efficient simulation to show this?
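The step I'd expect claim 4 to reduce to, again in my reconstructed notation (so the symbols are guesses): if $U$ can be run on $U'$ via an interpreter of $c$ bits, every program covered by the claim 3 bound costs at most $c$ extra bits on $U'$.

```latex
% Porting the efficient predictor q across UTMs (sketch, my notation):
% an interpreter for U on U' of c bits lengthens every U-program by at most c.
\[
  l_q^{U'} \;\le\; l_q^{U} + c
  \quad\Longrightarrow\quad
  \text{(total regret of bounded SI on } U' \text{ vs. } q \text{, in bits)} \;\le\; l + c .
\]
```

The runtime and space budgets would also pick up whatever overhead the simulation incurs, which is the part the existing efficient-simulation results would need to control.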
Claim 5: Transformers and RNNs are equivalent to a bounded UTM for our purposes. Consider e.g. a transformer in inference mode, with context window $n_{\text{ctx}}$ and $W$ parameters. At the end of each forward pass, its output is fed back into its input layer. This happens for a total of $\tau$ forward passes. On the first forward pass, it is passed the input data $x$, padding the rest of the input with zeroes. On the final forward pass, its output is interpreted as the probability $p$. This transformer is equivalent to a bounded UTM accepting programs of length $W$ (measured in parameters), with a maximum runtime of $\tau$ forward passes, and a space bound determined by $n_{\text{ctx}}$ and the residual stream width.
Proof status: Some proofs along these lines already exist [1,2], though they seem to have minor problems. Mostly, I'm peeved that they use hardmax instead of softmax in the attention. This seems like a choice made for purely technical reasons though: they just didn't want a function that doesn't return rational numbers, because that's annoying for their proof.
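As a cartoon of the correspondence claim 5 asserts (my own sketch, not one of the cited constructions): a fixed network applied recurrently for a bounded number of forward passes, with the parameters playing the role of the program, the pass count the runtime bound, and the hidden state the memory bound. All sizes and names here are invented.

```python
import numpy as np

# Hedged cartoon of Claim 5: run a fixed network recurrently for tau forward
# passes, feeding its output back into its input. The parameters theta play
# the role of the program, tau the runtime bound, and the state size the
# space bound.

rng = np.random.default_rng(0)

N_STATE = 16      # stand-in for context window x residual stream width
TAU = 8           # number of forward passes = runtime bound

theta = {
    "W": rng.normal(scale=0.3, size=(N_STATE, N_STATE)),  # the "program"
    "b": rng.normal(scale=0.1, size=N_STATE),
    "readout": rng.normal(size=N_STATE),
}

def forward_pass(state: np.ndarray, theta) -> np.ndarray:
    return np.tanh(theta["W"] @ state + theta["b"])

def run_as_bounded_program(x_bits: np.ndarray, theta, tau: int = TAU) -> float:
    """Return the predicted probability that y = 1 for input bits x."""
    state = np.zeros(N_STATE)
    state[: len(x_bits)] = x_bits          # first pass sees x, rest zero-padded
    for _ in range(tau):                   # hard runtime cap
        state = forward_pass(state, theta)
    logit = theta["readout"] @ state
    return float(1.0 / (1.0 + np.exp(-logit)))   # output read off as P(y = 1)

print(run_as_bounded_program(np.array([1, 0, 1, 1]), theta))
```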
Claim 6: Therefore, Bayesian learning on a transformer or RNN is a form of bounded Solomonoff induction, with the approximate optimality properties of the bounded induction outlined in claims 3 and 4. Since SLT[3] describes Bayesian learning on neural networks, this would show that SLT describes a form of Solomonoff induction.
Proof status: Immediately follows if we prove the other five claims.
If anyone points out errors or confusions in any of these claims or proof ideas, I'd appreciate it a lot.
Comments:
On the meaning of the learning coefficient: Since the SLT[3] posterior would now be proven equivalent to the posterior of a bounded Solomonoff induction, we can read off how the (empirical) learning coefficient $\lambda$ in SLT relates to the posterior in the induction, up to a conversion factor equal to the floating point precision of the network parameters.[8] This factor is there because SLT works with real numbers whereas AIT[4] works with bits. Also, note that for non-recurrent neural networks like MLPs, this proof sketch would suggest that the learning coefficient is related to something more like circuit complexity than program complexity. So, the meaning of $\lambda$ from an AIT perspective depends on the network's architecture. It's (sort of)[9] K-complexity-related for something like an RNN or a transformer run in inference mode, and more circuit-complexity-related for something like an MLP.
On real world efficiency: This would be an AIT[4]-style proof. None of this says very much about efficiency in the real world. It is not telling us whether we ought to use transformers or RNNs or Mamba for maximum efficiency, ReLUs or sigmoids, or anything of the kind. It's just showing, starting from an AIT perspective, why anything in a superset that includes all of these options would be able to do learning at all.
On wider applicability: While the proof will be for deep learning on recurrent neural networks, I think the implications here could be much broader than that. I think this story of how learning is possible at all might also apply to in-context learning, and perhaps even many alternate setups outside of ‘deep learning’ that have not been invented yet.
Thanks to Stefan Heimersheim for proofreading and comments. Thanks to Dmitry Vaintrob, Kaarel Hänni and Linda Linsefors for discussion. Thanks again to all of these people. Thanks to everyone else at Apollo who endured my rants about this.
- ^
Transformers run in inference mode. So, imagine something like RL training, but with Bayesian updating instead of gradient descent.
- ^
I don't think using a Gaussian or some other typical NN initialisation scheme instead would change much, but I plan to prove it for a uniform distribution first because that seems more straightforward.
- ^
Singular Learning Theory
- ^
Algorithmic Information Theory
- ^
Some things Bayesian learning can do, SGD just can't do. See e.g. the parity learning problem.
- ^
Also, observe how a lot of the real world does not seem like completely unpredictable TV static to your computationally limited brain.
- ^
Credit to the logical induction paper for inspiring this idea. Seeing their proof that LIA isn’t exploitable by other traders that run in polynomial time is what got me thinking about this.
- ^
Ok, really, $\lambda$ should probably be related to the posterior of the induction on a UTM by first dividing it by the floating point precision of the neural network parameters, then letting that floating point precision tend to infinity.
- ^
Only sort-of related because our prior only has support on programs bounded in runtime, space and description length, not on all programs. Also, K-complexity and (semi-)measure in the Solomonoff prior aren't identical even in the conventional unbounded version of Solomonoff induction.
6 comments
comment by Linda Linsefors · 2025-02-13T14:37:41.022Z
"The time and space bounded SI is then approximately optimal in the sense that its total error compared to this efficient predictor, as measured by KL-divergence from the predictions made by $q$, will be at most $l$ bits summed across all data points.[7]"
I think that classical Solomonoff induction gives zero posterior to any program with a less-than-perfect prediction record? I can see why this works for Solomonoff with unbounded description length; this is solved by the $D \cdot H(h)$ term you mention above.
But for bounded Solomonoff you have to allow some non-perfect programs to stay in the posterior, or you might be left with no candidates at all?
Is there an easy way to deal with this?
This is not a problem if the programs are giving probabilistic outputs, but that's a different class of programs than used in classical Solomonoff induction.
↑ comment by Lucius Bushnaq (Lblack) · 2025-02-13T14:44:24.601Z
I do use programs that give probabilistic outputs here. See claim 1 and the setup section.
I am fairly sure that there is a version of Solomonoff induction where the programs themselves output probabilities, and it's equivalent in the limit to the version where programs output binary answers. I think it's called 'stochastic' Solomonoff induction, or something like that.
I hadn't actually realised how load-bearing this might be for the proof until you pointed it out though. Thank you.
comment by Matthias Dellago (matthias-dellago) · 2025-02-11T15:43:20.574Z
Very nice! Alexander and I were thinking about this after our talk as well. We thought of this in terms of the Kolmogorov structure function, and I struggled with what you call Claim 3, since the time requirements are only bounded by the busy beaver number. I think if you accept some small divergence it could work; I would be very interested to see.
↑ comment by Lucius Bushnaq (Lblack) · 2025-02-11T16:03:56.530Z
For claim 3, I think we just want to assume that the process we are trying to predict doesn’t have time requirements that are too large for us to make a prediction we are happy with. I think this has to be an assumption about the data we make because it is just genuinely not true of many processes we can conceive of, and I don’t think deep learning would work to predict those processes. Many parts of the real world we care about just turn out to be the efficiently predictable.
↑ comment by Matthias Dellago (matthias-dellago) · 2025-02-11T21:18:49.285Z
"Many parts of the real world we care about just turn out to be the efficiently predictable."
I had a discussion about exactly these 'pockets of computational reducibility' today. Whether they are the same as the more vague 'natural abstractions', and if there is some observation selection effect going on here.
comment by Max Hennick (max-hennick-1) · 2025-02-11T20:17:41.857Z
So I have been playing around with some similar ideas but from a symbolic dynamics perspective. In that setting one can define a free energy over strings. There also exists a fractal dimension related to compressibility. The free energy and this fractal dimension are seemingly related through SI, and I suspect this fractal dimension is related to the LC. Happy to share if you think this might be useful for you.