Low Temperature Solomonoff Induction
post by dil-leik-og (samuel-buteau) · 2024-12-06T18:55:08.948Z · LW · GW · 3 comments
Supposing that agentic hypotheses are more complex than non-agentic ones, is it possible to reduce the impact of the agentic ones by penalizing complexity more?
Consider a hypothetical hypercomputer capable of running Solomonoff induction. Rather than using it to generate plans directly, we want to predict risks in plans generated by other processes (use it as a "guardrail"). Assume that within Solomonoff induction's hypothesis space, all 'agentic' hypotheses take the form of 'world model' + 'goal specification', and that adding the goal specification is very simple - perhaps only 1 bit beyond the complexity of faithful world models (themselves many bits long). While having ¼ or ⅛ of the posterior composed of these agentic hypotheses would be problematic, a much smaller proportion would be acceptable. What happens if we increase the cost of complexity in the prior, making each bit reduce mass by a factor of 4 or 8 or more? This would drive the influence of agentic hypotheses relative to faithful world models down to a very small proportion. We call this operation 'lowering the temperature': the prior of a program becomes proportional to $2^{-L/T}$, with $L$ the length of the program in bits and $T$ the temperature parameter. Normal Solomonoff induction is the special case where $T=1$.
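As a quick numerical sketch of the temperature knob: the only quantity that matters for the relative odds is the length difference between the two hypotheses (1 bit, as assumed above); the temperatures below are just illustrative values.

```python
# Toy illustration: prior odds of an "agentic" hypothesis (world model + 1 extra
# bit of goal specification) vs. the faithful world model, as a function of
# temperature. Normalization cancels in the ratio, so only the length
# difference matters.

delta_L = 1.0  # assumed extra bits needed to bolt a goal onto the world model

for T in [1.0, 0.5, 0.25, 0.01]:
    log2_odds = -delta_L / T  # prior ∝ 2^(-L/T), so odds = 2^(-ΔL/T)
    print(f"T = {T:<5} agentic : faithful = 2^{log2_odds:g} ≈ {2.0**log2_odds:.3e}")
```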
Unifying Complexity and Multiplicity (Version 1)
See Unifying Complexity and Multiplicity (Version 2) if you find Version 1 too rambly. (Version 1 is a bit rambly but traces the rabbit hole better; Version 2 addresses the weaknesses of Version 1.)
When measuring computational simplicity, it is tempting to separate Kolmogorov complexity (the length of the shortest program) from multiplicity (the number of programs that implement the computation). However, in algorithmic information theory it is common to reject this separation entirely in favor of a more natural quantity: $-\log_2 \sum_{p \in P} 2^{-L(p)}$, where $P$ is the set of all programs implementing the computation and $L(p)$ is the length of program $p$ in bits - aka the (negative) log of the Solomonoff prior mass.[1]
This unification reveals something fundamental: both measurements count the fraction of program-space implementing our computation. A short program claims a large fraction by leaving more bits unspecified. Multiple programs claim fractions through their union. The distinction dissolves - we're always measuring how many ways a computation could be implemented.
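A small sketch of the unified quantity, using made-up program lengths (10 and 14 bits): one short program and many slightly longer ones can occupy exactly the same fraction of code-space, and the quantity above cannot tell them apart.

```python
import math

def effective_complexity(lengths_in_bits):
    """-log2 of the Solomonoff prior mass of a set of programs."""
    mass = sum(2.0 ** (-L) for L in lengths_in_bits)
    return -math.log2(mass)

# One program of length 10 bits...
single_short = [10]
# ...versus 16 distinct programs of 14 bits each (16 * 2^-14 = 2^-10).
many_longer = [14] * 16

print(effective_complexity(single_short))  # 10.0
print(effective_complexity(many_longer))   # 10.0 -- same code-space fraction
```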
Many approaches start by assuming Occam's Razor - that simpler theories are more likely. But this view reveals Occam's Razor as emerging naturally. Simpler theories aren't privileged by assumption - they're more likely because they literally occupy larger fractions of the possible ways a computational world could be, and it is possible to value simplicity too much if this reduces the fraction of code space we cover.
This rests on two premises: our future observations must be computable (some program could generate them), and we are ignorant of which program is actually running.
Given our ignorance of which program is running, we should be maximally uncertain—but how do we formalize uncertainty over programs with arbitrary trailing tokens? Let's examine this carefully.
Imagine an absurdly long file for writing Python code, with an end-of-file (EOF) marker that can be placed anywhere - everything after it is ignored during execution. In this purely deterministic setting, maximum entropy means being uniform over all possible contents. Each potential sequence of characters is equally likely, making every program-plus-ignored-suffix equally probable.
For clarity, we'll specify: random seeds go after the EOF marker. Since generating $N$ truly random bits requires an $N$-bit seed anyway, programs might as well read their 'random' bits directly from this post-EOF sequence.
With unbounded RAM and runtime, programs could need arbitrarily many random bits. We must either allow infinite files (capturing all possible computable worlds) or use finite but enormous files (excluding some theoretically possible but practically irrelevant computations). We'll choose the infinite option.
To avoid dealing with valid/invalid ASCII Python code, we'll switch to binary code for a universal Turing machine. Our interpreter accepts any infinite bitstring, treating everything after the delimiter as the randomness source.
Using delimiters ensures no program is a prefix of another, but it's inefficient. We can achieve the same prefix-free property by making programs self-delimiting: for instance, encoding a program $x$ by prefixing it with its length (written in unary) followed by a zero bit. This is a crude scheme, but it demonstrates the concept.
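Here is a minimal sketch of one such crude scheme (the helper names `encode` and `decode_prefix`, and the unary-length convention, are ours, chosen only to illustrate the idea):

```python
def encode(program_bits: str) -> str:
    """Crude self-delimiting code: length in unary, a 0 separator, then the program."""
    return "1" * len(program_bits) + "0" + program_bits

def decode_prefix(stream: str) -> tuple[str, str]:
    """Read one self-delimited program off the front of a (long) bitstring.
    Returns (program, remaining bits) -- the remainder serves as the randomness source."""
    length = stream.index("0")            # count the leading 1s
    start = length + 1
    return stream[start:start + length], stream[start + length:]

code = encode("10110")
print(code)                               # 11111010110
program, random_bits = decode_prefix(code + "0110100111")
print(program, random_bits)               # 10110 0110100111
```

No encoded program is a prefix of another under this scheme, which is all the prefix-free property requires; the price is roughly doubling each program's length, which is why it counts as crude.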
When we talk about 'a program that uses randomness,' we're really describing a set of infinite bitstrings - all possible ways that program could execute with different random bits. While this includes the 'unlucky' bitstring where every coin flip is heads, such cases are rare. What matters is the behavior of the entire set, not individual members.
In summary, the infinite bitstring framework: each program with its delimiter prefixes some region of infinite bitstring space, with these regions never overlapping due to the delimiter property. This property enables a clean formalization: in prefix-free codes, no program is a prefix of another, and you can't add new programs without breaking this property. Each program of length $L$ gets probability $2^{-L}$, summing to 1 across all programs - exactly equivalent to putting a uniform distribution over infinite bitstrings and measuring how many start with each prefix.
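A quick numerical check of this equivalence, using a toy four-codeword prefix-free code (the code and sample size are arbitrary choices for illustration): the $2^{-L}$ weights sum to 1, and each program prefixes the expected fraction of uniformly random bitstrings.

```python
import random

# A complete prefix-free code over {0,1}: no codeword is a prefix of another,
# and no new codeword can be added without breaking that property.
programs = ["0", "10", "110", "111"]

kraft_sum = sum(2.0 ** (-len(p)) for p in programs)
print(kraft_sum)  # 1.0

# Uniformly random "infinite" bitstrings (truncated to 10 bits): the fraction
# starting with a given program matches its 2^-L prior mass.
random.seed(0)
N = 200_000
samples = ["".join(random.choice("01") for _ in range(10)) for _ in range(N)]
for p in programs:
    frac = sum(s.startswith(p) for s in samples) / N
    print(p, round(frac, 4), 2.0 ** -len(p))
```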
When we discuss maximum entropy over infinite bitstrings, we're not claiming there's some cosmic hard drive storing them all, nor that every program runs in parallel (though that would make our framework rather self-justifying). We're making a simpler claim: our observations follow some computational process, we don't know which one, and proper uncertainty means being maximally uncertain over all computable possibilities. Our normalized prior and maximum entropy requirement yield a simple formula: each prefix of length $L$ has probability $2^{-L}$, up to a constant factor based on our choice of universal Turing machine.
Why Max-Entropy?
~Directly from 'A Semitechnical Introductory Dialogue on Solomonoff Induction' [1]
Maximum entropy represents having no extra knowledge beyond the basic structure of program space. When our prior differs from the distribution generating observations, we can quantify the cost through a new measure: the error budget.
For any hypothesis and prior, we can compute an error budget. In Solomonoff Induction (SI), each program of length $L$ gets prior weight $2^{-L}$. When a 100MB program assigns twice the likelihood to some observation compared to SI, that's one bit of error - the evidence doubles the relative weight we assign to that program. SI can only make a bounded number of such errors, as each shifts probability mass from SI-minus-program to the program. With an initial ratio of $1 : 2^{-L}$, we're constrained by this 100MB budget regardless of when the errors occur.
When we sample hypotheses from some distribution, we can take the expected error budget of these hypotheses under our prior. This expectation is minimized when the prior matches the sampling distribution. Since we have no extra knowledge about which programs are more likely to be running the universe, we model hypotheses as coming from the maximum entropy distribution - and thus use it as our prior. One might hope to do better by increasing the prior of every hypothesis, but probability mass is a finite resource in Bayesian updating, and multiplying the prior by a constant does not change the relative error budgets.
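A toy illustration of the error-budget bound (the 30-bit prior complexity is a hypothetical number): each observation where a program assigns twice the likelihood the rest of the mixture does moves its log-odds up by one bit, so the total number of such errors is capped by its prior deficit.

```python
# A program with prior 2^-L starts at log2-odds of -L against the rest of the
# mixture. Every observation on which it assigns 2x the mixture's likelihood
# ("one bit of error" for the mixture) raises the log-odds by exactly 1 bit,
# so at most ~L such errors can occur before the program dominates.

L = 30                    # hypothetical prior complexity of the program, in bits
log2_odds = -float(L)     # program vs. everything-else, in bits

errors = 0
while log2_odds < 0:      # while the program is still the underdog
    log2_odds += 1.0      # one factor-of-2 likelihood advantage
    errors += 1

print(errors)             # 30 == L: the error budget
```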
Why is the Fair coin Hypothesis “simple”?
In updating our beliefs, we work with regions of infinite bitstring space defined by programs. Our predictions shouldn't depend on how we choose to group these regions. At temperature 1, we can either filter infinite bitstrings directly or use likelihood ratios for sets of bitstrings - the math works out identically. A program representing 70% ones and 30% zeros yields the same predictions whether we treat it as one region or split it into subregions.
Even without built-in probabilistic outputs, a fair coin naturally emerges from code-space. A program that reads and outputs the next bit from its random sequence outputs heads 50% of the time. While each individual outcome requires specifying a long bitstring, collectively these bitstrings occupy a large fraction of total code-space, making fair coins "simple" in our framework.
Making "P(heads)=50%" a language primitive is reasonable precisely because fair coins already occupy a large fraction of code-space. The primitive isn't creating simplicity - it's reflecting an inherent property of our computational framework.
Why Physical laws are symmetrical?
Physical laws exhibit remarkable symmetries - invariance across coordinate systems, gauge transformations, and other complex mathematical transformations. This pattern emerges naturally when we view these laws as occupying regions of code-space through multiplicity [2]. Different expressions of the same physical law, using different coordinate systems or gauge choices, are all valid programs producing identical observations. If we only counted the shortest description, we'd expect a single "optimal" representation lacking these symmetries.
A physical law's representation in different programming languages should occupy equivalent regions of code-space, differing only by a constant factor based on Universal Turing Machine choice. Changing temperature breaks this equivalence - it arbitrarily weights implementations differently despite their representing the same underlying law.
Importantly, when we modify temperature, we're not cleanly separating simplicity from multiplicity. What appears as "simplicity" in our language is already an inseparable mixture of both in true code-space, and likewise for what appears as "multiplicity". We're arbitrarily treating one mixture differently from another.
Why Physics Ignores Compute Costs?
Consider quantum computing: if large-scale quantum computers are possible (as most interpretations suggest), then the universe performs computation far beyond what's needed for photorealistic simulations. This hints at something profound - within observable scales, the laws of physics show no sign of trading computational resources for simpler descriptions. Whatever code-space vs compute tradeoff exists appears negligible.
When there are truly no compute costs, simplicity is actually close to code-space volume
Physics provides striking examples of symmetries over massive objects, like gauge symmetry, where the 'naive programs' implementing physical laws must make specific choices. These programs need to set perhaps $N$ bits to pick a gauge, though the choice is largely arbitrary - say 10% of the possible settings yield equivalent physics (with small adjustments needed between gauges).
In a world without compute costs, we can convert this multiplicity into simplicity with only a constant overhead (a fixed number of extra bits, independent of $N$) through a remarkable trick:
- Loop over all possible values of the symmetry-specifying bits
- Sort the output sequences by how many bit-settings produce them (if 10% of the settings yield the same sequence, i.e. the correct one, there can be at most 100/10 = 10 sequences ranked before the one we want)
- Index into your tally to retrieve the correct sequence and output that
If you somehow knew not only that 10% of the settings gave the correct answer, but also that there were only 10 other answers, you might not need to sort in order to guarantee a small index. But in general, the wrong settings could all give different answers, which would leave $0.9 \times 2^N$ elements to index into.
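A toy version of the construction (the gauge setup, the 10-bit $N$, and the `naive_program` stand-in are all invented for illustration): enumerate every setting of the symmetry bits, tally which output each produces, sort outputs by tally, and store only the short index of the correct one.

```python
from collections import Counter

# Toy "gauge" setup: N symmetry bits, of which ~10% of settings yield the same
# (correct) output sequence; the rest yield assorted wrong sequences.
N = 10

def naive_program(gauge_bits: int) -> str:
    """Stand-in for a naive implementation that must pick a concrete gauge."""
    if gauge_bits % 10 == 0:             # ~10% of settings agree: the real physics
        return "correct-output"
    return f"wrong-output-{gauge_bits}"  # wrong settings can all disagree

# The trick: enumerate all 2^N settings, tally outputs, sort by multiplicity.
tally = Counter(naive_program(g) for g in range(2 ** N))
ranked = [seq for seq, _ in tally.most_common()]

index = ranked.index("correct-output")
print(index)        # 0 here; in general < 100/10 = 10, so ~log2(10) bits to store
print(len(ranked))  # vs. up to 0.9 * 2^N wrong outputs we never need to index past
```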
This yields three distinct complexity levels:
- The base code-space complexity
- Our construct at only a constant number of bits above that
- Naive implementations at roughly $N$ bits above that
Our construct, while computationally absurd, demonstrates how multiplicity can be converted to simplicity when compute is free. It's effectively implementing a finite version of Solomonoff induction within a single program. The code-space volume is still slightly underestimated, but only by a constant number of bits.
This only works for exact equivalences - it breaks down for probabilistic outputs or infinite symmetry specifications. But it highlights the gap between intuitive simplicity and true code-space volume when we ignore compute costs entirely.
Why Prefix Convention Doesn't Matter (Until We Break Things)
Consider modifying our prefix convention by adding random bits before each program's delimiter. This doesn't change code-space fractions - while each extra bit costs a factor of 2 in probability, it also adds two possibilities, balancing out perfectly (½ + ½ = 1).
But this invariance shatters when we change temperature. Our results become dependent on arbitrary choices: do '100000' and '100001' count as separate programs or variants of '10000'? We can't even calculate effects without first specifying such arbitrary groupings.
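A quick numerical sketch of both points (the 5-bit length is a toy value matching the example strings above): splitting a program into its two one-bit extensions preserves total mass at $T = 1$ but not at $T \neq 1$.

```python
def mass(length_bits: int, temperature: float) -> float:
    """Unnormalized prior mass 2^(-L/T) of a program of the given length."""
    return 2.0 ** (-length_bits / temperature)

L = 5  # e.g. '10000' vs. its two one-bit extensions '100000' and '100001'

for T in [1.0, 0.5]:
    whole = mass(L, T)
    split = mass(L + 1, T) + mass(L + 1, T)
    print(f"T={T}: one program {whole:.6f}  vs  two variants {split:.6f}")
# At T=1 the two match (½ + ½ of the region); at T=0.5 the split loses mass,
# so predictions now depend on an arbitrary choice of how to group programs.
```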
Why UTM Choice Matters (But Only Constantly)
Maximum entropy doesn't give perfect invariance across Universal Turing Machines (UTMs). But switching UTMs only costs us a constant bit penalty - the overhead of implementing the new UTM in our current one. This penalty is fixed regardless of what we're learning.
This provides a test: any effect that can't be reduced to a constant bit penalty must come from something deeper than UTM choice. It's a fundamentally larger distortion. At low temperatures ($T \to 0$), even this constant penalty becomes severe, growing as $c/T$ for a constant overhead of $c$ bits.
The Core Problem
Complexity and multiplicity are the same quantity - they both measure code-space occupation. When we modify temperature, we're not separating them cleanly - we're arbitrarily weighting different aspects of code-space based on superficial features of our chosen language. The resulting behavior becomes undefined, varying with each change in program representation.
The Quantitative Impact
Lowering the temperature to $T$ appears to increase SI's error budget relative to a length-$L$ program from $L$ bits to $L/T$ bits. But this ignores renormalization: since total probability must remain 1, we actually get a redistribution effect. Simpler hypotheses become easier to learn while complex ones become harder.
After renormalization, what remains fixed is the relative penalty between complexity levels: hypotheses with complexity difference $\Delta L$ have relative probability $2^{-\Delta L/T}$. As temperature changes become extreme, renormalization's effects dominate: at very low $T$, mass concentrates in the simplest hypotheses; at very high $T$, mass spreads more uniformly across complexity levels.
This has practical consequences: a dataset that contained just enough signal to learn the correct hypothesis at temperature $T$ may fail completely at temperature $T/2$. The same evidence becomes insufficient.
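A toy posterior calculation of this effect (the lengths of 100 and 120 bits and the 25-bit likelihood advantage are made-up numbers): with the same evidence, the complex-but-correct hypothesis wins at $T = 1$ and loses at $T = 1/2$.

```python
# Two hypotheses: a simple wrong one and a more complex correct one.
# Hypothetical numbers: the correct hypothesis is 20 bits longer, and the data
# give it a 2^25-fold likelihood advantage -- enough at T=1, not at T=1/2.
L_simple, L_correct = 100, 120
log2_likelihood_advantage = 25.0

for T in [1.0, 0.5]:
    log2_prior_gap = (L_correct - L_simple) / T  # prior ∝ 2^(-L/T)
    log2_posterior_odds = log2_likelihood_advantage - log2_prior_gap
    winner = "correct" if log2_posterior_odds > 0 else "simple (wrong)"
    print(f"T={T}: posterior log2-odds (correct vs simple) = {log2_posterior_odds:+.0f} -> {winner}")
```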
Breaking the Guardrail
Temperature 0 doesn't selectively eliminate dangerous hypotheses - it breaks potentially all balanced sets. Consider electromagnetism: different coordinate systems represent the same physical law, but in any language some representations will be simpler than others. At temperature 0, we arbitrarily privilege certain perspectives. This extends to randomness: if "heads" implementations happen to be simpler than "tails" implementations, zero temperature distorts the probabilities.
For statements orthogonal to our training data, temperature 1 SI might reasonably assign 50% probability. At temperature 0, the assignment becomes arbitrary, depending on how the posterior's code-space mass splits between simplicity and multiplicity. The system becomes confidently wrong about things it should be uncertain about (or uncertain about things it should be confident about, depending on language details).
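A sketch of this calibration failure with invented program lengths (50 and 51 bits): suppose the "tails" side of the posterior happens to sit in one slightly shorter program while the "heads" side sits in two slightly longer ones, so both sides carry equal code-space mass. At $T = 1$ the prediction is 50/50; as $T$ drops it collapses toward whichever side contains the single shortest program.

```python
def prediction(heads_lengths, tails_lengths, T):
    """P(heads) when the posterior mass is re-weighted as 2^(-L/T)."""
    heads = sum(2.0 ** (-L / T) for L in heads_lengths)
    tails = sum(2.0 ** (-L / T) for L in tails_lengths)
    return heads / (heads + tails)

# Invented example: 'tails' has one 50-bit program; 'heads' has two 51-bit
# programs (equal total code-space volume, different simplicity/multiplicity split).
tails_programs = [50]
heads_programs = [51, 51]

for T in [1.0, 0.5, 0.1]:
    print(f"T={T}: P(heads) = {prediction(heads_programs, tails_programs, T):.4f}")
    # 0.5000, then 0.3333, then ~0.0019 as T shrinks
```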
Out-of-distribution behavior becomes dangerously quirky. Instead of averaging over many similar programs to smooth out idiosyncrasies, temperature 0 commits to a single perspective. Even if the temperature 1 posterior contains the correct interpretation of our goals, the temperature 0 version might lock onto a wrong interpretation that happens to have the right simplicity-multiplicity balance.
Different safety properties pull us in opposite directions. Lower temperature might help separate agentic from non-agentic hypotheses, but it destroys our ability to make well-calibrated risk assessments. We have no formal guarantee that any temperature satisfies all our safety requirements simultaneously.
This undermines the core purpose of a safety guardrail. Risk assessments become meaningless - a 5% risk at temperature T might appear much smaller at T/2, not because hypotheses change their predictions, but because we're selectively silencing those that would raise concerns. Like other approaches that truncate the posterior (such as compute limits), we lose crucial information needed to catch potential problems. And for a guardrail to fail catastrophically, it need only be wrong about one important fact.
One might worry that as $T \to 0$, if some knowledge exists only in one "expert" hypothesis and some other knowledge exists only in another "expert" hypothesis, we would lose part of the knowledge by choosing between them (according to which has the most simplicity). However, in the absence of compute costs, there will be a program that contains both "experts" and knows better than chance when to call each, such that this program would eventually (i.e. with enough train-time errors) take the influence away from both of these experts.
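A toy sketch of this "combined experts" point (the per-observation losses, the gate's error rate, and the 50-bit length penalty are all invented): the combined program pays a fixed prior cost, but its per-observation log-loss advantage over either lone expert accumulates, so after enough observations it dominates.

```python
# Invented setup: expert A is right on half the questions, expert B on the other
# half; a combined program carries both plus a gate that usually picks correctly.
extra_bits_for_combination = 50   # hypothetical extra program length, in bits

# Per-observation log2-loss (hypothetical): each lone expert is confidently
# wrong half the time (-3 bits on those); the gated combination errs 10% of the time.
loss_expert = 0.5 * 0.0 + 0.5 * 3.0    # expected bits lost per observation
loss_combined = 0.1 * 3.0

advantage_per_obs = loss_expert - loss_combined     # 1.2 bits per observation
observations_needed = extra_bits_for_combination / advantage_per_obs
print(observations_needed)  # ≈ 42 observations before the combination dominates
```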
Unifying Complexity and Multiplicity (Version 2)
Let's start by introducing a particular framework for Solomonoff Induction and then explore how we arrived at that framework and how it enables us to see complexity and multiplicity as two sides of the same coin. The framework considers the space of all infinite bitstrings, where programs are finite prefixes encoded in a prefix-free way (meaning no valid program is a prefix of another valid program). Solomonoff Induction operates on this space of infinite bitstrings (i.e. it starts with a prior over this space and does Bayesian updating on the observations), with programs defining regions within it.
Why set things up in this way? If we think about programs more generally, we could imagine an absurdly long file for writing Python code, with an end-of-file (EOF) marker that can be placed anywhere - everything after it is ignored during execution (this is a property of Python). If we want our program to be non-deterministic, e.g. it involves flipping a coin, we need random seeds. To keep our Python code short and simple, we'll put the random seeds after the EOF marker; the program can read its bits of randomness from this post-EOF sequence. This is a neat convention where two program executions with the same "code" but different random seeds are parts of the same region ("program") of code space.
Next, noticing that we need to allow infinitely long files in order to cover all possible computable worlds, and that most Python code written at random in ASCII is not valid code, we decide to transport ourselves to the world of binary and work with infinite bitstrings. To view sets of these bitstrings as programs we include a way of identifying where the program ends and where the random bits begin, either through delimiters or self-delimiting encoding schemes.
When we talk about 'programs that use randomness,' we're really describing regions in infinite bitstring space - all possible ways that program could execute with different random bits. While this includes the 'unlucky' bitstring where every coin flip is heads, such cases are rare. What matters is the behavior of the entire set of possible executions, not individual members.
How does all this relate to complexity and multiplicity? When measuring computational simplicity, it is tempting to separate Kolmogorov complexity (the length of the shortest program) from multiplicity (the number of programs that implement the computation). However, in algorithmic information theory it is common to reject this separation entirely in favor of a more natural quantity, the (negative) log of the Solomonoff prior mass, i.e. $-\log_2 \sum_{p \in P} 2^{-L(p)}$, where $P$ is the set of all programs implementing the computation and $L(p)$ is the length of program $p$ in bits.
This unification reveals something fundamental: both measurements count the fraction of program-space implementing our computation. A short program claims a large fraction by leaving more bits unspecified. Multiple programs claim fractions through their union. The distinction dissolves - we're always measuring how many ways a computation could be implemented.
We're now able to justify the Solomonoff prior by appealing to maximum uncertainty over all computable possibilities. This means treating all infinite bitstrings that could represent valid computations as equally likely a priori, which naturally leads to higher prior mass for shorter programs. In this view, Occam's Razor emerges naturally. Simpler theories aren't privileged by assumption - they're more likely because they literally occupy larger fractions of the possible ways a computational world could be, and it is possible to value simplicity too much if this reduces the fraction of code space we cover. Our normalized prior and maximum entropy requirement yield a simple formula: each prefix of length $L$ has probability $2^{-L}$, up to a constant factor based on our choice of universal Turing machine.
[1] https://www.lesswrong.com/posts/EL4HNa92Z95FKL9R2/a-semitechnical-introductory-dialogue-on-solomonoff-1 [LW · GW]
[2] https://www.lesswrong.com/posts/KcvJXhKqx4itFNWty/k-complexity-is-silly-use-cross-entropy-instead [LW · GW]
3 comments
comment by Charlie Steiner · 2024-12-09T03:51:54.327Z · LW(p) · GW(p)
Temperature 0 is also sometimes a convenient mathematical environment for proving properties of Solomonoff induction, as in Li and Vitanyi (pdf of textbook).
comment by dil-leik-og (samuel-buteau) · 2024-12-11T15:09:45.898Z · LW(p) · GW(p)
On a quick skim of the pdf I was not able to identify which passage you were referring to. If possible, can you point me to an example of Temperature 0 in the textbook?
comment by Charlie Steiner · 2024-12-12T01:38:28.237Z · LW(p) · GW(p)
Sorry, on my phone for a few days, but iirc in ch. 3 they consider the loss you get if you just predict according to the simplest hypothesis that matches the data (and show it's bounded).