Memorization-generalization in practice
post by Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-30
Short post today, which is part II.1 of my series on tempering and SLT (see part one here [LW · GW]). In this post I’ll explain in a bit more detail the “in practice” connection that experiments should see between the learning coefficient spectrum, tempering, and empirical measurements of the learning coefficient. In future installments of this part I’ll explain a bit of the theory behind this and how the predictions from information theory [LW · GW] have remarkable qualitative agreement with predictions from singular learning theory (with the caveat that in the SLT picture, each circuit is "fuzzy", and has a small continuous spectrum of its own). I'll then relate this picture to some notions inherent in the generalized “field theory” approach to modeling neural nets.
Practical measurements of the memorization-generalization spectrum
I’m trying to do less of the thing where I hide experimentally-relevant points behind a wall of theory, so let me try to explain the “upshots” of this part ahead of time, and talk about theory later (in future installments of part II of this series).
- Tempering is implemented in practice by sampling algorithms, usually variants of “SGLD” (stochastic gradient Langevin dynamics) in an ML context. As with usual SGD, there are various optimization protocols that make it more efficient. There is a whole science of how to check whether “sampling worked”, and sampling quality/best practice is an active area of research where the SLT crowd is making exciting progress. In my experience, sampling algorithms (at a minimum) work well for toy models, and agree with “expected results” when such expectations are known.
- Tempering works by gradually trading off performance for entropy (as will be explained below), in a way that is mathematically analogous to adding heat to a physical system. In practice, this means that tempering inductively “noises out” the least efficient circuits in a neural net, and it stops noising circuits when the increase in loss (compared to the initial fully-trained model) starts getting significantly higher than the “temperature” parameter.
- Tempering is a stochastic process. Often we’re interested in the “generic behavior” of a randomly selected tempered program (corresponding to running an experiment on a specific system at some fixed temperature). In other cases, we may be interested in expectation values over tempered programs, computed in practice by averaging the programs encountered in one or more “sampling traces”.
- The result of tempering can be read off of the “circuit efficiency [LW · GW]” spectrum, and conversely the spectrum of efficiencies (in the language of the “bucket of circuits” post these are the slopes, not the complete 2-dimensional data) can be read off of tempering measurements. The process of converting a “bucket of circuits” to a tempering prediction is as follows (with various modifications needed in various contexts):
- Consider a specific temperature t.
- Figure out the “log odds change” inherent in the loss. Note that this step is a little tricky and context-dependent; “generically” and in the high-data limit, it is given by $n(L - L_0)/t$, where $L_0$ is the loss of the fully-trained model and $n$ is the number of samples. Note that getting this function exactly right isn’t that important for experimentalists, as it is reasonable to instead manually tune the temperature until it puts you in a regime of interest.
- Inductively noise out the lowest-efficiency circuits until the total loss increase contributed by these circuits matches this value.
- The prediction for the tempered model is now the result of noising out these “inefficient circuits”. In particular, running an interpretability experiment on the tempered model should be expected to fail if it extracted information about the noised-out circuits, and succeed if it extracted information about surviving circuits.
- The learning coefficient can now be recovered from sampling the loss for tempered models[1].
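The sampling step underlying all of this can be sketched in a few lines. Below is a minimal SGLD sampler for a tempered posterior; the step size, localization strength, and step counts are illustrative choices, not tuned best practice, and the quadratic toy loss is just a sanity check:

```python
import numpy as np

def sgld_sample(grad_loss, w0, n_steps=10_000, eps=1e-4, beta=1.0, n=100,
                gamma=1.0, seed=0):
    """Minimal SGLD sketch targeting a tempered posterior
    p(w) ~ exp(-n * beta * L(w)), with a localizing term
    gamma * ||w - w0||^2 / 2 anchoring the chain near the trained weights w0.
    All hyperparameters here are illustrative."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    samples = []
    for _ in range(n_steps):
        # Langevin update: gradient drift plus Gaussian noise of matched scale.
        drift = n * beta * grad_loss(w) + gamma * (w - w0)
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)
        samples.append(w.copy())
    return np.array(samples)

# Toy check on a quadratic loss L(w) = ||w||^2 / 2 (so grad_loss(w) = w);
# the chain should fluctuate tightly around the minimum w = 0.
traj = sgld_sample(lambda w: w, w0=np.zeros(2))
```

For a quadratic loss the stationary distribution is a narrow Gaussian around the minimum, with width set by the temperature; checking that the chain's empirical mean and variance match this expectation is one of the simplest "did sampling work" diagnostics.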
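The noising recipe above can also be sketched as code. This is a toy model of the "bucket of circuits" picture (circuit names and numbers are made up for illustration): each circuit carries a complexity and a loss improvement, its efficiency is their ratio, and at temperature t a circuit is noised out when the entropy gained from freeing its parameters outweighs the loss it was buying, i.e. when its efficiency falls below t:

```python
def tempering_prediction(circuits, t):
    """circuits: list of (name, complexity, loss_improvement) tuples.
    Efficiency = loss_improvement / complexity (the "slope" from the
    bucket-of-circuits diagrams).  Noise out circuits in order of
    increasing efficiency while efficiency < t, i.e. while the loss
    cost of removing the circuit is below the temperature's entropy
    budget per unit complexity."""
    noised, kept = [], []
    for name, k, dl in sorted(circuits, key=lambda c: c[2] / c[1]):
        if dl / k < t:
            noised.append(name)  # entropy gain ~ t*k outweighs loss cost dl
        else:
            kept.append(name)
    return noised, kept

# Illustrative circuits: (name, complexity, loss improvement).
toy = [("memorize_pt_1", 5.0, 0.5),   # efficiency 0.1
       ("parity_head",   2.0, 1.0),   # efficiency 0.5
       ("main_circuit",  4.0, 8.0)]   # efficiency 2.0
print(tempering_prediction(toy, t=0.3))
# at t=0.3 only the low-efficiency memorization circuit is noised out
```

An interpretability probe run on the tempered model at t=0.3 would then be predicted to still find the parity and main circuits, but fail on anything that relied on the memorization component.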
The recipe in point 4 above can be reversed to extract the circuit efficiency spectrum from empirical measurements of the tempering process. So if you have a trained neural net, to get the "spectrum of slopes" you should run the following process.
- Vary temperature t and measure the resulting learning coefficient function (roughly, the variance in loss at fixed temperature).
- The "bucket of circuits" picture predicts that, as you vary t on a logarithmic scale for a simple or toy algorithm, your learning coefficients will mostly be flat, with phase transitions at "new circuit discovery" temperatures -- or more precisely, temperatures where the tempering process has noised out all circuits below a fixed efficiency level and needs to jump up to the next efficiency.
- The length of each flat stretch as you vary the temperature then measures the total (log) loss improvement of the set of circuits at that efficiency. Thus one can model "longer" stretches at a fixed learning coefficient as efficiency values that are "more populated" (i.e., correspond to more, or bigger, circuits). This is exactly analogous to a physics measurement of the intensity in the frequency spectrum of photon emission in an experiment. In this analogy, each frequency of emission comes in tandem with an "intensity" parameter[2], and higher-intensity emission lines imply that there are more particles emitting at this frequency (i.e., more circuits working at this efficiency).
- Note that as I explained, this process only picks up the "efficiencies," i.e., slopes, from the two parameters (complexity and loss improvement) of the "bucket of circuits" diagrams. If you want to also capture the complexity of constituent circuits, one way to get a glimpse of this is to capture the spectrum of slopes at different points in the training process of the model[3]. In a simplified picture, circuits that appear earlier in the training process have lower complexity; so the two scales of (temperature, training time) nonlinearly map onto the two measurements (loss reduction, complexity) of circuits. However, while I expect efficiency measurements for toy models to be relatively clean and to correctly group circuits by efficiency, the correspondence between "complexity" and "training time" is messier (as I explained last time, it's "thermodynamic" rather than "thermostatic").
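The extraction process above can be cartooned as follows: scan the measured learning-coefficient curve over a logarithmic temperature grid, detect jumps (phase transitions), and report each plateau's level together with its length as the "intensity" of that efficiency. The jump threshold and the synthetic two-plateau curve are hypothetical stand-ins for real measurements:

```python
import numpy as np

def spectrum_from_plateaus(log_t, lam_hat, jump_thresh=0.5):
    """Toy extraction of the efficiency spectrum from a measured
    learning-coefficient curve lam_hat over log-temperatures log_t.
    A jump in lam_hat marks a phase transition (a new efficiency level);
    the length of each flat stretch measures how populated that level is."""
    jumps = np.where(np.abs(np.diff(lam_hat)) > jump_thresh)[0] + 1
    edges = np.concatenate([[0], jumps, [len(log_t)]])
    spectrum = []
    for a, b in zip(edges[:-1], edges[1:]):
        level = float(np.mean(lam_hat[a:b]))     # learning coeff on the plateau
        weight = float(log_t[b - 1] - log_t[a])  # plateau length ~ "intensity"
        spectrum.append((level, weight))
    return spectrum

# Synthetic curve: two plateaus separated by one phase transition.
log_t = np.linspace(-3, 1, 9)
lam = np.array([1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0, 3.0])
print(spectrum_from_plateaus(log_t, lam))
# two (level, weight) pairs: one per plateau
```

In the emission-spectrum analogy, each returned pair is one spectral line: the level is the frequency and the plateau length is its brightness.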
Note that "classically", singular learning theory has considered only one temperature scale: namely, the scale that is "just above the memorization scale". As such, in my picture, the learning coefficient at this scale captures exactly information about "the least efficient circuit being implemented that has higher efficiency than memorization". In physics parlance, the process of looking at the "smallest nontrivial point in a spectrum" is usually called capturing the "spectral gap", and as such classic SLT measurements live exactly at the spectral gap of the spectrum of efficiencies.
Sweeping through the full temperature scale should recover more of this spectrum. I'll explain later in this part how one can use field theory ideas to further split the spectrum to distinguish equally-efficient circuits and perhaps start probing some phenomena beyond circuit efficiency.
- ^
Roughly: tempering means we ask that the loss be within about t of its optimal value, and the learning coefficient measures the variance. If improving the loss is very entropically expensive, then tempered NNs will be “very resistant” to letting the loss rise above its minimal allowed value, and this variance will be small. Note that for the conceptual cartoons I’m blurring out the difference between so-called “microcanonical” and “canonical” quantities, and real tempering has “soft” exponential cutoffs rather than exact “loss bounded by this value”-style effects.
- ^
In a classic experimental setup where you split light into frequencies with a prism, intensity is literally "how bright" the corresponding line is.
- ^
Note that there are some issues with carefully modeling tempering during pretraining that SLT has tricks to get around.