Re: the SLT dogma.
For those interested, a continuous version of the padding argument is used in Theorem 4.1 of Clift-Murfet-Wallbridge to show that the learning coefficient is a lower bound on the Kolmogorov complexity (in a sense) in the setting of noisy Turing machines. Just take the synthesis problem to be given by a TM's input-output map in that theorem. The result is treated in a more detailed way in Waring's thesis (Proposition 4.19). Noisy TMs are of course not neural networks, but they are a place where the link between the learning coefficient in SLT and algorithmic information theory has already been made precise.
For what it's worth, as explained in simple versus short, I don't actually think the local learning coefficient is algorithmic complexity (in the sense of program length) in neural networks, only that it is a lower bound. So I don't really see the LLC as a useful "approximation" of the algorithmic complexity.
For those wanting to read more about the padding argument in the classical setting, Hutter-Catt-Quarel "An Introduction to Universal Artificial Intelligence" has a nice detailed treatment.
Typo, I think you meant singularity theory :p
Modern mathematics is less about solving problems within established frameworks and more about designing entirely new games with their own rules. While school mathematics teaches us to be skilled players of pre-existing mathematical games, research mathematics requires us to be game designers, crafting rule systems that lead to interesting and profound consequences.
I don't think so. This probably describes the kind of mathematics you aspire to do, but still the bulk of modern research in mathematics is in fact about solving problems within established frameworks and usually such research doesn't require us to "be game designers". Some of us are of course drawn to the kinds of frontiers where such work is necessary, and that's great, but I think this description undervalues the within-paradigm work that is the bulk of what is going on.
It might be worth knowing that some countries are participating in the "network" without having formal AI safety institutes.
I hadn't seen that Wattenberg-Viegas paper before, nice.
Yeah actually Alexander and I talked about that briefly this morning. I agree that the crux is "does this basic kind of thing work" and given that the answer appears to be "yes" we can confidently expect scale (in both pre-training and inference compute) to deliver significant gains.
I'd love to understand better how the RL training for CoT changes the representations learned during pre-training.
My observation from the inside is that size and bureaucracy in universities have something to do with what you're talking about, but more to do with a kind of "organisational overfitting", where small variations in the organisation's experience that included negative outcomes are responded to with internal processes that necessitate headcount (aligning the incentives for the response with what you're talking about).
I think self-repair might have lower free energy, in the sense that if you had two configurations of the weights which "compute the same thing", but one of them has self-repair for a given behaviour and one doesn't, then the one with self-repair will have lower free energy (which is just a way of saying that if you integrate the Bayesian posterior in a neighbourhood of both, the one with self-repair gives you a higher number, i.e. it's preferred).
That intuition is based on some understanding of what controls the asymptotic (in the dataset size) behaviour of the free energy (which is -log(integral of posterior over region)) and the example in that post. But to be clear it's just intuition. It should be possible to empirically check this somehow but it hasn't been done.
Basically the argument is self-repair => robustness of behaviour to small variations in the weights => low local learning coefficient => low free energy => preferred
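For reference, the asymptotic formula behind this intuition (standard SLT, stated loosely and in my notation rather than anything specific to that post) is

$$ F_n(W) = -\log \int_W e^{-n L_n(w)}\, \varphi(w)\, dw \;\approx\; n L_n(w^*) + \lambda(w^*) \log n, $$

so of two regions achieving the same loss, the one with the lower local learning coefficient $\lambda$ has lower free energy and hence more posterior mass.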
I think by "specifically" you might be asking for a mechanism which causes the self-repair to develop? I have no idea.
It's a fascinating phenomenon. If I had to bet I would say it isn't a coping mechanism but rather a particular manifestation of a deeper inductive bias of the learning process.
In terms of more subtle predictions: in the Berkeley Primer in mid-2023, based on elementary manipulations of the free energy formula, I predicted that we should see phase transitions / developmental stages where the loss stays relatively constant but the LLC (model complexity) decreases.
We noticed one such stage in the language models, and two in the linear regression transformers, in the developmental landscape paper. We only partially understood them there, but we've seen more behaviour like this in the upcoming work I mentioned in my other post, and we feel more comfortable now linking it to phenomena like "pruning" in developmental neuroscience. This suggests some interesting connections with loss of plasticity (i.e. we see many components have LLC curves that go up, then come down, and one would predict that after this decrease the components are more resistant to being changed by further training).
These are potentially consequential changes in model computation that are (in these examples) arguably not noticeable in the loss curve, and it's not obvious to me how you would notice them with confidence from other metrics you had thought to track (in each case they might correlate with something, say the magnitude of layer norm weights, but out of the thousands of things you could measure it's unclear to me why you would a priori associate any one such signal with a change in model computation unless you knew it was linked to the LLC curve). Things like the FIM trace or Hessian trace might also reflect the change; however, in the second such stage in the linear regression transformer (LR4) this seems not to be the case.
I think that's right, in the sense that this explains a large fraction of our difference in views.
I'm a mathematician, so I suppose in my cosmology we've already travelled 99% of the distance from the upper reaches of the theory stratosphere to the ground and the remaining distance doesn't seem like such an obstacle, but it's fair to say that the proof is in the pudding and the pudding has yet to arrive.
If SLT were to say nontrivial things about what instruction fine-tuning and RLHF are doing to models, and those things were verified in experiments, would that shift your skepticism?
I've been reading some of your other writing:
However, we think that absent substantial advances in science, we're unlikely to develop approaches which substantially improve safety-in-practice beyond baseline methods (e.g., training with RLHF and applying coup probes) without the improvement being captured by black-box control evaluations. We might discuss and argue for this in more detail in a follow-up post.
Could you explain why you are skeptical that current baseline methods can be dramatically improved? It seems possible to me that the major shortcomings of instruction fine-tuning and RLHF (that they seem to make shallow changes to representations and computation) are not fundamental. Maybe it's naive because I haven't thought about this very hard, but from our point of view representations "mature" over development and become rather rigid; however, maybe there's something like Yamanaka factors!
Even from the perspective of black-box control, it seems that as a practical matter one could extract more useful work if the thing in the box is more aligned, and thus it seems you would agree that fundamental advances in these baseline methods would be welcome.
Incidentally, I don't really understand what you mean by "captured by black-box control evaluations". Was there a follow-up?
The case for singular learning theory (SLT) in AI alignment is just the case for Bayesian statistics in alignment, since SLT is a mathematical theory of Bayesian statistics (with some overly restrictive hypotheses in the classical theory removed).
At a high level the case for Bayesian statistics in alignment is that if you want to control engineering systems that are learned rather than designed, and if that learning means choosing parameters that have high probability with respect to some choice of dataset and model, then it makes sense to understand what the basic structure of that kind of Bayesian learning is (I’ll put aside the potential differences between SGD and Bayesian statistics, since these appear not to be a crux here). I claim that this basic structure is not yet well-understood, that it is nonetheless possible to make fundamental progress on understanding it at both a theoretical and empirical level, and that this understanding will be useful for alignment.
The learning process in Bayesian statistics (what Watanabe and we call the singular learning process) is fundamental, and applies not only to training neural networks, but also to fine-tuning and to in-context learning. In short, if you expect deep learning models to be "more optimal" over time, and for example to engage in more sophisticated kinds of learning in context (which I do), then you should expect that understanding the learning process in Bayesian statistics should be even more highly relevant in the future than it is today.
One part of the case for Bayesian statistics in alignment is that many questions in alignment seem to boil down to questions about generalisation. If one is producing complex systems by training them to low loss (and perhaps also throwing out models that have low scores on some safety benchmark) then in general there will be many possible configurations with the same low loss and high safety scores. This degeneracy is the central point of SLT. The problem is: how can we determine which of the possible solutions actually realises our intent?
The problem is that our intent is either not entirely encoded in the data, or we cannot be sure that it is, so that questions of generalisation are arguably central in alignment. In present day systems, where alignment engineering looks like shaping the data distribution (e.g. instruction fine-tuning) then a precise form of this question is how models generalise from the (relatively) small number of demonstrations in the fine-tuning dataset.
It therefore seems desirable to have scalable empirical tools for reasoning about generalisation in large neural networks. The learning coefficient in SLT is the obvious theoretical quantity to investigate (in the precise sense that two solutions with the same loss will be differently preferred by the Bayesian posterior, with the one that is “simplest” i.e. has lower learning coefficient, being preferred). That is what we have been doing. One should view the empirical work Timaeus has undertaken as being an exercise in validating that learning coefficient estimation can be done at scale, and reflects real things about networks (so we study situations where we can independently verify things like developmental stages).
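To make "learning coefficient estimation can be done at scale" concrete, here is a minimal sketch of an SGLD-based LLC estimator in the spirit of the quantifying degeneracy paper (Lau et al.). The function names, hyperparameters and toy model below are mine (this is not the Timaeus codebase), and the sketch assumes the network has already been trained so that its current parameters serve as the centre $w^*$.

```python
# Sketch of local learning coefficient (LLC) estimation via SGLD.
# hat_lambda = n * beta * ( E_w[L_n(w)] - L_n(w*) ), beta = 1 / log(n),
# with the expectation over a localized tempered posterior sampled by SGLD.
# All names and hyperparameters are illustrative, not a reference implementation.

import copy
import math
import torch
import torch.nn as nn

def estimate_llc(model, loss_fn, data, targets, num_steps=2000, lr=1e-4,
                 gamma=100.0, batch_size=64):
    """Estimate the LLC at the current parameters w* of `model` (loss as a stand-in
    for the average negative log-likelihood)."""
    n = len(data)
    beta = 1.0 / math.log(n)

    with torch.no_grad():
        loss_star = loss_fn(model(data), targets).item()  # loss at the centre w*

    sampler = copy.deepcopy(model)                         # run the chain on a copy
    w_star = [p.detach().clone() for p in model.parameters()]

    running_loss, count = 0.0, 0
    for _ in range(num_steps):
        idx = torch.randint(0, n, (batch_size,))
        batch_loss = loss_fn(sampler(data[idx]), targets[idx])
        sampler.zero_grad()
        batch_loss.backward()

        with torch.no_grad():
            for p, p_star in zip(sampler.parameters(), w_star):
                # SGLD step: drift towards low loss (scaled by n*beta),
                # a pull back towards w*, plus Gaussian noise.
                drift = -n * beta * p.grad - gamma * (p - p_star)
                p.add_(0.5 * lr * drift + math.sqrt(lr) * torch.randn_like(p))

            # Track the full-data loss along the chain (no burn-in handling here).
            running_loss += loss_fn(sampler(data), targets).item()
            count += 1

    return n * beta * (running_loss / count - loss_star)

if __name__ == "__main__":
    torch.manual_seed(0)
    X = torch.randn(1024, 10)
    y = X @ torch.randn(10, 1) + 0.1 * torch.randn(1024, 1)
    net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

    # In practice w* is the result of training; a few quick steps for illustration.
    opt = torch.optim.SGD(net.parameters(), lr=1e-2)
    for _ in range(500):
        opt.zero_grad()
        nn.functional.mse_loss(net(X), y).backward()
        opt.step()

    print("LLC estimate:", estimate_llc(net, nn.MSELoss(), X, y))
```

The localising term $\gamma (w - w^*)$ is what keeps the chain near $w^*$ and makes the estimate local rather than global.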
Naturally the plan is to take that tool and apply it to actual problems in alignment, but there’s a limit to how fast one can move and still get everything right. I think we’re moving quite fast. In the next few weeks we’ll be posting two papers to the arXiv:
- G. Wang, J. Hoogland, S. van Wingerden, Z. Furman, D. Murfet “Differentiation and Specialization in Language Models via the Restricted Local Learning Coefficient” introduces the weight and data-restricted LLCs and shows that (a) attention heads in a 3M parameter transformer differentiate over training in ways that are tracked by the weight-restricted LLC, (b) some induction heads are partly specialized to code, and this is reflected in the data-restricted LLC on code-related tasks, (c) attention heads follow the pattern that their weight-restricted LLCs first increase then decrease, which appears similar to the critical periods studied by Achille-Rovere-Soatto.
- L. Carroll, J. Hoogland, D. Murfet "Retreat from Ridge: Studying Algorithm Choice in Transformers using Essential Dynamics" studies the retreat-from-ridge phenomenon following Raventós et al and resolves the mystery of apparent non-Bayesianism there, by showing that over training for an in-context linear regression problem there is a tradeoff between in-context ridge regression (a simple but high-error solution) and another solution more specific to the dataset (which is more complex but lower error). This gives an example of the "accuracy vs simplicity" tradeoff made quantitative by the free energy formula in SLT (spelled out just below).
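For readers who want that tradeoff spelled out (loosely, in my notation): comparing a region $W_1$ around a simple-but-less-accurate solution with a region $W_2$ around a complex-but-more-accurate one, the free energy formula gives

$$ F_n(W_1) - F_n(W_2) \;\approx\; n\,(L_1 - L_2) + (\lambda_1 - \lambda_2)\log n, $$

so for small $n$ the lower-$\lambda$ solution can be preferred by the posterior despite its higher loss, while for large $n$ the lower-loss solution wins. This is the sense in which the accuracy vs simplicity tradeoff is quantitative.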
Your concerns about phase transitions (there being potentially too many of them, or this being a bit of an ill-posed framing for the learning process) are well-taken, and indeed these were raised as questions in our original post. The paper on restricted LLCs is basically our response to this.
I think you might buy the high level argument for the role of generalisation in alignment, and understand that SLT says things about generalisation, but wonder if that ever cashes out in something useful. Obviously I believe so, but I'd rather let the work speak for itself. In the next few days there will be a Manifund page explaining our upcoming projects, including applying the LLC estimation techniques we have now validated to things like safety fine-tuning and deceptive alignment in the setting of the "sleeper agents" work.
One final comment. Let me call "inductive strength" the number of empirical conclusions you can draw from some kind of evidence. I claim that the inductive strength of fundamental theory validated in experiments is far greater than that of experiments not grounded in theory; the ML literature is littered with the corpses of one-off experiments plus stories that go nowhere. In my mind this is not what a successful science and engineering practice of AI alignment looks like.
The value of the empirical work Timaeus has done to date largely lies in validating the fundamental claims made by SLT about the singular learning process, and seeing that it applies to systems like small language models. To judge that empirical work by the standard of other empirical work divorced from a deeper set of claims, i.e. purely by “the stuff that it finds”, is to miss the point (to be fair we could communicate this better, but I find it sounds antagonistic written down, as it may do here).
I think scaffolding is the wrong metaphor. Sequences of actions, observations and rewards are just more tokens to be modeled, and if I were running Google I would be busy instructing all work units to start packaging up such sequences of tokens to feed into the training runs for Gemini models. Many seemingly minor tasks (e.g. app recommendation in the Play store) either have, or could have, components of RL built into the pipeline, and could benefit from incorporating LLMs, either by putting the RL task in-context or through fine-tuning of very fast cheap models.
So when I say I don't see a distinction between LLMs and "short term planning agents" I mean that we already know how to subsume RL tasks into next token prediction, and so there is in some technical sense already no distinction. It's a question of how the underlying capabilities are packaged and deployed, and I think that within 6-12 months there will be many internal deployments of LLMs doing short sequences of tasks within Google. If that works, then it seems very natural to just scale up sequence length as generalisation improves.
Arguably fine-tuning a next-token predictor on action, observation, reward sequences, or doing it in-context, is inferior to using algorithms like PPO. However, the advantage of knowledge transfer from the rest of the next-token predictor's data distribution may more than compensate for this on some short-term tasks.
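As a concrete illustration of "just more tokens" (a toy sketch; the serialisation format and names are invented for illustration, not a description of any actual pipeline):

```python
# Toy sketch: flattening RL-style trajectories into text for next-token prediction.
# The tag format below is invented purely for illustration.

from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    observation: str
    action: str
    reward: float

def serialize_trajectory(steps: List[Step]) -> str:
    """Turn a trajectory into a single string a language model can be fine-tuned on."""
    chunks = []
    for t, s in enumerate(steps):
        chunks.append(f"<obs {t}> {s.observation}")
        chunks.append(f"<act {t}> {s.action}")
        chunks.append(f"<rew {t}> {s.reward:.2f}")
    return "\n".join(chunks)

if __name__ == "__main__":
    traj = [
        Step("user opened app store, likes puzzle games", "recommend puzzle game A", 1.0),
        Step("user dismissed the recommendation", "recommend word game B", 0.0),
    ]
    print(serialize_trajectory(traj))
```

Once trajectories look like this, fine-tuning on them, or conditioning on them in-context, is just ordinary next-token prediction.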
I think this will look a bit outdated in 6-12 months, when there is no longer a clear distinction between LLMs and short term planning agents, and the distinction between the latter and LTPAs looks like a scale difference comparable to GPT2 vs GPT3 rather than a difference in kind. At what point do you imagine a national government saying "here but no further"?
I don't recall what I said in the interview about your beliefs, but what I meant to say was something like what you just said in this post, apologies for missing the mark.
Mumble.
Indeed the integrals in the sparse case aren't so bad https://arxiv.org/abs/2310.06301. I don't think the analogy to the Thomson problem is correct, it's similar but qualitatively different (there is a large literature on tight frames that is arguably more relevant).
Haha this is so intensely on-brand.
This kind of superficial linear extrapolation of trend lines can be powerful, perhaps more powerful than is usually accepted in many political/social/futurist discussions. In many cases, successful forecasters betting on high-level trend lines outpredict "experts".
But it's a very non-gears-level model. I think one should be very careful about using this kind of reasoning for tail events.
e.g. this kind of reasoning could lead one to reject development of nuclear weapons.
Agree. In some sense you have to invent all the technology before the stochastic process of technological development looks predictable to you, almost by definition. I'm not sure it is reasonable to ask general "forecasters" about questions that hinge on specific technological change. They're not oracles.
Do you mean the industry labs will take people with MSc and PhD qualifications in CS, math or physics etc and retrain them to be alignment researchers, or do you mean the labs will hire people with undergraduate degrees (or no degree) and train them internally to be alignment researchers?
I don't know how OpenAI or Anthropic look internally, but I know a little about Google and DeepMind through friends, and I have to say the internal incentives and org structure don't strike me as really a very natural environment for producing researchers from scratch.
I think many early-career researchers in AI safety are undervaluing PhDs.
I agree with this. To be blunt, it is my impression from reading LW for the last year that a few people in this community seem to have a bit of a chip on their shoulder Re: academia. It certainly has its problems, and academics love nothing more than pointing them out to each other, but you face your problems with the tools you have, and academia is the only system for producing high quality researchers that is going to exist at scale over the next few years (MATS is great, I'm impressed by what Ryan and co are doing, but it's tiny).
I would like to see many more academics in CS, math, physics and adjacent areas start supervising students in AI safety, and more young people go into those PhDs. Also, more people with PhDs in math and physics transitioning to AI safety work.
One problem is that many of the academics who are willing to supervise PhD students in AI safety or related topics are evaporating into industry positions (subliming?). There are also long run trends that make academia relatively less attractive than it was in the past (e.g. rising corporatisation) even putting aside salary comparisons, and access to compute. So I do worry somewhat about how many PhD students in AI safety adjacent fields can actually be produced per year this decade.
This comment of mine is a bit cheeky, since there are plenty of theoretical computer scientists who think about characterising terms as fixed points, and logic programming is a whole discipline that is about characterising the problem rather than constructing a solution, but broadly speaking I think it is true among less theoretically-minded folks that "program" means "thing constructed step by step from atomic pieces".
Maybe I can clarify a few points here:
- A statistical model is regular if it is identifiable and the Fisher information matrix is everywhere nondegenerate. Statistical models where the prediction involves feeding samples from the input distribution through neural networks are not regular.
- Regular models are the ones for which there is a link between low description length and low free energy (i.e. the class of models which the Bayesian posterior tends to prefer are those that are assigned lower description length, at the same level of accuracy); the formulas behind this are sketched after this list.
- It's not really accurate to describe regular models as "typical", especially not on LW where we are generally speaking about neural networks when we think of machine learning.
- It's true that the example presented in this post is, potentially, not typical (it's not a neural network nor is it a standard kind of statistical model). So it's unclear to what extent this observation generalises. However, it does illustrate the general point that it is a mistake to presume that intuitions based on regular models hold for general statistical models.
- A pervasive failure mode in modern ML is to take intuitions developed for regular models, and assume they hold "with some caveats" for neural networks. We have at this point many examples where this leads one badly astray, and in my opinion the intuition I see widely shared here on LW about neural network inductive biases and description length falls into this bucket.
- I don't claim to know the content of those inductive biases, but my guess is that it is much more interesting and complex than "something like description length".
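To spell out the link mentioned in the second point (standard asymptotics, stated loosely): for a regular model with $d$ parameters the free energy expands as the BIC,

$$ F_n \approx n L_n(w^*) + \frac{d}{2}\log n, $$

so the description-length-like term $\frac{d}{2}\log n$ and the posterior's preference move together, whereas for singular models the correct expansion is

$$ F_n \approx n L_n(w^*) + \lambda \log n - (m-1)\log\log n + O(1), \qquad \lambda \le \frac{d}{2}, $$

and the naive parameter-count notion of description length no longer tracks the posterior.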
Yes, good point, but if the prior is positive it drops out of the asymptotic as it doesn't contribute to the order of vanishing, so you can just ignore it from the start.
There was a sign error somewhere; you should be getting $+\lambda$ (the coefficient of $\log n$) and $-(m-1)$ (the coefficient of $\log\log n$). Regarding the integral from 0 to 1: since the powers involved are even you can integrate from 0 to 1 and double it, rather than integrating from -1 to 1 (sorry if this doesn't map exactly onto your calculation, I didn't read all the details).
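To illustrate where the $+\lambda$ and $-(m-1)$ come from (a toy integral, not your specific one): with $K(w) = w^4$ on $[-1,1]$ and uniform prior,

$$ Z(n) = \int_{-1}^{1} e^{-n w^4}\, dw = 2\int_{0}^{1} e^{-n w^4}\, dw \approx 2\,\Gamma(5/4)\, n^{-1/4}, $$

so $-\log Z(n) \approx \tfrac{1}{4}\log n + \text{const}$, i.e. $\lambda = 1/4$ and $m = 1$. An example with multiplicity, like $K(w_1, w_2) = w_1^2 w_2^2$, picks up an extra $(\log n)^{m-1}$ factor in $Z(n)$, hence the $-(m-1)\log\log n$ term in $-\log Z(n)$.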
There is some preliminary evidence in favour of the view that transformers approximate a kind of Bayesian inference in-context (by which I mean something like, they look at in-context examples and process them to represent in their activations something like a Bayesian posterior for some "inner" model based on those examples as samples, and then predict using the predictive distribution for that Bayesian posterior). I'll call the hypothesis that this is taking place "virtual Bayesianism".
I'm not saying you should necessarily believe that, for current generation transformers. But fwiw I put some probability on it, and if I had to predict one significant capability advance in the next generation of LLMs it would be that virtual Bayesianism becomes much stronger (in-context learning being a kind of primitive precursor).
Re: the points in your strategic upshots. Given the above, the following question seems quite important to me: putting aside transformers or neural networks, and just working in some abstract context where we consider Bayesian inference on a data distribution that includes sequences of various lengths (i.e. the kinds of distribution that elicits in-context learning), is there a general principle of Bayesian statistics according to which general-purpose search algorithms tend to dominate the Bayesian posterior?
In mathematical terms, what separates agents that could arise from natural selection from a generic agent?
To ask a more concrete question, suppose we consider the framework of DeepMind's Population Based Training (PBT), chosen just because I happen to be familiar with it (it's old at this point, not sure what the current thing is in that direction). This method will tend to produce a certain distribution over parametrised agents, different from the distribution you might get by training a single agent in traditional deep RL style. What are the qualitative differences in these inductive biases?
This is an open question. In practice it seems to work fine even at degenerate saddles (i.e. points where there are no negative eigenvalues of the Hessian but there are still descent directions, which show up at higher than second order in the Taylor series), in the sense that you can get sensible estimates and they indicate something about the way structure is developing, but the theory hasn't caught up yet.
I think there's no such thing as parameters, just processes that produce better and better approximations to parameters, and the only "real" measures of complexity have to do with the invariants that determine the costs of those processes, which in statistical learning theory are primarily geometric (somewhat tautologically, since the process of approximation is essentially a process of probing the geometry of the governing potential near the parameter).
From that point of view, trying to conflate parameters $w$ and $w'$ with $f_w = f_{w'}$ is naive, because $w$ and $w'$ aren't real, only processes that produce better approximations to them are real, and so the derivatives of the potential which control such processes are deeply important, and those could be quite different at $w$ and $w'$ despite the parameters representing quite similar functions.
So I view "local geometry matters" and "the real thing are processes approximating parameters, not parameters" as basically synonymous.
You might reconstruct your sacred Jeffreys prior with a more refined notion of model identity, which incorporates derivatives (jets on the geometric/statistical side and more of the algorithm behind the model on the logical side).
Except nobody wants to hear about it at parties.
You seem to do OK...
If they only would take the time to explain things simply you would understand.
This is an interesting one. I field this comment quite often from undergraduates, and it's hard to carve out enough quiet space in a conversation to explain what they're doing wrong. In a way the proliferation of math on YouTube might be exacerbating this hard step from tourist to troubadour.
As a supervisor of numerous MSc and PhD students in mathematics, I see that when someone finishes a math degree and considers a job, the tradeoffs are usually between meaning, income, freedom, evil, etc., with some of the obvious choices being high/low along (relatively?) obvious axes. It's extremely striking to see young talented people with math or physics (or CS) backgrounds going into technical AI alignment roles in big labs, apparently maximising along many (or all) of these axes!
Especially in light of recent events I suspect that this phenomenon, which appears too good to be true, actually is.
Please develop this question as a documentary special, for lapsed-Starcraft player homeschooling dads everywhere.
Thanks for setting this up!
I don't understand the strong link between Kolmogorov complexity and generalisation you're suggesting here. I think by "generalisation" you must mean something more than "low test error". Do you mean something like "out of distribution" generalisation (whatever that means)?
Well neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).
I think that expression of Jesse's is also correct, in context.
However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) from my other post, and the use of terms like "generalisation" in broad terms in the more expository parts (as opposed to the technical parts) arguably doesn't make enough effort to prevent them from drawing these inferences.
I have noticed people at the Berkeley meeting and elsewhere believing (ii) was somehow resolved by SLT, or just in a vague sense thinking SLT says something more than it does. While there are hard tradeoffs to make in writing expository work, I think your criticism of this aspect of the messaging around SLT on LW is fair and to the extent it misleads people it is doing a disservice to the ongoing scientific work on this important subject.
I'm often critical of the folklore-driven nature of the ML literature and what I view as its low scientific standards, and especially in the context of technical AI safety I think we need to aim higher, in both our technical and more public-facing work. So I'm grateful for the chance to have this conversation (and to anybody reading this who sees other areas where they think we're falling short, read this as an invitation to let me know, either privately or in posts like this).
I'll discuss the generalisation topic further with the authors of those posts. I don't want to pre-empt their point of view, but it seems likely we may go back and add some context on (i) vs (ii) in those posts or in comments, or we may just refer people to this post for additional context. Does that sound reasonable?
At least right now, the value proposition I see of SLT lies not in explaining the "generalisation puzzle" but in understanding phase transitions and emergent structure; that might end up circling back to say something about generalisation, eventually.
However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map)
Seems reasonable to me!
Re: the articles you link to. I think the second one by Carroll is quite careful to say things like "we can now understand why singular models have the capacity to generalise well" which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion.
I agree that Jesse's post has a title "Neural networks generalize because of this one weird trick" which is clickbaity, since SLT does not in fact yet explain why neural networks appear to generalise well on many natural datasets. However the actual article is more nuanced, saying things like "SLT seems like a promising route to develop a better understanding of generalization and the limiting dynamics of training". Jesse gives a long list of obstacles to walking this route. I can't find anything in the post itself to object to. Maybe you think its optimism is misplaced, and fair enough.
So I don't really understand what claims about inductive bias or generalisation behaviour in these posts you think is invalid?
I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don't think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?
That seems probable. Maybe it's useful for me to lay out a more or less complete picture of what I think SLT does say about generalisation in deep learning in its current form, so that we're on the same page. When people refer to the "generalisation puzzle" in deep learning I think they mean two related but distinct things:
(i) the general question about how it is possible for overparametrised models to have good generalisation error, despite classical interpretations of Occam's razor like the BIC
(ii) the specific question of why neural networks, among all possible overparametrised models, actually have good generalisation error in practice (saying this is possible is much weaker than actually explaining why it happens).
In my mind SLT comes close to resolving (i), modulo a bunch of questions which include: whether the asymptotic limit taking the dataset size to infinity is appropriate in practice, the relationship between Bayesian generalisation error and test error in the ML sense (which comes down largely to Bayesian posterior vs SGD), and whether hypotheses like relatively finite variance are appropriate in the settings we care about. If all those points were treated in a mathematically satisfactory way, I would feel that the general question is completely resolved by SLT.
Informally, knowing SLT just dispels the mystery of (i) sufficiently that I don't feel personally motivated to resolve all these points, although I hope people work on them. One technical note on this: there are some brief notes in SLT6 arguing that "test error" as a model selection principle in ML, presuming some relation between the Bayesian posterior and SGD, is similar to selecting models based on what Watanabe calls the Gibbs generalisation error, which is computed by both the RLCT and singular fluctuation. Since I don't think it's crucial to our discussion I'll just elide the difference between Gibbs generalisation error in the Bayesian framework and test error in ML, but we can return to that if it actually contains important disagreement.
Anyway I'm guessing you're probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).
Any theoretical resolution to (ii) has to involve some nontrivial ingredient that actually talks about neural networks, as opposed to general singular statistical models. The only specific results about neural networks and generalisation in SLT are the old results about RLCTs of tanh networks, more recent bounds on shallow ReLU networks, and Aoyagi's upcoming results on RLCTs of deep linear networks (particularly that the RLCT is bounded above even when you take the depth to infinity).
As I currently understand them, these results are far from resolving (ii). In its current form SLT doesn't supply any deep reason for why neural networks in particular are often observed to generalise well when you train them on a range of what we consider "natural" datasets. We don't understand what distinguishes neural networks from generic singular models, nor what we mean by "natural". These seem like hard problems, and at present it looks like one has to tackle them in some form to really answer (ii).
Maybe that has significant overlap with the critique of SLT you're making?
Nonetheless I think SLT reduces the problem in a way that seems nontrivial. If we boil the "ML in-practice model selection" story down to "choose the model with the best test error given fixed training steps" and allow some hand-waving in the connection between training steps and number of samples, Gibbs generalisation error and test error etc, and use Watanabe's theorems (see Appendix B.1 of the quantifying degeneracy paper for a local formulation) to write the Gibbs generalisation error as

$$ \mathbb{E}[G_n] \approx L_0 + \frac{\lambda + \nu}{n}, $$

where $\lambda$ is the learning coefficient, $\nu$ is the singular fluctuation and $L_0$ is roughly the loss (the quantity that we can estimate from samples is actually slightly different, I'll elide this), then (ii), which asks why neural networks on natural datasets have low generalisation error, is at least reduced to the question of why neural networks on natural datasets have low $\lambda + \nu$.
I don't know much about this question, and agree it is important and outstanding.
Again, I think this reduction is not trivial, since the link between $\lambda + \nu$ and generalisation error is nontrivial. Maybe at the end of the day this is the main thing we in fact disagree on :)
The easiest way to explain why this is the case will probably be to provide an example. Suppose we have a Bayesian learning machine with 15 parameters, whose parameter-function map is given by
and whose loss function is the KL divergence. This learning machine will learn 4-degree polynomials. Moreover, it is overparameterised, and its loss function is analytic in its parameters, etc, so SLT will apply to it.
In your example there are many values of the parameters that encode the zero function (e.g. by setting a few of the parameters to zero and leaving all the other parameters free) in addition to there being many parameters that encode the other function in your example. Without thinking about it more I'm not sure which one actually has the lower local learning coefficient (RLCT) and therefore counts as "more simple" from an SLT perspective.
However, if I understand correctly it's not this specific example that you care about. We can agree that there is some way of coming up with a simple model which (a) can represent both of the functions (call them $f_1$ and $f_2$) and (b) has parameters $w_1$ and $w_2$ respectively representing these functions, with local learning coefficients satisfying $\lambda(w_1) > \lambda(w_2)$. That is, according to the local learning coefficient as a measure of model complexity, the neighbourhood of the parameter $w_1$ is more complex than that of $w_2$. I believe your observation is that this contradicts an a priori notion of complexity that you hold about these functions.
Is that a fair characterisation of the argument you want to make?
Assuming it is, my response is as follows. I'm guessing you think $f_1$ is simpler than $f_2$ because the former function can be encoded by a shorter code on a UTM than the latter. But this isn't the kind of complexity that SLT talks about: the local learning coefficient that appears in the main theorems represents the complexity of representing a given probability distribution using parameters from the model, and is not some intrinsic model-free complexity of the distribution itself.
One way of saying it is that Kolmogorov complexity is the entropy cost of specifying a machine on the description tape of a UTM (a kind of absolute measure) whereas the local learning coefficient is the entropy cost per sample of incrementally refining an almost true parameter in the neural network parameter space (a kind of relative measure). I believe they're related but not the same notion, as the latter refers fundamentally to a search process that is missing in the former.
We can certainly imagine a learning machine set up in such a way that it is prohibitively expensive to refine an almost true parameter near a solution that looks like $f_1$ and very cheap to refine an almost true parameter near a solution like $f_2$, despite it being against our natural inclination to think of the former as simpler. It's about the nature of the refinement / search process, not directly about the intrinsic complexity of the functions.
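To make the "entropy cost of refining" picture slightly more concrete (using Watanabe's volume asymptotics; notation mine): if $V(\varepsilon)$ is the prior volume of parameters with $K(w) < \varepsilon$, then near a critical point

$$ V(\varepsilon) \approx c\, \varepsilon^{\lambda} (-\log \varepsilon)^{m-1}, \qquad \log_2 \frac{V(\varepsilon)}{V(\varepsilon/2)} \approx \lambda, $$

so halving the tolerance costs roughly $\lambda$ bits wherever you are in the refinement process, whereas the Kolmogorov complexity is a cost paid once, up front, on the description tape.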
So we agree that Kolmogorov complexity and the local learning coefficient are potentially measuring different things. I want to dig deeper into where our disagreement lies, but I think I'll just post this as-is and make sure I'm not confused about your views up to this point.
First of all, SLT is largely based on examining the behaviour of learning machines in the limit of infinite data
I have often said that SLT is not yet a theory of deep learning, this question of whether the infinite data limit is really the right one being one of the main question marks I currently see (I think I probably also see the gap between Bayesian learning and SGD as bigger than you do).
I've discussed this a bit with my colleague Liam Hodgkinson, whose recent papers https://arxiv.org/abs/2307.07785 and https://arxiv.org/abs/2311.07013 might be more up your alley than SLT.
My view is that the validity of asymptotics is an empirical question, not something that is settled at the blackboard. So far we have been pleasantly surprised at how well the free energy formula works at relatively low $n$ (in e.g. https://arxiv.org/abs/2310.06301). It remains an open question whether this asymptotic continues to provide useful insight into larger models with the kind of dataset size we're using in LLMs for example.
I think that the significance of SLT is somewhat over-hyped at the moment
Haha, on LW that is either already true or at current growth rates will soon be true, but it is clearly also the case that SLT remains basically unknown in the broader deep learning theory community.
I claim that this is fairly uninteresting, because classical statistical learning theory already gives us a fully adequate account of generalisation in this setting which applies to all learning machines, including neural networks
I'm a bit familiar with the PAC-Bayes literature and I think this might be an exaggeration. The linked post merely says that the traditional PAC-Bayes setup must be relaxed, and sketches some ways of doing so. Could you please cite the precise theorem you have in mind?
Very loosely speaking, regions with a low RLCT have a larger "volume" than regions with high RLCT, and the impact of this fact eventually dominates other relevant factors.
I'm going to make a few comments as I read through this, but first I'd like to thank you for taking the time to write this down, since it gives me an opportunity to think through your arguments in a way I wouldn't have done otherwise.
Regarding the point about volume. It is true that the RLCT can be written as

$$ \lambda = \lim_{\varepsilon \to 0} \frac{\log\!\big( V(a\varepsilon)/V(\varepsilon) \big)}{\log a} \quad (\text{for any fixed } a > 0,\ a \neq 1) $$

(Theorem 7.1 of Watanabe's book "Algebraic Geometry and Statistical Learning Theory") where $V(\varepsilon)$ is the volume (according to the measure associated to the prior) of the set of parameters with KL divergence between the model and truth less than $\varepsilon$. For small $\varepsilon$ we have $V(\varepsilon) \approx c\,\varepsilon^{\lambda}(-\log\varepsilon)^{m-1}$ where $m$ is the multiplicity. Thus near critical points with lower RLCT, small changes in the cutoff $\varepsilon$ near $\varepsilon = 0$ tend to change the volume of the set of almost true parameters more than near critical points with higher RLCTs.
My impression is that you tend to see this as a statement about flatness, holding over macroscopic regions of parameter space, and so you read the asymptotic formula for the free energy of a region $W$ of parameter space containing a critical point $w^*$,

$$ F_n(W) \approx n L_n(w^*) + \lambda(w^*) \log n, $$

as having a $\lambda(w^*)\log n$ term that does little more than prefer critical points that tend to dominate large regions of parameter space according to the prior. If that were true, I would agree this would be underwhelming (or at least, precisely as "whelming" as the BIC, and therefore not adding much beyond the classical story).
However this isn't what the free energy formula says. Indeed the volume of the region $W$ contributes only to the constant order term (this is sketched in Chen et al).
I claim it's better to think of the learning coefficient as being a measure of how many bits it takes to specify an almost true parameter with $K(w) < \varepsilon/2$ once you know a parameter with $K(w) < \varepsilon$, which is a "microscopic" rather than a "macroscopic" statement. That is, lower $\lambda$ means that a fixed decrease in the cutoff $\varepsilon$ is "cheaper" in terms of entropy generated.
So the free energy formula isn't saying "critical points dominating large regions tend to dominate the posterior at large $n$" but rather "critical points near which it takes fewer bits / less entropy to achieve a fixed cutoff $\varepsilon$ dominate the posterior for large $n$". The former statement is both false and uninteresting, the second statement is true and interesting (or I think so anyway).
Good question. What counts as a "$k^-$" is spelled out in the paper, but it's only outlined here heuristically. The "5-like" thing it seems to go near on the way down is not actually a critical point.
The change in the matrix W and the bias b happen at the same time, it's not a lagging indicator.
SLT predicts when this will happen!
Maybe. This is potentially part of the explanation for "data double descent" although I haven't thought about it beyond the 5min I spent writing that page and the 30min I spent talking about it with you at the June conference. I'd be very interested to see someone explore this more systematically (e.g. in the setting of Anthropic's "other" TMS paper https://www.anthropic.com/index/superposition-memorization-and-double-descent which contains data double descent in a setting where the theory of our recent TMS paper might allow you to do something).
There is quite a large literature on "stage-wise development" in neuroscience and psychology, going back to people like Piaget but quite extensively developed in both theoretical and experimental directions. One concrete place to start on the agenda you're outlining here might be to systematically survey that literature from an SLT-informed perspective.
we can copy the relevant parts of the human brain which does the things our analysis of our models said they would do wrong, either empirically (informed by theory of course), or purely theoretically if we just need a little bit of inspiration for what the relevant formats need to look like.
I struggle to follow you guys in this part of the dialogue, could you unpack this a bit for me please?
Though I'm not fully confident that is indeed what they did
The k-gons are critical points of the loss, and as $n$ varies the free energy is determined by integrals restricted to neighbourhoods of these critical points in weight space.