Posts
Comments
Is there a reason why the Pearson correlation coefficient of the data in Figure 14 is not reported? This correlation is referred to numerous times throughout the paper.
There's no general theoretical reason that I am aware of to expect a relation between the L2 norm and the LLC. The LLC is the coefficient of the log(n) term in the asymptotic expansion of the free energy (the negative logarithm of the integral of the posterior over a local region, as a function of sample size n), while the L2 norm of the parameter shows up in the constant-order term of that same expansion, if you're taking a Gaussian prior.
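For reference, a sketch of the expansion in question (in roughly Watanabe's notation, with L_n the empirical negative log likelihood at the optimum w_0, λ the local learning coefficient, and m its multiplicity):

```latex
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_P(1)
```

With a Gaussian prior of variance \sigma^2, a term of the form \|w_0\|^2 / 2\sigma^2 enters only through the constant-order O_P(1) part, which is why there is no general link between the L2 norm and \lambda.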
It might be that in particular classes of neural networks there is some architecture-specific correlation between the L2 norm and the LLC, but I am not aware of any experimental or theoretical evidence for that.
For example, in the figure below from Hoogland et al. 2024 we see that there are later stages of training in a transformer trained to do linear regression in-context (blue shaded regions) where the LLC is decreasing but the L2 norm is increasing. So the model is moving towards a "simpler" parameter with larger weight norm.
My best current guess is that it happens to be, in the grokking example, that the simpler solution has smaller weight norm. This could be true in many synthetic settings, for all I know; however, in general, it is not the case that complexity (at least as far as SLT is concerned) and weight norm are correlated.
That simulation sounds cool. The talk certainly doesn't contain any details and I don't have a mathematical model to share at this point. One way to make this more concrete is to think through Maxwell's demon as an LLM, for example in the context of Feynman's lectures on computation. The literature on thermodynamics of computation (various experts, like Adam Shai and Paul Riechers, are around here and know more than me) implicitly or explicitly touches on relevant issues.
The analogous laws are just information theory.
Re: a model trained on random labels. This seems somewhat analogous to building a power plant out of dark matter; to derive physical work it isn't enough to have some degrees of freedom somewhere that have a lot of energy, one also needs a chain of couplings between those degrees of freedom and the degrees of freedom you want to act on. Similarly, if I want to use a model to reduce my uncertainty about something, I need to construct a chain of random variables with nonzero mutual information linking the question in my head to the predictive distribution of the model.
To take a concrete example: I am thinking about a chemistry question, and there are four choices A, B, C, D. Without any information other than these letters the model cannot reduce my uncertainty (say I begin with equal belief in all four options). However, if I provide a prompt describing the question, and the model has been trained on chemistry, then this information sets up a correspondence between this distribution over four letters and something the model knows about; its answer may then leave me equally uncertain between A and B but knowing C and D are wrong (a change of 1 bit in my entropy).
Since language models are good general compressors this seems to work in reasonable generality.
Ideally we would like the model to push our distribution towards true answers, but it doesn't necessarily know true answers, only some approximation; thus the work being done is nontrivially directed, and has a systematic overall effect due to the nature of the model's biases.
I don't know about evolution. I think it's right that the perspective has limits and can just become some empty slogans outside of some careful usage. I don't know how useful it is in actually technically reasoning about AI safety at scale, but it's a fun idea to play around with.
Marcus Hutter on AIXI and ASI safety
Yes this seems like an important question but I admit I don't have anything coherent to say yet. A basic intuition from thermodynamics is that if you can measure the change in the internal energy between two states, and the heat transfer, you can infer how much work was done even if you're not sure how it was done. So maybe the problem is better thought of as learning to measure enough other quantities that one can infer how much cognitive work is being done.
For all I know there is a developed thermodynamic theory of learning agents out there which already does this, but I didn't find it yet...
The description of love at the conclusion of Gene Wolfe's The Wizard gets at something important, if you read it as something that both parties are simultaneously doing.
The work of Ashby I'm familiar with is "An Introduction to Cybernetics" and I'm referring to the discussion in Chapter 11 there. The references you're giving seem to be invoking the "Law" of requisite variety in the context of arguing that an AGI has to be relatively complex in order to maintain homeostasis in a complex environment, but this isn't the application of the law I have in mind.
From the book:
The law of Requisite Variety says that R's capacity as a regulator cannot exceed R's capacity as a channel of communication.
In the form just given, the law of Requisite Variety can be shown in exact relation to Shannon's Theorem 10, which says that if noise appears in a message, the amount of noise that can be removed by a correction channel is limited to the amount of information that can be carried by that channel.
Thus, his "noise" corresponds to our "disturbance", his "correction channel" to our "regulator R", and his "message of entropy H" becomes, in our case, a message of entropy zero, for it is constancy that is to be "transmitted": Thus the use of a regulator to achieve homeostasis and the use of a correction channel to suppress noise are homologous.
and
A species continues to exist primarily because its members can block the flow of variety (thought of as disturbance) to the gene-pattern, and this blockage is the species’ most fundamental need. Natural selection has shown the advantage to be gained by taking a large amount of variety (as information) partly into the system (so that it does not reach the gene-pattern) and then using this information so that the flow via R blocks the flow through the environment T.
This last quote makes clear I think what I have in mind: the environment is full of advanced AIs, they provide disturbances D, and in order to regulate the effects of those disturbances on our "cognitive genetic material" there is some requirement on the "correction channel". Maybe this seems a bit alien to the concept of control. There's a broader set of ideas I'm toying with, which could be summarised as something like "reciprocal control" where you have these channels of communication / regulation going in both directions (from human to machine, and vice versa).
The Queen's Dilemma was a little piece of that picture, which attempts to illustrate this bi-directional control flow by having the human control the machine (by setting its policy, say) and the machine control the human (in an emergent fashion, that being the dilemma).
Is restricting human agency fine if humans have little control over where it is restricted and to what degree?
Re: your first point. I think I'm still a bit confused here and that's partly why I wanted to write this down and have people poke at it. Following Sen (but maybe I'm misunderstanding him) I'm not completely convinced I know how to factor human agency into "winning". One part of me wants to say that whatever notion of agency I have, in some sense it's a property of world states and in principle I could extract it with enough monitoring of my brain or whatever, and then any prescribed tradeoff between "measured sense of agency" and "score" is something I could give to the machine as a goal.
So then I end up with the machine giving me the precise amount of leeway that lets me screw up the game just right for my preferences.
I don't see a fundamental problem with that, but it's also not the part of the metaphor that seems most interesting to me. What I'm more interested in is human inferiority as a pattern, and the way that pattern pervades the overall system and translates into computational structure, perhaps in surprising and indirect ways.
I'll reply in a few branches. Re: stochastic chess. I think there's a difference between a metaphor and a toy model; this is a metaphor, and the ingredients are chosen to illustrate in a microcosm some features I think are relevant about the full picture. The speed differential, and some degree of stochasticity, are aspects of human intervention in AI systems that seem meaningful to me.
I do agree that if one wanted to isolate the core phenomena here mathematically and study it, chess might not be the right toy model.
The metaphor is a simplification, in practice I think it is probably impossible to know whether you have achieved complete alignment. The question is then: how significant is the gap? If there is an emergent pressure across the vast majority of learning machines that dominate your environment to push you from de facto to de jure control, not due to malign intent but just as a kind of thermodynamic fact, then the alignment gap (no matter how small) seems to loom larger.
Re: the SLT dogma.
For those interested, a continuous version of the padding argument is used in Theorem 4.1 of Clift-Murfet-Wallbridge to show that the learning coefficient is a lower bound on the Kolmogorov complexity (in a sense) in the setting of noisy Turing machines. Just take the synthesis problem to be given by a TM's input-output map in that theorem. The result is treated in a more detailed way in Waring's thesis (Proposition 4.19). Noisy TMs are of course not neural networks, but they are a place where the link between the learning coefficient in SLT and algorithmic information theory has already been made precise.
For what it's worth, as explained in simple versus short, I don't actually think the local learning coefficient is algorithmic complexity (in the sense of program length) in neural networks, only that it is a lower bound. So I don't really see the LLC as a useful "approximation" of the algorithmic complexity.
For those wanting to read more about the padding argument in the classical setting, Hutter-Catt-Quarel "An Introduction to Universal Artificial Intelligence" has a nice detailed treatment.
Typo, I think you meant singularity theory :p
Modern mathematics is less about solving problems within established frameworks and more about designing entirely new games with their own rules. While school mathematics teaches us to be skilled players of pre-existing mathematical games, research mathematics requires us to be game designers, crafting rule systems that lead to interesting and profound consequences.
I don't think so. This probably describes the kind of mathematics you aspire to do, but still the bulk of modern research in mathematics is in fact about solving problems within established frameworks and usually such research doesn't require us to "be game designers". Some of us are of course drawn to the kinds of frontiers where such work is necessary, and that's great, but I think this description undervalues the within-paradigm work that is the bulk of what is going on.
It might be worth knowing that some countries are participating in the "network" without having formal AI safety institutes.
I hadn't seen that Wattenberg-Viegas paper before, nice.
Yeah actually Alexander and I talked about that briefly this morning. I agree that the crux is "does this basic kind of thing work" and given that the answer appears to be "yes" we can confidently expect scale (in both pre-training and inference compute) to deliver significant gains.
I'd love to understand better how the RL training for CoT changes the representations learned during pre-training.
My observation from the inside is that size and bureaucracy in universities have something to do with what you're talking about, but more to do with a kind of "organisational overfitting", where small variations of the organisation's experience that included negative outcomes are responded to with internal processes that necessitate headcount (aligning the incentives for response with what you're talking about).
I think self-repair might have lower free energy, in the sense that if you had two configurations of the weights which "compute the same thing", but one of them has self-repair for a given behaviour and one doesn't, then the one with self-repair will have lower free energy (which is just a way of saying that if you integrate the Bayesian posterior in a neighbourhood of both, the one with self-repair gives you a higher number, i.e. it's preferred).
That intuition is based on some understanding of what controls the asymptotic (in the dataset size) behaviour of the free energy (which is -log(integral of posterior over region)) and the example in that post. But to be clear it's just intuition. It should be possible to empirically check this somehow but it hasn't been done.
Basically the argument is self-repair => robustness of behaviour to small variations in the weights => low local learning coefficient => low free energy => preferred
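A sketch of the middle steps: if two regions of parameter space achieve the same loss L but have learning coefficients \lambda_a < \lambda_b, the free energy asymptotics give

```latex
F_n(a) - F_n(b) \approx (\lambda_a - \lambda_b) \log n < 0,
```

so the ratio of posterior mass between the two regions grows like n^{\lambda_b - \lambda_a}, in favour of the lower-\lambda (here, self-repairing) configuration.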
I think by "specifically" you might be asking for a mechanism which causes the self-repair to develop? I have no idea.
It's a fascinating phenomenon. If I had to bet I would say it isn't a coping mechanism but rather a particular manifestation of a deeper inductive bias of the learning process.
In terms of more subtle predictions. In the Berkeley Primer in mid-2023, based on elementary manipulations of the free energy formula, I predicted we should see phase transitions / developmental stages where the loss stays relatively constant but the LLC (model complexity) decreases.
We noticed one such stage in the language models, and two in the linear regression transformers in the developmental landscape paper. We only partially understood them there, but we've seen more behaviour like this in the upcoming work I mentioned in my other post, and we feel more comfortable now linking it to phenomena like "pruning" in developmental neuroscience. This suggests some interesting connections with loss of plasticity (i.e. we see many components have LLC curves that go up, then come down, and one would predict after this decrease the components are more resistant to being changed by further training).
These are potentially consequential changes in model computation that are (in these examples) arguably not noticeable in the loss curve, and it's not obvious to me how you would be confident of noticing them from other metrics you would have thought to track. In each case the stage might correlate with something, say the magnitude of layer norm weights, but out of the thousands of things you could measure it's unclear to me why you would a priori associate any one such signal with a change in model computation unless you knew it was linked to the LLC curve. Things like the FIM trace or Hessian trace might also reflect the change. However, in the second such stage in the linear regression transformer (LR4) this seems not to be the case.
I think that's right, in the sense that this explains a large fraction of our difference in views.
I'm a mathematician, so I suppose in my cosmology we've already travelled 99% of the distance from the upper reaches of the theory stratosphere to the ground and the remaining distance doesn't seem like such an obstacle, but it's fair to say that the proof is in the pudding and the pudding has yet to arrive.
If SLT were to say nontrivial things about what instruction fine-tuning and RLHF are doing to models, and those things were verified in experiments, would that shift your skepticism?
I've been reading some of your other writing:
However, we think that absent substantial advances in science, we're unlikely to develop approaches which substantially improve safety-in-practice beyond baseline methods (e.g., training with RLHF and applying coup probes) without the improvement being captured by black-box control evaluations. We might discuss and argue for this in more detail in a follow-up post.
Could you explain why you are skeptical that current baseline methods can be dramatically improved? It seems possible to me that the major shortcomings of instruction fine-tuning and RLHF (that they seem to make shallow changes to representations and computation) are not fundamental. Maybe it's naive because I haven't thought about this very hard, but from our point of view representations "mature" over development and become rather rigid; however, maybe there's something like Yamanaka factors!
Even from the perspective of black-box control, it seems that as a practical matter one could extract more useful work if the thing in the box is more aligned, and thus it seems you would agree that fundamental advances in these baseline methods would be welcome.
Incidentally, I don't really understand what you mean by "captured by black-box control evaluations". Was there a follow-up?
The case for singular learning theory (SLT) in AI alignment is just the case for Bayesian statistics in alignment, since SLT is a mathematical theory of Bayesian statistics (with some overly restrictive hypotheses in the classical theory removed).
At a high level the case for Bayesian statistics in alignment is that if you want to control engineering systems that are learned rather than designed, and if that learning means choosing parameters that have high probability with respect to some choice of dataset and model, then it makes sense to understand what the basic structure of that kind of Bayesian learning is (I’ll put aside the potential differences between SGD and Bayesian statistics, since these appear not to be a crux here). I claim that this basic structure is not yet well-understood, that it is nonetheless possible to make fundamental progress on understanding it at both a theoretical and empirical level, and that this understanding will be useful for alignment.
The learning process in Bayesian statistics (what Watanabe and we call the singular learning process) is fundamental, and applies not only to training neural networks, but also to fine-tuning and also to in-context learning. In short, if you expect deep learning models to be “more optimal” over time, and for example to engage in more sophisticated kinds of learning in context (which I do), then you should expect that understanding the learning process in Bayesian statistics should be even more highly relevant in the future than it is today.
One part of the case for Bayesian statistics in alignment is that many questions in alignment seem to boil down to questions about generalisation. If one is producing complex systems by training them to low loss (and perhaps also throwing out models that have low scores on some safety benchmark) then in general there will be many possible configurations with the same low loss and high safety scores. This degeneracy is the central point of SLT. The problem is: how can we determine which of the possible solutions actually realises our intent?
The problem is that our intent is either not entirely encoded in the data, or we cannot be sure that it is, so that questions of generalisation are arguably central in alignment. In present day systems, where alignment engineering looks like shaping the data distribution (e.g. instruction fine-tuning) then a precise form of this question is how models generalise from the (relatively) small number of demonstrations in the fine-tuning dataset.
It therefore seems desirable to have scalable empirical tools for reasoning about generalisation in large neural networks. The learning coefficient in SLT is the obvious theoretical quantity to investigate (in the precise sense that two solutions with the same loss will be differently preferred by the Bayesian posterior, with the one that is “simplest” i.e. has lower learning coefficient, being preferred). That is what we have been doing. One should view the empirical work Timaeus has undertaken as being an exercise in validating that learning coefficient estimation can be done at scale, and reflects real things about networks (so we study situations where we can independently verify things like developmental stages).
Naturally the plan is to take that tool and apply it to actual problems in alignment, but there’s a limit to how fast one can move and still get everything right. I think we’re moving quite fast. In the next few weeks we’ll be posting two papers to the arXiv:
- G. Wang, J. Hoogland, S. van Wingerden, Z. Furman, D. Murfet “Differentiation and Specialization in Language Models via the Restricted Local Learning Coefficient” introduces the weight and data-restricted LLCs and shows that (a) attention heads in a 3M parameter transformer differentiate over training in ways that are tracked by the weight-restricted LLC, (b) some induction heads are partly specialized to code, and this is reflected in the data-restricted LLC on code-related tasks, (c) attention heads follow the pattern that their weight-restricted LLCs first increase then decrease, which appears similar to the critical periods studied by Achille-Rovere-Soatto.
- L. Carroll, J. Hoogland, D. Murfet "Retreat from Ridge: Studying Algorithm Choice in Transformers using Essential Dynamics" studies the retreat-from-ridge phenomenon following Raventós et al. and resolves the mystery of apparent non-Bayesianism there, by showing that over training for an in-context linear regression problem there is a tradeoff between in-context ridge regression (a simple but high-error solution) and another solution more specific to the dataset (which is more complex but lower error). This gives an example of the "accuracy vs simplicity" tradeoff made quantitative by the free energy formula in SLT.
Your concerns about phase transitions (there being potentially too many of them, or this being a bit of an ill-posed framing for the learning process) are well-taken, and indeed these were raised as questions in our original post. The paper on restricted LLCs is basically our response to this.
I think you might buy the high level argument for the role of generalisation in alignment, and understand that SLT says things about generalisation, but wonder if that ever cashes out in something useful. Obviously I believe so, but I'd rather let the work speak for itself. In the next few days there will be a Manifund page explaining our upcoming projects, including applying the LLC estimation techniques we have now proven, to studying things like safety fine-tuning and deceptive alignment in the setting of the “sleeper agents” work.
One final comment. Let me call "inductive strength" the number of empirical conclusions you can draw from some kind of evidence. I claim the inductive strength of fundamental theory validated in experiments is far greater than that of experiments not grounded in theory; the ML literature is littered with the corpses of one-off experiments + stories that go nowhere. In my mind this is not what a successful science and engineering practice of AI alignment looks like.
The value of the empirical work Timaeus has done to date largely lies in validating the fundamental claims made by SLT about the singular learning process, and seeing that it applies to systems like small language models. To judge that empirical work by the standard of other empirical work divorced from a deeper set of claims, i.e. purely by “the stuff that it finds”, is to miss the point (to be fair we could communicate this better, but I find it sounds antagonistic written down, as it may do here).
I think scaffolding is the wrong metaphor. Sequences of actions, observations and rewards are just more tokens to be modeled, and if I were running Google I would be busy instructing all work units to start packaging up such sequences of tokens to feed into the training runs for Gemini models. Many seemingly minor tasks (e.g. app recommendation in the Play store) either have, or could have, components of RL built into the pipeline, and could benefit from incorporating LLMs, either by putting the RL task in-context or through fine-tuning of very fast cheap models.
So when I say I don't see a distinction between LLMs and "short term planning agents" I mean that we already know how to subsume RL tasks into next token prediction, and so there is in some technical sense already no distinction. It's a question of how the underlying capabilities are packaged and deployed, and I think that within 6-12 months there will be many internal deployments of LLMs doing short sequences of tasks within Google. If that works, then it seems very natural to just scale up sequence length as generalisation improves.
Arguably fine-tuning a next-token predictor on action, observation, reward sequences, or doing it in-context, is inferior to using algorithms like PPO. However, the advantage of knowledge transfer from the rest of the next-token predictor's data distribution may more than compensate for this on some short-term tasks.
I think this will look a bit outdated in 6-12 months, when there is no longer a clear distinction between LLMs and short term planning agents, and the distinction between the latter and LTPAs looks like a scale difference comparable to GPT2 vs GPT3 rather than a difference in kind. At what point do you imagine a national government saying "here but no further?".
I don't recall what I said in the interview about your beliefs, but what I meant to say was something like what you just said in this post, apologies for missing the mark.
Mumble.
Indeed the integrals in the sparse case aren't so bad: https://arxiv.org/abs/2310.06301. I don't think the analogy to the Thomson problem is correct; it's similar but qualitatively different (there is a large literature on tight frames that is arguably more relevant).
Haha this is so intensely on-brand.
The kind of superficial linear extrapolation of trend lines can be powerful, perhaps more powerful than is usually accepted in many political/social/futurist discussions. In many cases, successful forecasters who bet on high-level trend lines outpredict 'experts'.
But it's a very non-gears-level model, and I think one should be very careful about using this kind of reasoning for tail events.
e.g. this kind of reasoning could lead one to reject development of nuclear weapons.
Agree. In some sense you have to invent all the technology before the stochastic process of technological development looks predictable to you, almost by definition. I'm not sure it is reasonable to ask general "forecasters" about questions that hinge on specific technological change. They're not oracles.
Do you mean the industry labs will take people with MSc and PhD qualifications in CS, math or physics etc and retrain them to be alignment researchers, or do you mean the labs will hire people with undergraduate degrees (or no degree) and train them internally to be alignment researchers?
I don't know how OpenAI or Anthropic look internally, but I know a little about Google and DeepMind through friends, and I have to say the internal incentives and org structure don't strike me as really a very natural environment for producing researchers from scratch.
I think many early-career researchers in AI safety are undervaluing PhDs.
I agree with this. To be blunt, it is my impression from reading LW for the last year that a few people in this community seem to have a bit of a chip on their shoulder Re: academia. It certainly has its problems, and academics love nothing more than pointing them out to each other, but you face your problems with the tools you have, and academia is the only system for producing high quality researchers that is going to exist at scale over the next few years (MATS is great, I'm impressed by what Ryan and co are doing, but it's tiny).
I would like to see many more academics in CS, math, physics and adjacent areas start supervising students in AI safety, and more young people go into those PhDs. Also, more people with PhDs in math and physics transitioning to AI safety work.
One problem is that many of the academics who are willing to supervise PhD students in AI safety or related topics are evaporating into industry positions (subliming?). There are also long run trends that make academia relatively less attractive than it was in the past (e.g. rising corporatisation) even putting aside salary comparisons, and access to compute. So I do worry somewhat about how many PhD students in AI safety adjacent fields can actually be produced per year this decade.
This comment of mine is a bit cheeky, since there are plenty of theoretical computer scientists who think about characterising terms as fixed points, and logic programming is a whole discipline that is about characterising the problem rather than constructing a solution, but broadly speaking I think it is true among less theoretically-minded folks that "program" means "thing constructed step by step from atomic pieces".
Maybe I can clarify a few points here:
- A statistical model is regular if it is identifiable and the Fisher information matrix is everywhere nondegenerate. Statistical models where the prediction involves feeding samples from the input distribution through neural networks are not regular.
- Regular models are the ones for which there is a link between low description length and low free energy (i.e. the class of models which the Bayesian posterior tends to prefer are those that are assigned lower description length, at the same level of accuracy).
- It's not really accurate to describe regular models as "typical", especially not on LW where we are generally speaking about neural networks when we think of machine learning.
- It's true that the example presented in this post is, potentially, not typical (it's not a neural network nor is it a standard kind of statistical model). So it's unclear to what extent this observation generalises. However, it does illustrate the general point that it is a mistake to presume that intuitions based on regular models hold for general statistical models.
- A pervasive failure mode in modern ML is to take intuitions developed for regular models, and assume they hold "with some caveats" for neural networks. We have at this point many examples where this leads one badly astray, and in my opinion the intuition I see widely shared here on LW about neural network inductive biases and description length falls into this bucket.
- I don't claim to know the content of those inductive biases, but my guess is that it is much more interesting and complex than "something like description length".
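A minimal sketch of the first bullet (a toy model I'm introducing for illustration, not one from the discussion above): for the regression model y = a*b*x + noise, any two parameters with the same product a*b define the same function, and the Fisher information matrix is rank one at every parameter, so the model is singular everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)  # samples from the input distribution

def fisher_information(a, b):
    """Empirical FIM for the regression model y = a*b*x + N(0, 1).
    The gradient of the prediction a*b*x is (b*x, a*x), and the FIM
    is E_x[grad f . grad f^T]."""
    grads = np.stack([b * x, a * x])   # shape (2, n)
    return grads @ grads.T / len(x)

I = fisher_information(a=0.7, b=-1.3)
print(np.linalg.det(I))  # ~0 up to floating point: rank 1 at every (a, b)
```

The same product-of-weights degeneracy appears wherever consecutive linear maps are composed, which is one reason models built from neural networks fail to be regular.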
Yes, good point, but if the prior is positive it drops out of the asymptotics, as it doesn't contribute to the order of vanishing, so you can just ignore it from the start.
There was a sign error somewhere, you should be getting + lambda and - (m-1). Regarding the integral from 0 to 1, since the powers involved are even you can do that and double it rather than -1 to 1 (sorry if this doesn't map exactly onto your calculation, I didn't read all the details).
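For concreteness, the simplest one-dimensional calculation of this kind (a sketch with potential K(w) = w^{2k} on [-1, 1], which may differ in details from the calculation being discussed):

```latex
Z(n) = \int_{-1}^{1} e^{-n w^{2k}} \, dw
     = 2 \int_{0}^{1} e^{-n w^{2k}} \, dw
     \sim \frac{\Gamma(1/2k)}{k} \, n^{-1/2k},
```

so -\log Z(n) \sim \frac{1}{2k} \log n, i.e. \lambda = 1/2k with multiplicity m = 1 (hence no \log\log n term in this example).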
There is some preliminary evidence in favour of the view that transformers approximate a kind of Bayesian inference in-context (by which I mean something like, they look at in-context examples and process them to represent in their activations something like a Bayesian posterior for some "inner" model based on those examples as samples, and then predict using the predictive distribution for that Bayesian posterior). I'll call the hypothesis that this is taking place "virtual Bayesianism".
I'm not saying you should necessarily believe that, for current generation transformers. But fwiw I put some probability on it, and if I had to predict one significant capability advance in the next generation of LLMs it would be to predict that virtual Bayesianism becomes much stronger (in-context learning being a kind of primitive pre-cursor).
Re: the points in your strategic upshots. Given the above, the following question seems quite important to me: putting aside transformers or neural networks, and just working in some abstract context where we consider Bayesian inference on a data distribution that includes sequences of various lengths (i.e. the kinds of distribution that elicits in-context learning), is there a general principle of Bayesian statistics according to which general-purpose search algorithms tend to dominate the Bayesian posterior?
In mathematical terms, what separates agents that could arise from natural selection from a generic agent?
To ask a more concrete question, suppose we consider the framework of DeepMind's Population Based Training (PBT), chosen just because I happen to be familiar with it (it's old at this point, not sure what the current thing is in that direction). This method will tend to produce a certain distribution over parametrised agents, different from the distribution you might get by training a single agent in traditional deep RL style. What are the qualitative differences in these inductive biases?
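For readers unfamiliar with PBT, here is a minimal sketch of its exploit/explore loop on a stand-in one-parameter objective (the real method trains neural networks in parallel; the objective, population size, and schedule below are all invented for illustration):

```python
import random

random.seed(0)

def score(theta):
    # Stand-in fitness: we want theta near 1.0.
    return -(theta - 1.0) ** 2

# Each worker carries a parameter and a hyperparameter (a step size).
population = [{"theta": random.uniform(-2, 2), "lr": 0.1} for _ in range(8)]

for step in range(200):
    for w in population:
        # Local "training" step: noisy gradient ascent on the score.
        grad = -2 * (w["theta"] - 1.0)
        w["theta"] += w["lr"] * grad + random.gauss(0, 0.01)
    if step % 20 == 19:
        population.sort(key=lambda w: score(w["theta"]))
        # Exploit: the worst workers copy the best workers' state.
        # Explore: the copied hyperparameters are perturbed.
        for bad, good in zip(population[:2], population[-2:]):
            bad["theta"] = good["theta"]
            bad["lr"] = good["lr"] * random.choice([0.8, 1.25])

best = max(population, key=lambda w: score(w["theta"]))
print(round(best["theta"], 2))  # close to 1.0
```

The question in the comment is about the distribution over trained agents this copy-and-perturb dynamic induces, compared to the distribution from training a single agent.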
This is an open question. In practice it seems to work fine even at saddles that are not strict (i.e. points where the Hessian has no negative eigenvalues but there are still negative directions, which only show up at higher than second order in the Taylor series), in the sense that you can get sensible estimates and they indicate something about the way structure is developing, but the theory hasn't caught up yet.
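A minimal example of the kind of point meant here:

```latex
f(x, y) = x^2 - y^4
```

At the origin the Hessian is $\mathrm{diag}(2, 0)$, with no negative eigenvalues, yet $f$ decreases along the $y$-axis: the negative direction only appears at fourth order in the Taylor expansion.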
I think there's no such thing as parameters, just processes that produce better and better approximations to parameters, and the only "real" measures of complexity have to do with the invariants that determine the costs of those processes, which in statistical learning theory are primarily geometric (somewhat tautologically, since the process of approximation is essentially a process of probing the geometry of the governing potential near the parameter).
From that point of view, trying to conflate parameters $w_1, w_2$ with $f_{w_1} = f_{w_2}$ is naive, because $w_1, w_2$ aren't real, only processes that produce better approximations to them are real, and so the derivatives of $f$ which control such processes are deeply important, and those could be quite different despite $f_{w_1}, f_{w_2}$ being quite similar.
So I view "local geometry matters" and "the real thing are processes approximating parameters, not parameters" as basically synonymous.
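A one-parameter toy illustration of why the local geometry, not the parameter itself, is what matters: the potentials

```latex
K_1(w) = w^2 \;\Rightarrow\; \lambda = \tfrac{1}{2},
\qquad
K_2(w) = w^4 \;\Rightarrow\; \lambda = \tfrac{1}{4}
```

have the same minimiser and the same minimum value, but different orders of vanishing, hence different learning coefficients, so the processes approximating the two minima behave quite differently.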
You might reconstruct your sacred Jeffreys prior with a more refined notion of model identity, which incorporates derivatives (jets on the geometric/statistical side and more of the algorithm behind the model on the logical side).
Except nobody wants to hear about it at parties.
You seem to do OK...
If they only would take the time to explain things simply you would understand.
This is an interesting one. I field this comment quite often from undergraduates, and it's hard to carve out enough quiet space in a conversation to explain what they're doing wrong. In a way the proliferation of math on YouTube might be exacerbating this hard step from tourist to troubadour.
As a supervisor of numerous MSc and PhD students in mathematics: when someone finishes a math degree and considers a job, the tradeoffs are usually between meaning, income, freedom, evil, etc., with the obvious choices sitting high or low along fairly predictable axes. It's extremely striking to see young talented people with math or physics (or CS) backgrounds going into technical AI alignment roles in big labs, apparently maximising along many (or all) of these axes!
Especially in light of recent events I suspect that this phenomenon, which appears too good to be true, actually is.
Please develop this question as a documentary special, for lapsed-Starcraft player homeschooling dads everywhere.
Thanks for setting this up!
I don't understand the strong link between Kolmogorov complexity and generalisation you're suggesting here. I think by "generalisation" you must mean something more than "low test error". Do you mean something like "out of distribution" generalisation (whatever that means)?
Well neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).
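For contrast, the BIC and its free-energy generalisation, side by side:

```latex
\text{BIC:}\quad F_n \approx n L_n + \frac{d}{2}\log n,
\qquad
\text{free energy formula:}\quad F_n \approx n L_n + \lambda \log n
```

with $\lambda = d/2$ for regular models and $\lambda \le d/2$ in general, which is the sense in which singular models are "simpler" than their parameter count suggests.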
I think that expression of Jesse's is also correct, in context.
However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) from my other post, and the use of terms like "generalisation" in broad terms in the more expository parts (as opposed to the technical parts) arguably doesn't make enough effort to prevent them from drawing these inferences.
I have noticed people at the Berkeley meeting and elsewhere believing (ii) was somehow resolved by SLT, or just in a vague sense thinking SLT says something more than it does. While there are hard tradeoffs to make in writing expository work, I think your criticism of this aspect of the messaging around SLT on LW is fair and to the extent it misleads people it is doing a disservice to the ongoing scientific work on this important subject.
I'm often critical of the folklore-driven nature of the ML literature and what I view as its low scientific standards, and especially in the context of technical AI safety I think we need to aim higher, in both our technical and more public-facing work. So I'm grateful for the chance to have this conversation (and to anybody reading this who sees other areas where they think we're falling short, read this as an invitation to let me know, either privately or in posts like this).
I'll discuss the generalisation topic further with the authors of those posts. I don't want to pre-empt their point of view, but it seems likely we may go back and add some context on (i) vs (ii) in those posts or in comments, or we may just refer people to this post for additional context. Does that sound reasonable?
At least right now, the value proposition I see of SLT lies not in explaining the "generalisation puzzle" but in understanding phase transitions and emergent structure; that might end up circling back to say something about generalisation, eventually.
However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map).
Seems reasonable to me!