Ah! Thanks for that - it seems the general playlist organising them has splintered a bit, so here is the channel containing the lectures, the structure of which is explained here. I'll update this post accordingly.
Thanks for writing this out Joar, it is a good exercise of clarification for all of us.
Perhaps a boring comment, but I do want to push back on the title ever so slightly: imo it should be My Criticism of SLT Proponents, i.e. people (like me) who have interpreted some aspects in perhaps an erroneous fashion (according to you).
Sumio Watanabe is incredibly careful to provide highly precise mathematical statements with rigorous proofs and at no point does he make claims about the kind of "real world deep learning" phenomena being discussed here. The only sense in which it seems you critique the theory of SLT itself is that perhaps it isn't as interesting as the number of pages taken to prove its main theorems suggests it should be, but even then it seems you agree that these are non-trivial statements.
I think it is important for people coming to the field for the first time to understand that the mathematical theory is incredibly solid, whilst its interpretation and applicability to broader "real world" problems is still an open question that we are actively working on.
I would argue that the title is sufficiently ambiguous as to what is being claimed, and actually the point of contention in (ii) was discussed in the comments there too. I could have changed it to Why Neural Networks can obey Occam's Razor, but I think this obscures the main point. Regular linear regression could also obey Occam's razor (i.e. "simpler" models are possible) if you set high-order coefficients to 0, but the posterior of such models does not concentrate on those points in parameter space.
At the time of writing, basically nobody knew anything about SLT, so I think it was warranted to err on the side of grabbing attention in the introductory paragraphs and then explaining in detail further on with "we can now understand why singular models have the capacity to generalise well", instead of caveating the whole topic out of existence before the reader knows what is going on.
As we discussed at Berkeley, I do like the polynomial example you give and this whole discussion has made me think more carefully about various aspects of the story, so thanks for that. My inclination is that the polynomial example is actually quite pathological and that there is a reasonable correlation between the RLCT and Kolmogorov complexity in practice (e.g. the one-node subnetwork preferred by the posterior compared to the two-node network in DSLT4), but I don't know enough about Kolmogorov complexity to say much more than that.
Good question! The proof of the exact symmetries of this setup, i.e. the precise form of W_0, is highly dependent on the ReLU. However, the general phenomenon I am discussing is applicable well beyond ReLU to other non-linearities. I think there are two main components to this:
- Other non-linearities induce singular models. As you note, other non-linear activation functions do lead to singular models. @mfar did some great work on this for tanh networks. Even though the activation function is important, note that the better intuition to have is that the hierarchical nature of a model (e.g. neural networks) is what makes them singular. Deep linear networks are still singular despite an identity activation function. Think of the activation as giving the model more expressiveness.
- Even if W_0 is uninteresting, the loss landscape might be "nearly singular". The ReLU has an analytic approximation, the Swish function swish_β(x) = x σ(βx), where σ(x) = 1/(1 + e^(-x)) is the sigmoid, which does not yield the same symmetries as discussed in this post. This is because the activation boundaries are no longer a sensible thing to study (the Swish function is "always active" on all subsets of the input domain), which breaks down a lot of the analysis used here.
Suppose, however, that we take a β that is so large that, from the point of view of your computer, swish_β(x) = ReLU(x) (i.e. their difference is within machine epsilon). Even though swish_β is now a very different object to ReLU on paper, the loss landscape will be approximately equal, K_swish(w) ≈ K_ReLU(w), meaning that the Bayesian posterior will be practically identical between the two functions and induce the same training dynamics.
So, whilst the precise functional equivalences might be very different across activation functions (differing W_0), there might be many approximate functional equivalences. This is also the sense in which we can wave our arms about "well, SLT only applies to analytic functions, and ReLU isn't analytic, but who cares". Making precise mathematical statements about this "nearly singular" phenomenon - for example, how does the posterior change as you lower β in swish_β? - is under-explored at present (to the best of my knowledge), but it is certainly not something that discredits SLT, for all of the reasons I have just explained.
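To make the machine-epsilon point concrete, here is a minimal numerical sketch (the helper names and the grid are my own choices, and I'm using the standard swish_β(x) = x σ(βx) parameterisation); it just measures how quickly the Swish collapses onto the ReLU as β grows:

```python
import numpy as np

def sigmoid(z):
    # numerically stable logistic function
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def relu(x):
    return np.maximum(0.0, x)

def swish(x, beta):
    # swish_beta(x) = x * sigmoid(beta * x); approaches ReLU pointwise as beta grows
    return x * sigmoid(beta * x)

x = np.linspace(-5.0, 5.0, 10_001)
for beta in [1, 10, 100, 1_000, 10_000]:
    gap = np.max(np.abs(swish(x, beta) - relu(x)))
    print(f"beta = {beta:6d}: max |swish - relu| on the grid = {gap:.2e}")
```

The gap decays roughly like 1/β, so for large enough β the two activation functions define loss landscapes your computer cannot tell apart, even though one is analytic and the other is not.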
Edit: Originally the sequence was going to contain a post about SLT for Alignment, but this can now be found here instead, where a new research agenda, Developmental Interpretability, is introduced. I have also now included references to the lectures from the recent SLT for Alignment Workshop in June 2023.
Only in the illegal ways, unfortunately. Perhaps your university has access?
"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?
This is a great question and something that came up at the recent summit. We would definitely say that the model is in two different phases before and after grokking (i.e. when the test error is flat), but it's an interesting question to consider what's going on over that long period of time where the error is slowly decreasing. I imagine that it is a relatively large model (from an SLT point of view, which means not very large at all from normal ML pov), meaning there would be a plethora of different singularities in the loss landscape. My best guess is that it is undergoing many phase transitions across that entire period, where it is finding regions of lower and lower RLCT but equal accuracy. I expect there to be some work done in the next few months applying SLT to the grokking work.
As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?
This is a very interesting point. I broadly agree with this and think it is worth thinking more about, and could be a very useful simplifying assumption in considering the connection between SGD and SLT.
In general, it seems to me that we're probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?
Broadly speaking, yes. With that said, hyperparameters in the model are probably interesting too (although maybe more from a capabilities standpoint). I think phase transitions in the truth are also probably interesting in the sense of dataset bias, i.e. what changes about a model's behaviour when we include or exclude certain data? Worth noting here that the Toy Models of Superposition work explicitly deals in phase transitions in the truth, so there's definitely a lot of value to be had from studying how variations in the truth induce phase transitions, and what these ramifications are in other things we care about.
Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?
At a first pass, one might say that second-order phase transitions correspond to something like the formation of circuits. I think there are definitely reasons to believe both happen during training.
Which altered the posterior geometry, but not that of since (up to a normalisation factor).
I didn't understand this footnote.
I just mean that K(w) itself is not affected by that change (even though of course the posterior is), so the phase transition merely concerns the posterior and not the loss landscape.
but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.
Hmm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the "true" vector or not.
My use of the word "symmetry" here is probably a bit confusing and a hangover from my thesis. What I mean is, these two configurations are only in the set of true parameters in this setup when the truth is configured in a particular way. In other words, they are always local minima of K(w), but not always global minima (this is what PT1 shows). Thanks for pointing this out.
It seems to me that in the other phase, the weights also annihilate each other, so the "non-weight annihilation phase" is a somewhat weird terminology. Or did I miss something?
Huh, I'd never really thought of this, but I now agree it is slightly weird terminology in some sense. I probably should have called them the weight-cancellation and non-weight-cancellation phases as I described in the reply to your DSLT3 comment. My bad. I think it's a bit too late to change now, though.
I think there is a typo and you meant .
Thanks! And thanks for reading all of the posts so thoroughly and helping clarify a few sloppy pieces of terminology and notation, I really appreciate it.
However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.
The definition of the Fisher information matrix does not refer to the truth whatsoever. (Note that in the definition I provide I am assuming the supervised learning case where we know the input distribution q(x), meaning the model is p(y, x|w) = p(y|x, w)q(x), which is why the q(x) shows up in the formula I just linked to. The derivative terms do not explicitly include q(x) because it just vanishes in the derivative anyway, so it's irrelevant there. But remember, we are ultimately interested in modelling the conditional true distribution q(y|x).)
What do you mean with non-weight-annihilation here? Don't the weights annihilate in both pictures?
You're right, that's sloppy terminology from me. What I mean is, in the right hand picture (that I originally labelled WA), there is a region in which all nodes are active, but cancel out to give zero effective gradient, which is markedly different to the left hand picture. I have edited this to NonWC and WC instead to clarify, thanks!
Now, for the KL-divergence, the situation seems more extreme: the zeros are also, at the same time, the minima of K(w), and thus the derivative vanishes at every point in the set W_0. This suggests every point in W_0 is singular. Is this correct?
Correct! So, the point is that things get interesting when W_0 is more than just a single point (a single point being the regular case). In essence, singularities are local minima of K(w). In the non-realisable case this means they are zeroes of the minimum-loss level set. In fact we can abuse notation a bit and really just refer to any local minimum of K(w) as a singularity. The TLDR of this is: a singularity is a point where K(w) attains a (local) minimum and its gradient vanishes.
So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
As I show in the examples in DSLT1, having degenerate Fisher information (i.e. a degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having a vanishing second derivative in some direction. Precisely, suppose d is the number of parameters; then you are in the regular case if K(w) can be expressed as a full-rank quadratic form near each singularity,
$$K(w) = w_1^2 + w_2^2 + \dots + w_d^2$$
(in suitable local coordinates centred at the singularity). Anything less than this is a strictly singular case.
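To illustrate the two flavours with toy potentials of my own (not the specific examples from DSLT1), take d = 2 with the singularity at the origin:
$$K_1(w_1, w_2) = w_1^2, \qquad K_2(w_1, w_2) = w_1^4 + w_2^4.$$
The Hessian of K_1 at the origin is rank-deficient (the w_2 direction is completely flat), while the Hessian of K_2 vanishes entirely even though K_2 genuinely depends on both coordinates. Neither is a full-rank quadratic form, so both are strictly singular.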
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as Jeffrey's prior somewhat changes the asymptotic behaviour, but I'm not certain of that.
Watanabe has an interesting little section in the grey book [Remark 7.4, Theorem 7.4, Wat09] talking about the Jeffrey's prior. I haven't studied it in detail but to the best of my reading he is basically saying "from the point of view of SLT, the Jeffrey's prior is zero at singularities anyway, its coordinate-free nature makes it inappropriate for statistical learning, and the RLCT can only be if the Jeffrey's prior is employed." (The last statement is the content of the theorem where he studies the poles of the zeta function when the Jeffrey's prior is employed).
Can you tell me more about why it is a measure of posterior concentration?
...
Are you claiming that most of that work happens very localized in a small parameter region?
Given a small neighbourhood W of parameter space, the local free energy is
$$F_n(W) = -\log \int_W e^{-nL_n(w)}\varphi(w)\,dw,$$
and it measures the posterior concentration in W since
$$e^{-F_n(W)} = \int_W e^{-nL_n(w)}\varphi(w)\,dw = Z_n \int_W p(w \mid D_n)\,dw,$$
where the inner term is the posterior, modulo its normalisation constant Z_n. The key here is that if we are comparing different regions of parameter space, then the free energy doesn't care about that normalisation constant as it is just a shift in F_n(W) by a constant. So the free energy gives you a tool for comparing different regions of the posterior. (To make this comparison rigorous, I suppose one would want to make sure that these regions are the same "size". Another perspective, and really the main SLT perspective, is that if they are sufficiently small and localised around different singularities then this size problem isn't really relevant, and the free energy is telling you something about the structure of the singularity and the local geometry of K(w) around the singularity.)
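If it helps, here is a rough numerical sketch of this comparison (the toy potential, the flat prior, and the choice of regions are all mine): two zero-loss phases, one locally quadratic and one locally quartic, and the local free energy increasingly prefers the flatter quartic one as n grows:

```python
import numpy as np

# Toy 1D loss with two zero-loss minima: a quadratic well at w = -2
# (locally ~ 256 (w+2)^2) and a flatter quartic well at w = +2 (locally ~ 16 (w-2)^4).
def K(w):
    return (w + 2.0) ** 2 * (w - 2.0) ** 4

def local_free_energy(n, lo, hi, num=200_001):
    # F_n(W) = -log \int_W exp(-n K(w)) dw, with a flat prior on W (simple Riemann sum)
    w = np.linspace(lo, hi, num)
    dw = w[1] - w[0]
    return -np.log(np.sum(np.exp(-n * K(w))) * dw)

for n in [10, 100, 1_000, 10_000]:
    print(f"n = {n:6d}:  F_n(quadratic well) = {local_free_energy(n, -3, -1):7.3f}"
          f"   F_n(quartic well) = {local_free_energy(n, 1, 3):7.3f}")
```

The gap between the two grows like (1/2 - 1/4) log n, which is exactly the λ log n term doing the work.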
I am not quite sure what you mean with "it tells us something about the information geometry of the posterior"
This is sloppily written by me, apologies. I merely mean to say "the free energy tells us what models the posterior likes".
I didn't find a definition of the left expression.
I mean, the relation between the free energy and the generalisation error tells you that this is a sensible thing to write down, and if you reconstructed the left side from the right side you would simply find some definition in terms of the predictive distribution restricted to W (instead of the whole parameter space in the integral).
Purposefully naive question: can I just choose a region that contains all singularities? Then it surely wins, but this doesn't help us because this region can be very large.
Yes - and as you say, this would be very uninteresting (and in general you wouldn't know what to pick necessarily [although we did in the phase transition in DSLT4 because of the classification of W_0 in DSLT3]). The point is that at no point are you just magically "choosing" a W anyway. If you really want to calculate the free energy of some model setup then you would have a reason to choose different phases to analyse. Otherwise, the premise of this section of the post is to show that the geometry depends on the singularity structure, and this varies across parameter space.
Possible correction: one of those points isn't a singularity, but a regular loss-minimizing point (as you also clarify further below).
As discussed in the comment in your DSLT1 question, they are both singularities of K(w) since they are both critical points (local minima). But they are not both true parameters, nor are they both regular points with RLCT d/2.
How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?
I think sensible choices of priors have an interesting and a not-interesting angle to them. The interesting answer might involve something along the lines of reformulating the Jeffrey's prior, as well as noticing that a Gaussian prior gives you a "regularisation" term (and can be thought of as adding the "simple harmonic oscillator" part to the story). The uninteresting answer is that SLT doesn't care about the prior (other than its regularity conditions) since it is irrelevant in the n → ∞ limit. Also, if you were concerned with the requirement for the parameter space W to be compact, you can just define it to be compact on the space of "numbers that my computer can deal with".
Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn't "manage to get out of the right valley", I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?
Yes! We are thinking very much about this at the moment and I think this is the correct intuition to have. If one runs SGD on these potential wells, you find that it just gets stuck in the basin it was closest to. So, what's going on in high dimensions? It seems something about the way higher-dimensional spaces are different from lower ones is relevant here, but it's very much an open problem.
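For what it's worth, here is the kind of quick sketch I have in mind (a 1D double-well of my own choosing rather than the exact example from the post): plain noisy gradient descent just rolls into whichever basin it started in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Double-well loss with zero-loss minima at w = -2 and w = +2.
def K(w):
    return (w + 2.0) ** 2 * (w - 2.0) ** 4

def grad_K(w):
    return 2.0 * (w + 2.0) * (w - 2.0) ** 4 + 4.0 * (w + 2.0) ** 2 * (w - 2.0) ** 3

def noisy_gd(w0, lr=1e-3, steps=20_000, noise_scale=0.01):
    w = w0
    for _ in range(steps):
        w -= lr * grad_K(w) + noise_scale * np.sqrt(lr) * rng.standard_normal()
    return w

for w0 in [-3.0, -1.0, 1.0, 3.0]:
    print(f"initialised at {w0:+.1f} -> converged to {noisy_gd(w0):+.3f}")
```

Each run finishes at the minimum on its own side of the barrier; with small noise it essentially never hops. Whether the high-dimensional story is qualitatively different is exactly the open question.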
Thanks for the comment Leon! Indeed, in writing a post like this, there are always tradeoffs in which pieces of technicality to dive into and which to leave sufficiently vague so as to not distract from the main points. But these are all absolutely fair questions so I will do my best to answer them (and make some clarifying edits to the post, too). In general I would refer you to my thesis where the setup is more rigorously explained.
Should I think of this as being equal to , and would you call this quantity ? I was a bit confused since it seems like we're not interested in the data likelihood, but only the conditional data likelihood under model .
The partition function Z_n is equal to the model evidence p(D_n), yep. It isn't equal to the quantity you wrote (I assume the inputs are taken as fixed there?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the "normalising constant" of the posterior),
$$Z_n = \int_W \varphi(w)\prod_{i=1}^n p(y_i, x_i \mid w)\,dw,$$
and then under this supervised learning setup where we know q(x), we have p(y, x|w) = p(y|x, w)q(x). Also note that this does "factor over the data" (if I'm interpreting you correctly) since the data is independent and identically distributed.
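Spelling the factoring out explicitly (same notation as above), since the inputs only enter through the known q(x),
$$Z_n = \int_W \varphi(w)\prod_{i=1}^n p(y_i, x_i \mid w)\,dw = \Big(\prod_{i=1}^n q(x_i)\Big)\int_W \varphi(w)\prod_{i=1}^n p(y_i \mid x_i, w)\,dw,$$
and the ∏ q(x_i) factor is a constant in w: it cancels in the posterior and merely shifts the free energy by a constant.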
But the free energy does not depend on the parameter, so how should I interpret this claim? Are you already one step ahead and thinking about the singular case where the loss landscape decomposes into different "phases" with their own free energy?
Yep, you caught me - I was one step ahead. The free energy over the whole parameter space is still a very useful quantity, as it tells you "how good" the best model in the model class is. But by itself it doesn't tell you much about what else is going on in the loss landscape. For that, you need to localise to smaller regions and analyse their phase structure, as presented in DSLT2.
I think the first expression should either be an expectation over the inputs, or have the conditional entropy within the parentheses.
Ah, yes, you are right - this is a notational hangover from my thesis, where I defined the expectation to be taken with respect to the true distribution q(y|x)q(x). (Things get a little bit sloppy when you have this known q(x) floating around everywhere - you eventually just make a few calls on how to write the cleanest notation, but I agree that in the context of this post it's a little confusing, so I apologise.)
I briefly tried showing this and somehow failed. I didn't quite manage to get rid of the integral over . Is this simple? (You don't need to show me how it's done, but maybe mentioning the key idea could be useful)
See Lemma A.2 in my thesis. One uses a fairly standard argument involving the first central moment of a Gaussian.
The rest of the article seems to mainly focus on the case of the Fisher information matrix. In particular, you didn't show an example of a non-regular model where the Fisher information matrix is positive definite everywhere.
Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren't that interesting, and so you maybe don't even want to call them singular?
Yep, the rest of the article does focus on the case where the Fisher information matrix is degenerate because it is far more interesting and gives rise to an interesting singularity structure (i.e. most of the time it will yield an RLCT λ < d/2). Unless my topology is horrendously mistaken, if one has a singular model class for which every parameter has a positive definite Fisher information, then the non-identifiability condition simply means you have a set of isolated points that all have the same RLCT d/2. Thus the free energy will only depend on their inaccuracy, meaning every optimal parameter has the same free energy - not particularly interesting! An example of this would be something like the permutation symmetry of ReLU neural networks that I discuss in DSLT3.
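To sketch why (schematically, using the local free energy expansion presented in DSLT2 and suppressing lower-order terms): for a phase W_α containing an optimal parameter w_α,
$$F_n(W_\alpha) \approx n L_n(w_\alpha) + \lambda_\alpha \log n,$$
so if every optimal parameter is an isolated regular point, then λ_α = d/2 for all of them, the complexity terms agree, and only the accuracy term n L_n(w_α) can distinguish the phases.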
I found this slightly ambiguous, also because under your definitions further down, it seems like "singular" (degenerate Fisher information matrix) is a stronger condition than "strictly singular" (degenerate Fisher information matrix OR non-injective map from parameters to distributions).
I have clarified the terminology in the section where they are defined - thanks for picking me up on that. In particular, a singular model class can be either strictly singular or regular - Watanabe’s results hold regardless of identifiability or the degeneracy of the Fisher information. (Sometimes I might accidentally use the word "singular" to emphasise a model which "has non-regular points" - the context should make it relatively clear).
What is x in this formula? Is it fixed? Or do we average the derivatives over the input distribution?
Refer to Theorem 3.1 and Lemma 3.2 in my thesis. The Fisher information involves an integral with respect to q(x)dx, so the Fisher information is degenerate iff that set of derivatives is linearly dependent as a function of x - in other words, for all values of x in the domain specified by q(x) (well, more precisely, for all non-measure-zero regions as specified by q(x)).
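Concretely, in the regression-with-Gaussian-noise setting used throughout the sequence (I'm assuming unit noise variance here for simplicity), the Fisher information at w reads
$$I(w)_{jk} = \int \frac{\partial f(x, w)}{\partial w_j}\,\frac{\partial f(x, w)}{\partial w_k}\; q(x)\,dx,$$
so det I(w) = 0 precisely when the functions {∂f(x, w)/∂w_j} for j = 1, ..., d are linearly dependent as functions of x on the support of q(x).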
Hmm, I thought having a singular model just means that some singularities are degenerate.
Typo - thanks for that.
One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse's post, they often "look singular": i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn't seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change. Is this correct?
Correct! When we use the word "singularity", we are specifically referring to singularities of K(w) in the sense of algebraic geometry, so they are zeroes (or zeroes of a level set), and critical points with vanishing gradient. So, even in regular models, the single optimal parameter is a singularity of K(w) - it's just a really, really uninteresting one. In SLT, every singularity needs to be put into normal crossing form via the resolution of singularities, regardless of whether it is a singularity in the sense that you describe (drawing self-intersecting curves, looking at cusps, etc.). But for cartoon purposes, those sorts of curves are good visualisation tools.
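A toy example of my own that might make this concrete: take
$$K(w_1, w_2) = (w_1 + w_2 - 1)^2, \qquad W_0 = \{(w_1, w_2) : w_1 + w_2 = 1\}.$$
Every point of this perfectly smooth line is a zero of K with vanishing gradient, and the Hessian
$$\nabla^2 K = \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix}$$
has rank 1 < 2 everywhere along it, so the whole line consists of degenerate critical points even though nothing "crosses" or looks geometrically singular in the drawn-curve sense.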
I think you forgot a in the term of degree 1.
Typo - thanks.
Could you explain why that is? I may have missed some assumption or not paid attention to something.
If you expand that term out you find that it vanishes, because the second integral is the first central moment of a Gaussian (which is zero). The derivative of the prior is irrelevant.
Hmm. Is the claim that if the loss of the function does not change along some curve in the parameter space, then the function itself remains invariant? Why is that?
This is a fair question. Concerning the zeroes: by the formula for K(w) when the truth is realisable, one shows that
$$K(w) = \frac{1}{2}\int \big(f(x, w) - f(x, w_0)\big)^2\, q(x)\,dx,$$
so any path in the set of true parameters (i.e. in this case the set where K(w) = 0) will indeed produce the same input-output function. In general (away from the zeroes of K(w)), I don't think this is necessarily true, but I'd have to think a bit harder about it. In this pathological case it is, but I wouldn't get bogged down in it - I'm just saying "this degeneracy tells us one parameter can literally be thrown out without changing anything about the model". (Note here that the redundant parameter is literally a free parameter across its whole domain.)
Are you sure this is the correct formula? When I tried computing this by hand it resulted in , but maybe I made a mistake.
Ah! Another typo - thank you very much. It should be
General unrelated question: is the following a good intuition for the correspondence of the volume with the effective number of parameters around a singularity? The larger the number of effective parameters around w_0, the more K(w) blows up around w_0 in all directions because we get variation in all directions, and so the smaller the region where K(w) is below ε. So λ contributes to this volume. This is in fact what it does in the formulas, by being an exponent of ε for small ε.
I think that's a very reasonable intuition to have, yep! Moreover, if one wants to compare the "flatness" of two such potentials, the point is that within a small neighbourhood of the singularity, the exponent (i.e. the RLCT) is what governs flatness: a higher exponent is "much flatter" than a lower one, no matter the coefficients out the front. This is what the RLCT is picking up.
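A little numerical sanity check of that intuition (1D toy potentials of my own, flat measure on [-1, 1]): estimate the exponent λ in V(ε) ∝ ε^λ by comparing volumes at two small values of ε:

```python
import numpy as np

W = np.linspace(-1.0, 1.0, 2_000_001)  # fine grid on the parameter interval [-1, 1]

def volume(K, eps):
    # volume of the sublevel set {w : K(w) < eps}, estimated on the grid
    return np.mean(K(W) < eps) * 2.0

for name, K in [("K(w) = w^2", lambda w: w ** 2), ("K(w) = w^4", lambda w: w ** 4)]:
    eps1, eps2 = 1e-4, 1e-6
    lam = np.log(volume(K, eps1) / volume(K, eps2)) / np.log(eps1 / eps2)
    print(f"{name}:  fitted exponent ~ {lam:.3f}")
```

This recovers roughly 1/2 and 1/4, i.e. the RLCTs of the two potentials: the quartic singularity leaves far more low-loss volume nearby, which is exactly the "flatness" being measured.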
Do you currently expect that gradient descent will do something similar, where the parameters will move toward singularities with low RLCT? What's the state of the theory regarding this?
We do expect that SGD is roughly equivalent to sampling from the Bayesian posterior and therefore that it moves towards regions of low RLCT, yes! But this is nonetheless just a postulate for the moment. If one treats the loss landscape as a Hamiltonian energy function, then you can apply a full-throated physics lens to this entire setup (see DSLT4) and see that the critical points of K(w) strongly affect the trajectories of the particles. Then the connection between SGD and SLT is really just the extent to which SGD is "acting like a particle subject to a Hamiltonian potential". (A variant called SGLD seems to be just that, so maybe the question is under what conditions / to what extent does SGD = SGLD?). Running experiments that test whether variants of SGD end up in low RLCT regions of parameter space is definitely a fruitful path forward.
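For reference, the SGLD update I have in mind is the Welling-Teh style Langevin step; a minimal sketch (names and interface are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(w, grad_log_prior, grad_log_lik_minibatch, n, m, eps):
    """One SGLD step on the log-posterior (Welling-Teh style).

    grad_log_prior(w) returns the gradient of log phi(w); grad_log_lik_minibatch(w)
    returns the summed gradient of log p(y_i | x_i, w) over a minibatch of size m,
    which, rescaled by n/m, estimates the full-data gradient.
    """
    drift = 0.5 * eps * (grad_log_prior(w) + (n / m) * grad_log_lik_minibatch(w))
    noise = np.sqrt(eps) * rng.standard_normal(np.shape(w))
    return w + drift + noise
```

The drift term is gradient ascent on the log-posterior and the injected Gaussian noise has matching variance, which is what makes its stationary distribution approach the Bayesian posterior in the small-step limit - hence "SGD = SGLD?" being the interesting question.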
Thanks for writing that, I look forward to reading.
As for nomenclature, I did not define it - the sequence is called Distilling SLT, and this is the definition offered by Watanabe. But to add some weight to it, the point is that in the Bayesian setting, the predictive distribution is a reasonable object to study from the point of view of generalisation, because it says: "what is the probability of this output given this input and given the data that the posterior is conditioned on". The Bayes training loss (which I haven't delved into in this post) is the empirical counterpart to the Bayes generalisation loss,
$$T_n = -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i \mid x_i, D_n),$$
and so it adds up the "entropy" of the predictive distribution over the training datapoints - if the predictive distribution is certain about all training datapoints, then the training loss will be 0. (Admittedly, this is an object that I am still getting my head around.)
The Bayes generalisation loss satisfies
$$G_n = \mathbb{E}_{q(y|x)q(x)}\big[-\log p(y \mid x, D_n)\big],$$
and therefore it averages this training loss over the whole space of inputs and outputs. This is the sense in which it is reasonable to call it "generalisation". As I say in the post, there are other ways you can think of generalisation in the Bayesian setting, like leave-one-out cross-validation, or other methods of extracting data from the posterior like Gibbs sampling. Watanabe shows (see pg. 236 of [Wat18]) that the RLCT is a central object in all of these alternative conceptions.
As for the last point, neural networks do not "find" anything themselves - either an optimisation method like SGD does, or the Bayesian posterior "finds" regions of high posterior concentration (i.e. it contains this information). SLT tells us that the Bayesian posterior of singular models does "find" low RLCT points (as long as they are sufficiently accurate). Neural networks are singular models (as I will explain in DSLT3), so the posterior of neural networks "finds" low RLCT points.
Does this mean SGD does? We don't know yet! And whether your intuition says this is a fruitful line of research to investigate is completely personal to your own mental model of the world, I suppose.
With all due respect, I think you are misrepresenting what I am saying here. The sentence after your quote ends is
its relation to SGD dynamics is certainly an open question.
What is proven by Watanabe is that the Bayesian generalisation error, as I described in detail in the post, strongly depends on the singularity structure of the minima of K(w), as measured by the RLCT λ. This fact is proven in [Wat13] and explained in more detail in [Wat18]. As I elaborate on in the post, translating this statement into the SGD / frequentist setting is an interesting and important open problem, if it can be done at all.
What is your understanding? It is indeed a deep mathematical theory, but it is not convoluted. Watanabe proves the FEF, and shows the RLCT is the natural generalisation of complexity in this setting. There is a long history of deep/complicated mathematics, with natural (and beautiful) theorems at the core, being pivotal to describing real world phenomena.
The point of the posts is not to argue that we can prove why particular architectures perform better than others (yet). This field has had, comparatively, very little work done to it yet within AI research, and these sorts of facts are where SLT might take us (modulo AI capabilities concerns). The point is to demonstrate the key insights of the theory and signpost the fact that “hey, there might be something very meaningful here.” What we can predict with the theory is why certain phase transitions happen, in particular the two layer feedforward ReLU nets I will show in DSLT4. This is a seed from which to generalise to deeper nets and more intricate architectures - the natural way of doing good mathematics.
As to the “singularity” problem, you will have to take that up with the algebraic geometers who have been studying singularities for over 50 years. The fact is that optimal parameters are singularities of K(w) in non-trivial neural networks - hence, singular learning theory.
What do “convergence properties” and “generalisation ability“ mean to you, precisely? It is an indisputable fact that Watanabe proves (and I elaborate on in the post) that this singularity structure plays a central role in Bayesian generalisation. As I say, its relation to SGD dynamics is certainly an open question. But in the Bayesian setting, the case is really quite closed.
The way I've structured the sequence means these points are interspersed throughout the broader narrative, but it's a great question so I'll provide a brief summary here, and as they are released I will link to the relevant sections in this comment.
- In regular model classes, the set of true parameters that minimise the loss is a single point. In singular model classes, it can be significantly more than a single point. Generally, it is a higher-dimensional structure. See here in DSLT1.
- In regular model classes, K(w) can be approximated by a quadratic form near every point of W_0. In singular model classes, this is not true. Thus, asymptotic normality of regular models (i.e. every regular posterior is just a Gaussian as n → ∞) breaks down. See here in DSLT1, or here in DSLT2.
- Relatedly, the Bayesian Information Criterion (BIC) doesn't hold in singular models, because of a similar problem of non-quadraticness. Watanabe generalises the BIC to singular models with the WBIC, which shows complexity is measured by the RLCT λ, not the dimension of parameter space d. The RLCT satisfies λ ≤ d/2 in general, with equality when the model is regular. See here in DSLT2, and the schematic comparison sketched just after this list.
- In regular model classes, every parameter has the same complexity d/2. In singular model classes, different parameters have different complexities according to their RLCT. See here in DSLT2.
- If you have a fixed model class, the BIC can only be minimised by optimising the accuracy (see here). But in a singular model class, the WBIC can be minimised according to an accuracy-complexity tradeoff. So "simpler" models exist on singular loss landscapes, but every model is equally complex in regular models. See here in DSLT2.
- With this latter point in mind, phase transitions are anticipated in singular models because the free energy is comprised of accuracy and complexity, which is different across parameter space. In regular models, since complexity is fixed, phase transitions are far less natural or interesting, in general.
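Schematically (suppressing constants and lower-order terms), the comparison behind the last few points is
$$\mathrm{BIC} = n L_n(w_0) + \frac{d}{2}\log n \qquad\text{versus}\qquad F_n \approx n L_n(w_0) + \lambda \log n,$$
with λ ≤ d/2 and equality in the regular case - so only in a singular model class can a region of parameter space trade a little accuracy for a smaller complexity term.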
If a model is singular, then Watanabe’s Free Energy Formula (FEF) can have big implications for the geometry of the loss landscape. Whether or not a particular neural network model is singular does indeed depend on its activation function, amongst other structures in its architecture.
In DSLT3 I will outline the ways simple two layer feedforward ReLU neural networks are singular models (i.e. I will show the symmetries in parameter space that produce the same input-output function), which is generalisable to deeper feedforward ReLU networks. There I will also discuss similar results for tanh networks, alluding to the fact that there are many (but not all) activation functions that produce these symmetries, thus making neural networks with those activation functions singular models, thus meaning the content and interpretation of Watanabe's free energy formula is applicable.