There are ~two broad ways of thinking about RG. 'HEP-like' is structural; it helps roughly map out the space of theories at different scales/couplings, even if it does not capture all of physics at all times (for example, the standard model is a good description up to some energy scale, but each sector is often treated independently). Different aspects of training and inference seem to take the shape of renormalization, although, as we pointed out, there is a lot of work to be done to understand the various scales and couplings inherent to NNs. A goal of this opportunity space is not to make RG 'go backwards', but to correctly map this renormalization picture onto a 'space of representations' of a NN theory. I don't expect this to be easy or simple, but I'm hopeful that it will shed more theoretical light on different training and inference regimes and point to insightful conceptual gaps.
In contrast, interpreting NNs to a certain degree of (human-specified) abstraction is more CMT-like. I suspect you are focusing on this perspective overall. Again, as we pointed out, this formulation of RG is not invertible; there are likely many ways to coarse-grain a representation (just as there are many RG schemes in condensed matter), but that doesn't mean you can't reframe something like a sparse autoencoder as doing RG.
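To make that concrete, here's a minimal toy sketch (my own illustration, not something from the post) of a top-k sparse autoencoder read as a coarse-graining map: keeping only the k largest latent activations 'integrates out' the remaining detail, and the reconstruction error tells you how much fine-grained structure was thrown away. The class name, dimensions, and the top-k scheme are all assumptions for illustration.

```python
# Toy sketch (hypothetical): a top-k sparse autoencoder viewed as a coarse-graining map.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.encoder(x))
        # Coarse-grain: zero out all but the k most active latent features per example.
        topk = torch.topk(z, self.k, dim=-1)
        z_coarse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(z_coarse)

# Reconstruction error ~ how much "fine-grained" structure the coarse-graining discarded.
x = torch.randn(8, 64)
sae = TopKSAE(d_model=64, d_latent=256, k=16)
recon_error = torch.mean((sae(x) - x) ** 2)
```

Different sparsity schemes (an L1 penalty, a different k) would then correspond to different coarse-graining schemes, which is part of why I don't expect a unique way to run this 'backwards'.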
Sure, there are 'lots of phase transitions', but that doesn't mean it's pointless to try to classify or describe some interesting ones (like memorization -> generalization). Similarly, just because lots of interesting physics happens far from phase transitions doesn't mean they aren't useful 'tethers' in the space of theories (phi^4 theory has been pretty useful).
Regarding attention, I agree that a theory of representations here would be non-local with respect to inputs. That's fine. Long-range dependencies just change the flavor of RG. A lot depends on how we measure locality, since there's no a priori natural way to do this in NNs (although there are several options that likely pick up on different relationships between features).
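As one illustration of what a 'measure of locality' could look like (a toy sketch of my own, not a proposal from the post): define a distance between features from their activation statistics, here 1 - |correlation| over a batch. The function name and the choice of correlation are assumptions; mutual information, attention weights, or causal ablations would each induce a different locality structure and likely pick up on different relationships.

```python
# Toy sketch: one possible "distance" between features, from activation correlations.
import numpy as np

def feature_distance(acts: np.ndarray) -> np.ndarray:
    """acts: (n_samples, n_features) -> (n_features, n_features) distance matrix."""
    corr = np.corrcoef(acts, rowvar=False)  # feature-feature correlations
    return 1.0 - np.abs(corr)               # strongly correlated features count as "close"

acts = np.random.randn(1000, 32)            # stand-in for recorded activations
dist = feature_distance(acts)
```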
I think @Dmitry Vaintrob has more context on this, but I'd be really interested in exploring this idea more. I'm not sure we'd be able to enumerate all possible ways to renormalize an interpretation, but I agree that having a portfolio of RG-inspired lenses could help. Are you imagining that each agent would run the lenses over information from other agents and aggregate them in some way (how?)?
Thanks for the comment! The way I think about it, there are several different 'energy' analogues in NNs, and each is likely tied to a different framing of 'UV' and 'IR'.
For example, during training you start with a simplified (IR-like) description of the dataset that flows to a richer representation, adding finer-grained structure capable of generalizing (UV).
During inference, I agree that you can describe this process as UV -> IR: each layer gives a progressively coarser representation as the features that are irrelevant for a certain task (like classification) are 'integrated out' to yield a usefully abstract simplification. However, you can also think of inference in terms of 'feature refinement', where each layer becomes progressively more structured, able to pick up on finer or more abstract details. This ultimately depends on how you think of 'scale' along the RG flow.
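To make the ambiguity concrete, here's a toy sketch (mine, with made-up names and random stand-in activations) that tracks one crude 'scale' proxy across depth: the participation ratio (effective dimension) of each layer's activation covariance. If it shrinks with depth, inference looks like coarse-graining (UV -> IR); if it grows, it looks more like refinement. Which you see depends on the model and on the proxy you pick, which is exactly the point about what 'scale' means along the flow.

```python
# Toy sketch: a crude per-layer "scale" proxy, the participation ratio of activations.
import numpy as np

def participation_ratio(acts: np.ndarray) -> float:
    """acts: (n_samples, n_features). Effective dimension of the activation covariance."""
    cov = np.cov(acts, rowvar=False)
    eig = np.linalg.eigvalsh(cov)
    return float(eig.sum() ** 2 / (eig ** 2).sum())

# In practice layer_acts would come from forward hooks on a real model; random here.
layer_acts = [np.random.randn(500, 128) for _ in range(6)]
print([round(participation_ratio(a), 1) for a in layer_acts])
```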
Thanks for the recommendation! The pathways of scientific progress here seem very interesting (for example: physics -> neuro -> AI -> ... vs. physics -> AI -> neuro -> ...), particularly if we think about feeding back between experimental and theoretical support to build up a general understanding. Physics is really good at fitting theories together in a mosaic -- at a large scale you have a nice picture of the universe, and the tiles (theories) fit together without being part of the same continuous picture, allowing for some separation between different regimes of validity. It's not a perfect analogy, but it says something about physics' ability to split the difference between reductionism and emergence. It would be nice to have a similar picture in neuroscience (and AI), though this might be more difficult.
For me, this question of the relevant scale(s) is the main point of introducing this work. d/w (the depth-to-width ratio) is one example of a cutoff, and one associated with the data distribution is another, but more work needs to be done to understand how to relate potentially different theoretical descriptions (for example, how these two cutoffs work together). We also mention the 'lattice as regulator' as a natural cutoff for physical systems, and hope to find similarly natural scales in real-world AI systems.
here
For example: https://www.lesswrong.com/posts/EhTMM77iKBTBxBKRe/the-laws-of-large-numbers
Thanks for the comment! I do hope that the thoughts expressed here can inspire some action, but I'm not sure I understand your questions. Do you mean 'centralized', or are you thinking about the conditions necessary for many small-scale trading zones?
In this way, I guess the emergence of big science could be seen as a phase transition from decentralization -> centralization.
Thanks for your comment! It got me thinking a bit more about big science in general, and led to this post. I'd be curious to hear your thoughts.