Comments
IIRC @jake_mendel and @Kaarel have thought about this more, but my rough recollection is: a simple story about the regularization seems sufficient to explain the training dynamics, so a fancier SLT story isn't obviously necessary. My guess is that there's probably something interesting you could say using SLT, but nothing that simpler arguments about the regularization wouldn't tell you also. But I haven't thought about this enough.
Good catch, thanks! Fixed now.
It's worth noting that Jesse is mostly following the traditional "approximation, generalization, optimization" error decomposition from learning theory here - where "generalization" specifically refers to finite-sample generalization (gap between train/test loss), rather than something like OOD generalization. So e.g. a failure of transformers to solve recursive problems would be a failure of approximation, rather than a failure of generalization. Unless I misunderstood you?
Repeating a question I asked Jesse earlier, since others might be interested in the answer: how come we tend to hear more about PAC bounds than MAC bounds?
Note that in the SLT setting, "brains" or "neural networks" are not the sorts of things that can be singular (or really, have a certain RLCT) on their own - instead they're singular for certain distributions of data.
This is a good point I often see neglected. Though there's some sense in which a model can "be singular" independent of data: if the parameter-to-function map is not locally injective. Then, if a distribution q minimizes the loss, the preimage of q in parameter space can have non-trivial geometry.
These are called "degeneracies," and they can be understood for a particular model without talking about data. Though the actual distribution q that minimizes the loss is determined by the data, so it's sort of like the "menu" of degeneracies is data-independent, and the data "selects one off the menu." Degeneracies imply singularities, but not necessarily vice-versa, so they aren't everything. But we do think that degeneracies will be fairly important in practice.
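To make "degeneracies without data" concrete, here's a minimal toy example of my own (not from the thread): take the two-parameter model below. The parameter-to-function map is not locally injective, and the preimage of each function is a level set whose geometry - a smooth hyperbola versus a singular crossing of axes - is fixed by the model; the data only picks which level set minimizes the loss.

```latex
f_{a,b}(x) = a\,b\,x, \qquad (a,b) \in \mathbb{R}^2, \qquad
\text{preimage of } (x \mapsto cx) \;=\; \{(a,b) : ab = c\} =
\begin{cases}
\text{a smooth hyperbola} & c \neq 0 \\
\{a=0\} \cup \{b=0\}\ \text{(singular at the origin)} & c = 0
\end{cases}
```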
A possible counterpoint - that you are mostly advocating for awareness as opposed to specific points - is moot, since pretty much everyone is aware of the problem now: society as a whole, policymakers in particular, and people in AI research and alignment.
I think this specific point is false, especially outside of tech circles. My experience has been that while people are concerned about AI in general, and very open to X-risk when they hear about it, there is zero awareness of X-risk beyond popular fiction. It's possible that my sample isn't representative here, but I would expect that to swing in the other direction, given that the folks I interact with are often well-educated New-York-Times-reading types, who are going to be more informed than average.
Even among those aware, there's also a difference between far-mode "awareness" in the sense of X-risk as some far away academic problem, and near-mode "awareness" in the sense of "oh shit, maybe this could actually impact me." Hearing a bunch of academic arguments, but never seeing anybody actually getting fired up or protesting, will implicitly cause people to put X-risk in the first bucket. Because if they personally believed it to be a big near-term risk, they'd certainly be angry and protesting, and if other people aren't, that's a signal that other people don't really take it seriously. People sense a missing mood here and update on it.
In the cybersecurity analogy, it seems like there are two distinct scenarios being conflated here:
1) Person A says to Person B, "I think your software has X vulnerability in it." Person B says, "This is a highly specific scenario, and I suspect you don't have enough evidence to come to that conclusion. In a world where X vulnerability exists, you should be able to come up with a proof-of-concept, so do that and come back to me."
2) Person B says to Person A, "Given XYZ reasoning, my software almost certainly has no critical vulnerabilities of any kind. I'm so confident, I give it a 99.99999%+ chance." Person A says, "I can't specify the exact vulnerability your software might have without it in front of me, but I'm fairly sure this confidence is unwarranted. In general it's easy to underestimate how your security story can fail under adversarial pressure. If you want, I could name X hypothetical vulnerability, but this isn't because I think X will actually be the vulnerability, I'm just trying to be illustrative."
Story 1 seems to be the case where "POC or GTFO" is justified. Story 2 seems to be the case where "security mindset" is justified.
It's very different to suppose a particular vulnerability exists (not just as an example, but as the scenario that will happen), than it is to suppose that some vulnerability exists. Of course in practice someone simply saying "your code probably has vulnerabilities," while true, isn't very helpful, so you may still want to say "POC or GTFO" - but this isn't because you think they're wrong, it's because they haven't given you any new information.
Curious what others have to say, but it seems to me like this post is more analogous to story 2 than story 1.
I wish I had a more short-form reference here, but for anyone who wants to learn more about this, Rocket Propulsion Elements is the gold standard intro textbook. We used it in my university rocketry group, and it's a common reference to see in industry. Fairly well written, and you should only need to know high school physics and calculus.
Obviously this is all speculation, but maybe I'm saying that the universal approximation theorem implies that neural architectures are fractal in the space of all distributions (or some restricted subset thereof)?
Oh I actually don't think this is speculation - if (big if) you satisfy the conditions for universal approximation, then this is just true (specifically, that the image of the parameter-to-function map is dense in function space). Like, for example, you can state Stone-Weierstrass as: for a compact Hausdorff space X (say, a compact subset of R^n), with the continuous functions C(X) under the sup norm, the subalgebra of polynomials is dense in C(X). In practice you'd only have a finite-dimensional subset of the polynomials, so this obviously can't hold exactly, but as you increase the size of the polynomial family, it will be more space-filling and the error bound will decrease.
Curious what's your beef with universal approximation? Stone-Weierstrass isn't quantitative - is that the reason?
The problem is that the number of basis functions required to achieve a given error bound grows exponentially with the dimension d of your underlying space X. For instance, if you assume that the weights depend continuously on the target function, ε-approximating all functions on [0,1]^d with Sobolev norm (smoothness s) at most 1 provably takes at least on the order of ε^(-d/s) parameters (DeVore et al.). This is a lower bound.
So for any realistic dimension d, universal approximation is basically useless - the number of parameters required is enormous. Which makes sense, because approximation by basis functions is basically the continuous version of a lookup table.
Because neural networks actually work in practice, without requiring exponentially many parameters, this also tells you that the space of realistic target functions can't just be some generic function space (even with smoothness conditions), it has to have some non-generic properties to escape the lower bound.
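To put rough numbers on "basically a lookup table" (my own toy arithmetic, not from the DeVore et al. bound itself):

```python
# Cells needed for a piecewise-constant ("lookup table") approximation of a
# generic 1-Lipschitz function on [0,1]^d to accuracy eps: roughly (1/eps)^d.
eps = 0.1
cells_per_axis = round(1 / eps)  # 10 cells along each axis

for d in [1, 2, 10, 784, 12288]:  # 784 = 28x28 grayscale, 12288 = 64x64 RGB
    n_cells = cells_per_axis ** d  # exact integer; here it's just 10^d
    print(f"d = {d}: ~10^{len(str(n_cells)) - 1} cells")
```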
Sorry, I realized that you're mostly talking about the space of true distributions, and I was mainly talking about the "data manifold" (related to the structure of the input-to-output map for a fixed parameter). You can disregard most of that.
Though, even in the case where we're talking about the space of true distributions, I'm still not convinced that the image of parameter space under the parameter-to-function map needs to be fractal. Like, a space-filling assumption sounds to me like basically a universal approximation argument - you're assuming that this image densely (or almost densely) fills the space of all probability distributions of a given dimension. But of course we know that universal approximation is problematic and can't explain what neural nets are actually doing for realistic data.
Very interesting, glad to see this written up! Not sure I totally agree that it's necessary for the data manifold to be a fractal? But I do think you're onto something.
In particular you say that "there are points in the larger dimensional space that are very (even arbitrarily) far from the data manifold," but in the case of GPT-4 the input space is discrete, and even in the case of e.g. vision models the input space is compact. So the distance must be bounded.
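(Concretely, and just to illustrate the bound - for pixel values in [0,1], the farthest apart any two inputs can be is the diagonal of the hypercube:)

```latex
\|x - y\|_2 \;\le\; \sqrt{H \cdot W \cdot 3}
\qquad\text{e.g. for } 64 \times 64 \text{ RGB: } \sqrt{12288} \approx 110.9
```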
Plus if you e.g. sample a random image, you'll find there's usually a finite distance you need to travel in the input space (in L1, L2, etc) until you get something that's human interpretable (i.e. lies on the data manifold). So that would point against the data manifold being dense in the input space.
But there is something here, I think. The distance usually isn't that large until you reach a human interpretable image, and it's quite easy to perturb images slightly to have completely different interpretations (both to humans and ML systems). A fairly smooth data manifold wouldn't do this. So my guess is that the data "manifold" is in fact not a manifold globally, but instead has many self-intersections and is singular. That would let it be close to large portions of input space without being literally dense in it. This also makes sense from an SLT perspective. And IIRC there's some empirical evidence that the dimension of the data "manifold" is not globally constant.
if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated
Yeah, this "if" was the part I was claiming permutation invariance causes problems for - that identically distributed neurons probably couldn't express something as complicated as a board-state-detector. As soon as that's true (plus assuming the board-state-detector is implemented linearly), agreed, you can recover it with a linear probe regardless of permutation-invariance.
This is a more reasonable objection (although actually, I'm not sure if independence does hold in the tensor programs framework - probably?)
I probably should've just gone with that one, since the independence barrier is the one I usually think about, and harder to get around (related to non-free-field theories, perturbation theory, etc).
My impression from reading through one of the tensor program papers a while back was that it still makes the IID assumption, but there could be some subtlety about that I missed.
The reason the Othello result is surprising to the NTK is that neurons implementing an "Othello board state detector" would be vanishingly rare in the initial distribution, and the NTK thinks that the neuron function distribution does not change during training.
Yeah, that's probably the best way to explain why this is surprising from the NTK perspective. I was trying to include mean-field and tensor programs as well (where that explanation doesn't work anymore).
As an example, imagine that our input space consisted of five pixels, and at initialization neurons were randomly sensitive to one of the pixels. You would easily be able to construct linear probes sensitive to individual pixels even though the distribution over neurons is invariant over all the pixels.
Yeah, this is a good point. What I meant to specify wasn't that you can't recover any permutation-sensitive data at all (trivially, you can recover data about the input), but that any learned structures must be invariant to neuron permutation. (Though I'm feeling sketchy about the details of this claim). For the case of NTK, this is sort of trivial, since (as you pointed out) it doesn't really learn features anyway.
By the way, there are actually two separate problems that come from the IID assumption: the "independent" part, and the "identically-distributed" part. For space I only really mentioned the second one. But even if you deal with the identically-distributed assumption, the independence assumption still causes problems. It prevents a lot of structure from being representable - for example, a layer where "at most two neurons are activated on any input from some set" can't be represented with independently distributed neurons. More generally, a lot of circuit-style constructions require this kind of joint structure. IMO this is actually the more fundamental limitation, though it takes longer to dig into.
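Here's a quick toy simulation of that example (my own made-up numbers, just to illustrate the independence barrier): independently distributed neurons violate an "at most two active" constraint on a constant fraction of inputs, while a coupled, circuit-style construction satisfies it by design.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_samples = 100, 10_000
p = 0.02  # per-neuron activation probability (chosen so ~2 fire on average)

# Independent neurons: each neuron fires independently of the others.
indep = rng.random((n_samples, n_neurons)) < p
frac_ok_indep = np.mean(indep.sum(axis=1) <= 2)

# Coupled ("circuit-style") neurons: at most two fire per input, by construction.
coupled = np.zeros((n_samples, n_neurons), dtype=bool)
chosen = rng.integers(0, n_neurons, size=(n_samples, 2))
coupled[np.arange(n_samples)[:, None], chosen] = True
frac_ok_coupled = np.mean(coupled.sum(axis=1) <= 2)

print(f"independent neurons satisfy the constraint on {frac_ok_indep:.0%} of inputs")
print(f"coupled neurons satisfy it on {frac_ok_coupled:.0%} of inputs")
```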
I think the core surprising thing is the fact that the model learns a representation of the board state. The causal / linear probe parts are there to ensure that you've defined "learns a representation of the board state" correctly - otherwise the probe could just be computing the board state itself, without that knowledge being used in the original model.
This is surprising to some older theories like statistical learning, because the model is usually treated as effectively a black box function approximator. It's also surprising to theories like NTK, mean-field, and tensor programs, because they view model activations as IID samples from a single-neuron probability distribution - but you can't reconstruct the board state via a permutation-invariant linear probe. The question of "which neuron is which" actually matters, so this form of feature learning is beyond them. (Though there may be e.g. perturbative modifications to these theories to allow this in a limited way).
Yeah, that was what I was referring to. Maybe “algorithmic model” isn’t the most precise - what we know is that the NN has an internal model of the board state that’s causal (i.e. the NN actually uses it to make predictions, as verified by interventions). Theoretically it could just be forming this internal model via a big lookup table / function approximation, rather than via a more sophisticated algorithm. Though we’ve seen from modular addition work, transformer induction heads, etc that at least some of the time NNs learn genuine algorithms.
Agreed - that alone isn’t particularly much, just one of the easier things to express succinctly. (Though the fact that this predates deep learning does seem significant to me. And the fact that SLT can delineate precisely where statistical learning theory went wrong here seems important too.)
Another is that it can explain phenomena like phase transitions, as observed in e.g. toy models of superposition, at a quantitative level. There's also been a substantial chunk of non-SLT ML literature that has independently rediscovered small pieces of SLT, like failures of information geometry, the importance of parameter degeneracies, etc. More speculatively, but what excites me most, is that empirical phenomena like grokking, mode connectivity, and circuits seem to fit intuitively into SLT, though this hasn't been demonstrated rigorously yet.
Yeah, I can expand on that - this is obviously going to be fairly opinionated, but there are a few things I'm excited about in this direction.
The first thing that comes to mind here is singular learning theory. I think all of my thoughts on DL theory are fairly strongly influenced by it at this point. It definitely doesn't have all the answers at the moment, but it's the single largest theory I've found that makes deep learning phenomena substantially "less surprising" (bonus points for these ideas preceding deep learning). For instance, one of the first things that SLT tells you is that the effective parameter count (RLCT) of your model can vary depending on the training distribution, allowing it to basically do internal model selection - the absence of bias-variance tradeoff, and the success of overparameterized models, aren't surprising when you internalize this. The "connection to real world structure" aspect hasn't been fully developed here, but it seems heavily suggested by the framework, in multiple ways - for instance, hierarchical statistical models are naturally singular statistical models, and the hierarchical structure is reflected in the singularities. (See also Tom Waring's thesis).
Outside of SLT, there's a few other areas I'm excited about - I'll highlight just one. You mentioned Lin, Tegmark, and Rolnick - the broader literature on depth separations and the curse of dimensionality seems quite important. The approximation abilities of NNs are usually glossed over with universal approximation arguments, but this can't be enough - for generic Lipschitz functions, universal approximation takes exponentially many parameters in the input dimension (this is a provable lower bound). So there has to be something special about the functions we care about in the real world. See this section of my post for more information. I'd highlight Poggio et al. here, which is the paper in the literature closest to my current view on this.
This isn't a complete list, even of theoretical areas that I think could specifically help address the "real world structure" connection, but these are the two I'd feel bad not mentioning. This doesn't include some of the more empirical findings in science of DL that I think are relevant, like simplicity bias, mode connectivity, grokking, etc. Or work outside DL that could be helpful to draw on, like Boolean circuit complexity, algorithmic information theory, natural abstractions, etc.
I can't speak for Richard, but I think I have a similar issue with NTK and adjacent theory as it currently stands (beyond the usual issues). I'm significantly more confident in a theory of deep learning if it cleanly and consistently explains (or better yet, predicts) unexpected empirical phenomena. The one that sticks out most prominently in my mind, that we see constantly in interpretability, is this strange correspondence between the algorithmic "structure" we find in trained models (both ML and biological!) and "structure" in the data generating process.
That training on Othello move sequences gets you an algorithmic model of the game itself is surprising from most current theoretical perspectives! So in that sense I might be suspicious of a theory of deep learning that fails to "connect our understanding of neural networks to our understanding of the real world". This is the single most striking thing to come out of interpretability, in my opinion, and I'm worried about a "deep learning theory of everything" if it doesn't address this head on.
That said, NTK doesn't promise to be a theory of everything, so I don't mean to hold it to an unreasonable standard. It does what it says on the tin! I just don't think it's explained a lot of the remaining questions I have. I don't think we're in a situation where "we can explain 80% of a given model's behavior with the NTK" or similar. And this is relevant for e.g. studying inductive biases, as you mentioned.
But I strong upvoted your comment, because I do think deep learning theory can fill this gap - I'm personally trying to work in this area. There are some tractable-looking directions here, and people shouldn't neglect them!
Someone with better SLT knowledge might want to correct this, but more specifically:
Studying the "volume scaling" of near-min-loss parameters, as beren does here, is really core to SLT. The rate of change of this volume as you change your epsilon loss tolerance is called the "density of states" (DOS) function, and much of SLT basically boils down to an asymptotic analysis of this function. It also relates the terms in the asymptotic expansion to things you care about, like generalization performance.
You might wonder why SLT needs so much heavy machinery, since this sounds so simple - and it's basically because SLT can handle the case where some eigenvalues of the Hessian are zero, where the usual Laplace-approximation formula breaks down. This is actually important in practice, since IIRC real models often have around 90% zero eigenvalues in their Hessian. It also leads to substantially different theory - for instance the "effective number of parameters" (RLCT) can vary depending on the dataset.
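For reference, the volume-scaling asymptotics in question (standard SLT results, quoted from memory, so treat the exact form as a sketch):

```latex
V(\varepsilon) \;=\; \int_{\{w \,:\, L(w) - L(w^*) \,<\, \varepsilon\}} \varphi(w)\, dw
\;\sim\;
\begin{cases}
c\, \varepsilon^{d/2} & \text{regular case (non-degenerate Hessian, } d \text{ parameters)} \\
c\, \varepsilon^{\lambda} \, (\log(1/\varepsilon))^{m-1} & \text{singular case}
\end{cases}
```

Here λ is the RLCT (equal to d/2 in the regular case) and m is its multiplicity.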
If anyone wants an interactive visual for the correlation upper and lower bounds, I made one for myself here: https://www.math3d.org/chXa4xZrC.
The x, y, and z axes are the correlations between X and Y, between Y and Z, and between X and Z, respectively. Everything inside the blue surface is possible. There are also some sliders to help visualize level sets.
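For anyone who wants the formula behind the surface (it follows from requiring the 3x3 correlation matrix to be positive semi-definite; I believe this is what the plot shows):

```latex
\rho_{XY}\,\rho_{YZ} \;-\; \sqrt{\left(1-\rho_{XY}^2\right)\left(1-\rho_{YZ}^2\right)}
\;\;\le\;\; \rho_{XZ} \;\;\le\;\;
\rho_{XY}\,\rho_{YZ} \;+\; \sqrt{\left(1-\rho_{XY}^2\right)\left(1-\rho_{YZ}^2\right)}
```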
- Yep, pre-LN transformers avoid the vanishing gradient problem.
- Haven't checked this myself, but the phenomenon seems to be fairly clean? See figure 3.b in the paper I linked, or figure 1 in this paper.
I actually wouldn't think of vanishing/exploding gradients as a pathological training problem, but as a more general phenomenon in any dynamical system. Some dynamical systems (e.g. the iterated sigmoid map) fall into equilibria over time, getting exponentially close to a fixed point. Other dynamical systems (e.g. the logistic map) become chaotic, and similar trajectories diverge exponentially over time. If you check, you'll find the first kind leads to vanishing gradients (at each iteration of the map), and the second to exploding ones. This is a forward-pass perspective on the problem - the usual perspective considers only the implications for the backward pass, since that's where the problem usually shows up.
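A quick numerical sketch of that claim (the particular maps, starting point, and step count are just my choices):

```python
import numpy as np

def iterate_and_grad(f, df, x0, steps):
    """Iterate x_{t+1} = f(x_t), accumulating d x_T / d x_0 by the chain rule."""
    x, grad = x0, 1.0
    for _ in range(steps):
        grad *= df(x)
        x = f(x)
    return x, grad

sigmoid = lambda z: 1 / (1 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))

r = 3.9  # chaotic regime of the logistic map
logistic = lambda z: r * z * (1 - z)
d_logistic = lambda z: r * (1 - 2 * z)

# Contracting system: trajectories fall into a fixed point, gradient vanishes.
print(iterate_and_grad(sigmoid, d_sigmoid, 0.3, 50))
# Chaotic system: nearby trajectories diverge, gradient magnitude explodes.
print(iterate_and_grad(logistic, d_logistic, 0.3, 50))
```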
Notice above that the system with exponential decay in the forward pass had vanishing (shrinking-norm) gradients in the backward pass - the forward-pass and backward-pass behavior mirror each other. If you start with toy single-neuron networks, you can prove this to yourself pretty easily.
The predictions here are still complicated by a few facts - first, exponential divergence/convergence of trajectories doesn't necessarily imply exponentially growing/shrinking norms. Second, the layer norm complicates things, confining some dynamics to a hypersphere (modulo the zero-mean part). Haven't fully worked out the problem for myself yet, but still think there's a relationship here.
Exponential growth is a fairly natural thing to expect here, roughly for the same reason that vanishing/exploding gradients happen (input/output sensitivity is directly related to param/output sensitivity). Based on this hypothesis, I'm preregistering the prediction that (all other things equal) the residual stream in post-LN transformers will exhibit exponentially shrinking norms, since it's known that post-LN transformers are more sensitive to vanishing gradient problems compared to pre-LN ones.
Edit: On further thought, I still think this intuition is correct, but I expect the prediction is wrong - the notion of relative residual stream size in a post-LN transformer is a bit dubious, since the size of the residual stream is entirely determined by the layer norm constants, which are a bit arbitrary because they can be rolled into other weights. I think the proper prediction is more around something like Lyapunov exponents.
Great discussion here!
Leaving a meta-comment about priors: on one hand, almost-linear features seem very plausible (a priori) for almost-linear neural networks; on the other, linear algebra is probably the single mathematical tool I'd expect ML researchers to be incredibly well-versed in, and the fact that we haven't found a "smoking gun" at this point, with so much potential scrutiny, makes me suspicious.
And while this is a very natural hypothesis to test, and I'm excited for people to do so, it seems possible that the field's familiarity with linear methods is a hammer that makes everything look like a nail. It's easy to focus on linear interpretability because the alternative seems too hard (a response I often get) - I think this is wrong, and there are tractable directions in the nonlinear case too, as long as you're willing to go slightly further afield.
I also have some skepticism on the object-level here too, but it was taking me too long to write it up, so that will have to wait. I think this is definitely a topic worth spending more time on - appreciate the post!
A bit of a side note, but I don't even think you need to appeal to new architectures - it looks like the NTK approximation performs substantially worse even with just regular MLPs (see this paper, among others).
For anyone who wants to play around with this themselves, you might be interested in a small Colab notebook I made, with some interactive 2D and 3D plots.
To be clear, I don't know the answer to this!
Spitballing here, the key question to me seems to be about the OOD generalization behavior of ML models. Models that receive similarly low loss on the training distribution still have many different ways they can behave on real inputs, so we need to know what generalization strategies are likely to be learned for a given architecture, training procedure, and dataset. There is some evidence in this direction, suggesting that ML models are biased towards a simplicity prior over generalization strategies.
If this is true, then the incredibly handwave-y solution is to just create a dataset where the simplest (good) process for estimating labels is to emulate an aligned human. At first pass this actually looks quite easy - it's basically what we're doing with language models already.
Unfortunately there's quite a lot we swept under the rug. In particular this may not scale up as models get more powerful - the prior towards simplicity can be overcome if it results in lower loss, and if the dataset contains some labels that humans unknowingly rated incorrectly, the best process for estimating labels involves saying what humans believe is true rather than what actually is. This can already be seen with the sycophancy problems today's LLMs are having.
There are a lot of other thorny problems in this vein that you can come up with after a few minutes of thinking. That being said, it doesn't seem completely doomed to me! There just needs to be a lot more work here. (But I haven't spent too long thinking about this, so I could be wrong.)
In other words, does there exist any dataset such that generating extrapolations from it leads to good outcomes, even in the hands of bad actors?
I think this is an important question to ask, but "even in the hands of bad actors" is just too difficult a place to start. I'm sure you're aware, but it's an unsolved problem whether there exists a dataset / architecture / training procedure such that "generating extrapolations from it leads to good outcomes," for sufficiently capable ML models, even in the hands of good actors. (And the "bad actor" piece can at least plausibly be solved by social coordination, whereas the remaining portion is a purely technical problem you can't dodge.)
But if you drop the bad actor part, I think this question is a good one to ask (but still difficult)! I think answering this question requires a better understanding of how neural networks generalize, but I can at least see worlds where the answer is "yes". (Though there are still pitfalls in how you instantiate this in reality - does your dataset need to be perfectly annotated, so that truth-telling is incentivized over sycophancy/deception? Does it require SGD to always converge to the same generalization behavior? etc.)
My summary (endorsed by Jesse):
1. ERM can be derived from Bayes by assuming your "true" distribution is close to a deterministic function plus a probabilistic error, but this fact is usually obscured (see the sketch after this list)
2. Risk is not a good inner product (naively) - functions with similar risk on a given loss function can be very different
3. The choice of functional norm is important, but uniform convergence just picks the sup norm without thinking carefully about it
4. There are other important properties of models/functions than just risk
5. Learning theory has failed to find tight (generalization) bounds, and bounds might not even be the right thing to study in the first place
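The sketch for point 1, as I understand it (my reconstruction, so blame me and not Jesse for any errors):

```latex
% Assume the "true" conditional distribution is a deterministic function plus Gaussian noise:
p(y \mid x, f) \;=\; \mathcal{N}\!\left(y;\, f(x),\, \sigma^2\right)
% Then for a dataset D = \{(x_i, y_i)\}_{i=1}^n,
-\log p(D \mid f) \;=\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 \;+\; \text{const},
% so maximum likelihood (or MAP with a flat prior) is exactly empirical risk
% minimization with the squared-error loss.
```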
Since nobody here has made the connection yet, I feel obliged to write something, late as I am.
To make the problem more tractable, suppose we restrict our set of coordinate changes to ones where the resulting functions can still (approximately) be written as a neural network. (These are usually called "reparameterizations.") This occurs when multiple neural networks implement (approximately) the same function; they're redundant. One trivial example of this is the invariance of ReLU networks to scaling one layer by a constant, and the next layer by the inverse of that constant.
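As a quick sanity check of that invariance, here's a toy two-layer ReLU network (random weights and the constant are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def net(x, W1, b1, W2, b2):
    """Two-layer ReLU network."""
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

c = 3.7  # any positive constant
x = rng.normal(size=4)

# Scale layer 1 (weights and biases) by c and layer 2 by 1/c: positive
# homogeneity of ReLU means the function is unchanged, though the parameters differ.
out_original = net(x, W1, b1, W2, b2)
out_rescaled = net(x, c * W1, c * b1, W2 / c, b2)
print(np.allclose(out_original, out_rescaled))  # True
```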
Then, in the language of parametric statistics, this phenomenon has a name: non-identifiability! Lucky for us, there's a decent chunk of literature on identifiability in neural networks out there. At first glance, we have what seems like a somewhat disappointing result: ReLU networks are identifiable up to permutation and rescaling symmetries.
But there's a catch - this is only true except for a set of measure zero. (The other catch is that the results don't cover approximate symmetries.) This is important because there are reasons to suggest real neural networks are pushed close to this set during training. This set of measure zero corresponds to "reducible" or "degenerate" neural networks - those that can be expressed with fewer parameters. And hey, funny enough, aren't neural networks quite easily pruned?
In other parts of the literature, this problem has been phrased differently, under the framework of "structure-function symmetries" or "canonicalization." It's also often covered when discussing the concepts of "inverse stability" and "stable recovery." For more on this, including a review of the literature, I highly recommend Matthew Farrugia-Roberts' excellent master's thesis on the topic.
(Separately, I'm currently working on the issue of coordinate-free sparsity. I believe I have a solution to this - stay tuned, or reach out if interested.)
Dropping some late answers here - though this isn't my subfield, so forgive me if I mess things up.
Correct me if I'm wrong, but it struck me while reading this that you can think of a neural network as learning two things at once:
- a classification of the input into 2^N different classes (where N is the total number of neurons), each of which gets a different function applied to it
- those functions themselves
This is exactly what a spline is! This is where the spline view of neural networks comes from (mentioned in Appendix C of the post). What you call "classes" the literature typically calls the "partition." Also, while deep networks can theoretically have exponentially many elements in the partition (w.r.t. the number of neurons), in practice the number actually realized grows much more slowly - closer to linearly.
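Here's a rough empirical sketch of that last point, on a randomly initialized toy network (all sizes and the sampling grid are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, widths = 2, [16, 16]  # tiny ReLU MLP with 32 neurons total

Ws, bs, prev = [], [], d_in
for w in widths:
    Ws.append(rng.normal(size=(w, prev)) / np.sqrt(prev))
    bs.append(rng.normal(size=w) * 0.1)
    prev = w

def activation_pattern(x):
    """Binary on/off pattern of every ReLU for input x (i.e. which partition cell x is in)."""
    pattern, h = [], x
    for W, b in zip(Ws, bs):
        pre = W @ h + b
        pattern.extend((pre > 0).tolist())
        h = np.maximum(pre, 0)
    return tuple(pattern)

# Count distinct activation patterns hit by a dense grid of inputs.
xs = np.linspace(-3, 3, 200)
patterns = {activation_pattern(np.array([x, y])) for x in xs for y in xs}
print(f"{len(patterns)} patterns observed vs {2 ** sum(widths)} theoretically possible")
```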
Can the functions and classes be decoupled?
To my understanding this is exactly what previous (non-ML) research on splines did, with things like free-knot splines. Unfortunately this is computationally intractable. So instead much research focused on fixing the partition (say, to a uniform grid), and changing only the functions. A well-known example here is the wavelet transform. But then you lose the flexibility to change the partition - incredibly important if some regions need higher resolution than others!
From this perspective the coupling of functions to the partition is exactly what makes neural networks good approximators in the first place! It allows you to freely move the partition, like with free-knot splines, but in a way that's still computationally tractable. Intuitively, neural networks have the ability to use high resolution where it's needed most, like how 3D meshes of video game characters have the most polygons in their face.
How much of the power of neural networks comes from their ability to learn to classify something into exponentially many different classes vs from the linear transformations that each class implements?
There are varying answers here, depending on what you mean by "power": I'd say either the first or neither. If you mean "the ability to approximate efficiently," then I would probably say that the partition matters more - assuming the partition is sufficiently fine, each linear transformation only performs a "first order correction" to the mean value of the partition.
But I don't really think this is where the "magic" of deep learning comes from. In fact this approximation property holds for all neural networks, including shallow ones. It can't capture what I see as the most important properties, like what makes deep networks generalize well OOD. For that you need to look elsewhere. It appears like deep neural networks have an inductive bias towards simple algorithms, i.e. those with a low (pseudo) Kolmogorov complexity. (IMO, from the spline perspective, a promising direction to explain this could be via compositionality and degeneracy of spline operators.)
Hope this helps!
The field of complex systems seems like a great source of ideas for interpretability and alignment. In lieu of a longer comment, I'll just leave this great review by Teehan et al. on emergent structures in LLMs. Section 3 in particular is great.
But in the last few years, we’ve gotten: [...]
- Robots (Boston Dynamics)
Broadly agree with this post, though I'll nitpick the inclusion of robotics here. I don't think it's progressing nearly as fast as ML, and it seems fairly uncontroversial that we're not nearly as close to human-level motor control as we are to (say) human-level writing. I only bring this up because a decent chunk of bad reasoning (usually underestimation) I see around AGI risk comes from skepticism about robotics progress, which is mostly irrelevant in my model.
Have you looked into "conditionally conserved" quantities/symmetries here? Most macroscopic properties fall into this category - e.g. the color of a particular material is conserved so long as it doesn't change phase or transmute (i.e. it stays within a particular energy range). This is associated with a (spontaneously-broken) symmetry, since the absorption spectrum of a material can be uniquely determined from its space group. I'd be willing to bet that the only information accessible at a distance (up to a change of variables) is these conditionally conserved quantities, but I've had a hard time rigorously proving it. (I believe it requires KAM theory.)
Also, have you looked into Koopman spectral theory? It's for deterministic systems, but seems quite relevant - the Koopman operator maps observables at one time to observables at another, telling you how they change over time. You can relate its spectrum to underlying geometrical properties, and it's a linear operator even if your dynamics are nonlinear.
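(For anyone unfamiliar, the object in question is just composition with the flow map:)

```latex
\left(\mathcal{K}^{t} g\right)(x) \;=\; g\!\left(\Phi^{t}(x)\right)
\qquad \text{for observables } g : X \to \mathbb{C} \text{ and flow } \Phi^{t} : X \to X,
```

which is linear in g even when the flow itself is nonlinear.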
This is something I've thought about recently - a full answer would take too long to write, but I'll leave a couple comments.
First, what this implies about learning algorithms can be summarized as "it explains the manifold hypothesis." The Telephone Theorem creates an information bottleneck that limits how much information can be captured at a distance. This means that a 64x64 RGB image, despite being nominally 12288-dimensional, in reality captures far less information and lies on a much lower-dimensional latent space. Chaos has irreversibly dispersed all the information about the microscopic details of your object. "Free lunch" follows quite easily from this, since the set of functions you care about is not really the set of functions on all RGB images, but the set of functions on a much smaller latent space.
Second, the vanilla Telephone Theorem isn't quite sufficient - the only information that persists in the infinite-time limit is conserved quantities (e.g. energy), which isn't very interesting. You need to restrict to some finite time (which is sufficiently longer than your microscopic dynamics) instead. In this case, persistent information now includes "conditionally conserved" quantities, such as the color of a solid object (caused by spontaneous symmetry-breaking reducing the permanently-valid Lorentz symmetry to the temporarily-valid space group symmetry). I believe the right direction to go here is ergodic theory and KAM theory, although the details are fuzzy to me.
This is something I've been thinking about recently. In particular, you can generalize this by examining temporary conserved quantities, such as phases of matter (typically produced by spontaneous symmetry-breaking). This supports a far richer theory of information-accessible-at-a-distance than only permanently conserved quantities like energy can provide, and allows for this information to have dynamics like a stochastic process. In fact, if you know a bit of solid-state physics you probably realize exactly how much of our observed macroscopic properties (e.g. object color) are determined by things like spontaneous symmetry-breaking. You can make all of this more rigorous and systematic by connecting to ergodic theory, but this is probably deserving of a full paper, if I can get around to it. Happy to discuss more with anyone else.
Why does GPT-3 use the same matrix for word embedding and final predictions? I would expect this to constrain the model, and the only potential upsides I can see are saving parameters (lol) and preserving interpretability (lmao)[8]. Other resources like A Mathematical Framework for Transformer Circuits use different embedding/unembedding matrices - their W_E and W_U. Perhaps this is not necessary for GPT-3 since the final feed-forward network can perform an appropriate linear transformation, and in A Mathematical Framework they are looking at transformers without FFNs. But some properties (e.g. words being linear combinations of other words) cannot be changed by such a linear transformation, so having an entire new unembedding matrix could still add value.
This is called "tied embeddings". You're right that models don't need to have this constraint, and some don't - for instance, GPT-NeoX. I'm not sure whether or not this actually improves performance in practice though.
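A minimal sketch of the two options in PyTorch (the dimensions are GPT-2-ish placeholders, not GPT-3's actual configuration):

```python
import torch
import torch.nn as nn

vocab, d_model = 50_257, 768  # placeholder sizes

embed = nn.Embedding(vocab, d_model)             # W_E: token id -> vector
unembed = nn.Linear(d_model, vocab, bias=False)  # W_U: vector -> logits

# Untied: embed.weight and unembed.weight are learned independently.
# Tied embeddings: the unembedding simply reuses the embedding matrix.
unembed.weight = embed.weight  # both have shape (vocab, d_model)

tokens = torch.tensor([[1, 2, 3]])
logits = unembed(embed(tokens))
print(logits.shape)  # torch.Size([1, 3, 50257])
```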
I don't think the game is an alarming capability gain at all - I agree with LawrenceC's comment below. It's more of a "gain-of-function research" scenario to me. Like, maybe we shouldn't deliberately try to train a model to be good at this? If you've ever played Diplomacy, you know the whole point of the game is manipulating and backstabbing your way to world domination. I think it's great that the research didn't actually seem to come up with any scary generalizable techniques or dangerous memetics, but ideally we shouldn't even be trying in the first place.
So if streaming works as well as Cerebras claims, GPUs can do that as well or better.
Hmm, I'm still not sure I buy this, after spending some more time thinking about it. GPUs can't stream a matrix multiplication efficiently, as far as I'm aware. My understanding is that they're not very good at matrix-vector operations compared to matrix-matrix because they rely on blocked matrix multiplies to efficiently use caches and avoid pulling weights from RAM every time.
Cerebras says that the CS-2 is specifically designed for fast matrix-vector operations, and uses dataflow scheduling, so it can stream a matrix multiplication by just performing matrix-vector operations as weights stream in. And the weights are getting streamed from external RAM, rather than requested as needed, so there's no round-trip latency gunking up the works like a GPU has when it wants data from RAM.
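Rough arithmetic-intensity numbers behind that claim (my own back-of-the-envelope sketch; fp16 storage and the particular shapes are assumptions):

```python
def intensity_matvec(m, n):
    """FLOPs per byte moved for y = A @ x with A of shape (m, n), fp16 storage."""
    flops = 2 * m * n
    bytes_moved = 2 * (m * n + n + m)  # weights + input + output
    return flops / bytes_moved

def intensity_matmul(m, n, k):
    """FLOPs per byte moved for C = A @ B with shapes (m, n) and (n, k), fp16 storage."""
    flops = 2 * m * n * k
    bytes_moved = 2 * (m * n + n * k + m * k)
    return flops / bytes_moved

# Batch size 1 (matrix-vector) vs a blocked batch of 512 (matrix-matrix):
print(f"matvec: ~{intensity_matvec(4096, 4096):.1f} FLOPs/byte (memory-bandwidth bound)")
print(f"matmul: ~{intensity_matmul(4096, 4096, 512):.0f} FLOPs/byte (compute bound)")
```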
I agree sparsity (and also probably streaming) will be increasingly important; I've actually developed new techniques for sparse matrix multiplication on GPUs.
Cerebras claims that their hardware support for fast matrix-vector multiplication gives a 10x speed boost to multiplying sparse matrices, which could be helpful.
The Andromeda 'supercomputer' has peak performance of 120 pflops dense compared to 512 pflops dense for a single 256 H100 GPU pod from nvidia
I'm not sure if PFLOPs are a fair comparison here though, if I understand Cerebras' point correctly. Like, if you have ten GPUs with one PFLOP each, that's technically the same number of PFLOPs as a single GPU with ten PFLOPs. But actually that single GPU is going to train a lot faster than the ten GPUs because the ten GPUs are going to have to spend time communicating with each other. Especially as memory limitations make you resort to tensor or pipeline parallelism instead of data parallelism. Cerebras claims that to train "10 times faster you need 50 times as many GPUs."
According to this logic what you really care about instead is probably training speed or training speedup per dollar. Then the pitch for Andromeda, unlike a GPU pod, is that those 120 PFLOPS are "real" in the sense that training speed increases linearly with the PFLOPS.
The cerebras wafer-scale chip is a weird architecture that should excel in the specific niche of training small models at high speed, but that just isn't where the industry is going. It is severely lacking in the large cheap fast off-chip RAM that GPUs have
I'm not sure I totally have a good grasp on this, but isn't this the whole point of Andromeda's weight streaming system? Fast off-chip memory combined with high memory bandwidth on the chip itself? Not sure what would limit this to small models if weights can be streamed efficiently, as Cerebras claims.
Even if I'm right, I'm not sure either of these points changes the overall conclusion though. I'd guess Cerebras still isn't economically competitive, or they'd be boasting about it as you said.
Hmm, I see how that would happen with other architectures, but I'm a bit confused about how it applies here. Andromeda has the weight updates computed by a single server (MemoryX) and then distributed to all the nodes. Wouldn't this be a one-to-many broadcast, with transmission time roughly logarithmic in the number of nodes?
No substantive reply, but I do want to thank you for commenting here - original authors publicly responding to analysis of their work is something I find really high value in general. Especially academics that are outside the usual LW/AF sphere, which I would guess you are given your account age.
I'm not sure exactly where I land on this, but I think it's important to consider that restricting the data companies can train on could influence the architectures they use. Self-supervised autoregressive models à la GPT-3 seem a lot more benign than full-fledged RL agents, and the latter are a lot less data hungry than the former (especially in terms of copyrighted data). There are enough other factors here that I'm not completely confident in this analysis, but it's worth thinking about.
This proposal looks really promising to me. This might be obvious to everyone, but I think much better interpretability research is really needed to make this possible in a safe(ish) way. (To verify the shard does develop, isn't misaligned, etc.) We'd just need to avoid the temptation to take the fancy introspection and interpretability tools this would require and use them as optimization targets, which would obviously make them useless as safeguards.
This is definitely the core challenge of the language model approach, and may be the reason it fails. I actually believe language models aren't the most likely approach to achieve superintelligence. But I also place a non-trivial probability on this occurring, which makes it worth thinking about for me.
Let me try to explain why I don't rule this possibility out. Obviously GPT-3 doesn't know more than a human, as evident in its sub-human performance on common tasks and benchmarks. But suppose we instead have a much more advanced system, a near-optimal sequence predictor for human-written text. Your argument is still correct - it can't output anything more than a human would know, because that wouldn't achieve minimum loss on the training data. But does that imply it can't know more than humans? That is, is it impossible for it to make use of facts that humans don't realize as an intermediate step in outputting text that only includes facts humans do realize?
I think not necessarily. As an extreme example, one particular optimal sequence predictor would be a perfect simulation, atom-for-atom, of the entire universe at the time a person was writing the text they wrote. Trivially, this sequence predictor "knows" more than humans do, since it "knows" everything, but it will also never output that information in the predicted text.
More practically, sequence prediction is just compression. More effective sequence prediction means more effective compression. The more facts about the world you know, the less data is required to describe each individual piece of text. For instance, knowing the addition algorithm is a more space-efficient way to predict all strings like "45324 + 58272 =" than memorization. As the size of the training data you're given approaches infinity, assuming a realistic space-bounded sequence predictor, the only way its performance can improve is with better world/text modeling. The fact that humans don't know a certain fact wouldn't prohibit it from being discovered if it allows more efficient sequence prediction.
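A crude illustration of the compression point (toy numbers of my own, just to show the direction of the effect):

```python
# Storing every 5-digit addition fact verbatim vs. storing a tiny program.
n_facts = 90_000 * 90_000                         # all ordered pairs of 5-digit operands
bytes_per_fact = len("45324 + 58272 = 103596\n")  # ~23 bytes per memorized line
memorization_bytes = n_facts * bytes_per_fact

addition_program = "def add(a, b): return a + b"  # covers every case in ~30 bytes

print(f"memorization: ~{memorization_bytes:.1e} bytes")
print(f"algorithm:    {len(addition_program)} bytes")
```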
Will we reach this superhuman point in practice? I don't know. It may take absurd amounts of computation and training data to reach this point, or just more than alternative approaches. But it doesn't seem impossible to me in theory.
Even if we reach this point, this still leaves the original problem - the model will not output anything more than a human would know, even if it has that knowledge internally. But even without fancy future interpretability tools, we may be heading in that direction with things like InstructGPT, where the model was fine-tuned to spit out things it was capable of saying, but wouldn't have said under pure sequence prediction.
This whole argument, together with rapid recent progress, is enough for me to not immediately write off language models, and consider strategies to take advantage of them if this scenario were to occur.