## Posts

## Comments

**interstice**on Consume fiction wisely · 2022-01-21T20:51:20.678Z · LW · GW

Seems to me that the 'helpful' works you listed contain falsehoods and wrong associations. They also contain useful information and enjoyable aspects, true -- but couldn't the same be said of lots of non-"rational" fiction? As it stands this just looks like a list of fiction that's popular among our subculture.

**interstice**on How I'm thinking about GPT-N · 2022-01-18T00:18:07.981Z · LW · GW

I continue to think 'neural nets just interpolate' is a bad criticism. Taken literally, it's obviously not true: nets are not exposed to anywhere near enough data points to interpolate their input space. On the other hand, if you think they are instead 'interpolating' in some implicit, higher-dimensional space which they project the data into, it's not clear that this limits them in any meaningful way. This is especially true if the mapping to the implicit space is itself learned, as seems to be the case in neural networks.

Regarding the 'Rashomon effect', I think it's clear that neural nets have *some* way of selecting relatively lower-complexity models, since there are also infinitely many possible models with good performance on the training set but terrible performance on the test set, yet the models learned reliably have good test set performance. Exactly how they do this is uncertain -- other commenters have already pointed out regularization is important, but the intrinsic properties of SGD/the parameter-function mapping likely also play a key role. It's an ongoing area of research.

It used to be thought that SGD sought out “flat minima” in the loss (minima with low curvature) which result in simpler models in terms of how compressible they are, but further studies have shown this isn’t really true.[11]

The paper you cited does not show this. Instead, they construct some (rather unnatural) nets at *sharp* minima which have good generalization properties. This is completely consistent with flat minima having good generalization properties, and with SGD seeking out flat minima.

**interstice**on You can't understand human agency without understanding amoeba agency · 2022-01-08T05:21:16.805Z · LW · GW

The problem with the free energy principle is that it's not *actually* systematic in the way that physics/math is systematic -- it's more like the *idea* of what a system encompassing its topics could be. Which mostly makes it a distraction from actually developing such a system.

**interstice**on You can't understand human agency without understanding amoeba agency · 2022-01-06T07:32:21.074Z · LW · GW

I agree this is a big blindspot. My take on the intellectual history here is that(crudely speaking) MIRI et al. have mostly pursued a 'top-down' approach to agency, starting with agents such as AIXI representing the limit of unbounded rationality and compute, and then attempted to 'downsize' them such that they can actually fit in our universe(e.g. logical inductors merely need ridiculously large amounts of compute, rather than hypercomputers). This seems like a reasonable strategy a priori; there's already a well-developed theory of idealized rationality in agents that you can start with and try to 'perturb' down to fit in the actual universe, and it's plausible that a superintelligence will bear a closer resemblance to such agents than amoebae. The 'amoeba-first' strategy is difficult in that a naïve approach will just lead you to learn a bunch of irrelevant details about amoebae, not generalizing usefully to higher intelligences; a large part of the problem consists in figuring out *what* about amoebae(or whatever other system) you actually want to study, which is somewhat nebulous in contrast to the idealized-agents-first approach. Nevertheless, it does seem that the idealized agents plan has stalled out to a certain degree in recent years, and MIRI(e.g. finite factored set stuff) and other alignment researchers(e.g. johnswentworth's natural abstraction stuff) have shifted more towards the amoeba side of things. I think the 'amoeba approach' has some big advantages in that you can more readily test your ideas or get new ones by examining natural systems, plus physics seems to be the only part of the universe that really cleanly obeys mathematical laws, so a concept of agency starting from physics seems more likely to generalize to arbitrarily powerful intelligences.

**interstice**on On (Not) Reading Papers · 2021-12-21T17:18:25.108Z · LW · GW

I didn't read the whole post, just the introduction and the bolded lines of the dialogue -- but from what I read, nice post! I think I agree. ETA: For things on the internet, I've also adopted the heuristic of reading the top few comments between reading the introduction and the body of the post.

**interstice**on Where can one learn deep intuitions about information theory? · 2021-12-16T17:15:03.926Z · LW · GW

**interstice**on The Plan · 2021-12-11T03:17:36.255Z · LW · GW

Excellent post! This seems like a highly promising and under-explored line of attack. I've had some vaguely similar thoughts over the years, but you've done a far better job articulating and developing a coherent programme. Bravo!

I think my biggest intuitive disagreement might be with whether it is likely to be possible to create some sort of *efficient* 'abstraction thermometer' or 'agency thermometer'. Searching for possible ways of finding agents or abstractions in a system seems like a prototypical NP-hard search problem. Now in practice it's often possible to solve such problems efficiently, but the setting with agents seems especially problematic in that keeping yourself obfuscated can be instrumentally useful, so I suspect the instances we're confronted with in the real world may be *adversarially selected* to be inscrutable to fast search methods in general.

**interstice**on More Christiano, Cotra, and Yudkowsky on AI progress · 2021-12-07T07:25:52.545Z · LW · GW

The 'grand story' Eliezer is referring to here isn't anything like these, though. That story is more like "there is a gradual increase in capability in all species, on a slow timescale; eventually one of them crosses the threshold of being able to produce culture which evolves on a faster timescale". Sort of the opposite of these species-parochialist tales.

**interstice**on Even if you're right, you're wrong · 2021-11-22T21:52:56.966Z · LW · GW

The first bullet point seems valid(for propositions with no empirical content)

**interstice**on Against the idea that physical limits are set in stone · 2021-11-12T09:05:00.405Z · LW · GW

Do you have reason to believe we will never collect surprising observations?

Sure, it's likely we'll get some new surprising observations before we nail down the TOE. The question is just about *how* surprising, and whether they will let us upend physical limits. Agreed that there's a lot of new interesting things we could observe, but for most of your examples, I don't think we have good reason to think that we'll learn new things about *fundamental physics* from them.

Let me rephrase my objection. I think my main issue with your post can be found in this phrase near the beginning: you speak of the "rate at which we have constantly upended our own physical theories". I don't think that progress in fundamental physics is like technological progress or other things which happen at a steady rate per unit effort. It's more like exploiting a non-renewable resource: our ignorance of physical phenomena. So 400 years ago we basically started with a huge 'reservoir' of ignorance, which has gradually been drained as our theories improved, until now there's only a few small pools left that we can see. The reason that we've seen steady progress until recently is due to our slow draining of this reservoir, so now that it's mostly gone, we no longer have a reason to expect further such steady progress. It's possible that we might find *new* reservoirs someday, but equally possible that we won't, so it's a reasonable assumption that many of our current theories' physical limits will continue to apply indefinitely into the future.

**interstice**on Against the idea that physical limits are set in stone · 2021-11-11T22:26:07.912Z · LW · GW

There's a big difference between our current state of knowledge regarding physics and previous eras': we now completely understand the physics of everyday existence. In the past, there were many, blatantly obvious unknowns -- e.g. Newtonian mechanics doesn't let you understand how chemistry arises from physics. Nowadays there's a lot less room for reality to surprise us with new observations -- indeed, theoretical physics has largely stalled out in recent years due to the infeasibility of obtaining observations our theories don't already predict. More generally, this seems to be what we should expect to happen in a lawful universe: after an initial period of discovery, we eventually discover all the laws and are done. What you propose -- an endless string of new discoveries, each upending the last -- is incompatible with the universe having a finite description. It's not logically impossible that we live in such a universe, but scientific progress so far seems to support finite lawfulness.

**interstice**on What specifically is the computation -> qualia theory? · 2021-11-03T11:04:11.737Z · LW · GW

It seems to me like there should be an infinite number of ways to interpret atoms’ vibrations as having information or even being transformed in a way that approximates the operation of an algorithm. I don’t know if any SIC campers actually worry about this, if they have specific requirements on how to tell if atoms are running algorithms, how likely they think this is to happen in practice, etc.

People do indeed worry about this, leading to things like 'Solomonoff-weighted utilitarianism' that assign higher moral relevance to minds with short description lengths.

**interstice**on [Book Review] "The Vital Question" by Nick Lane · 2021-09-28T23:50:54.395Z · LW · GW

Before seeing any evidence, we should indeed expect that life has high density in the universe. We just have enough data to rule that out. More generally I think UDASSA is probably the best framework for approaching problems like this, and it would hold that, in situations where our existence is contingent on an anthropically-selected unlikely event, we should still expect that this event is as likely as possible while being consistent with the evidence. So origination events with likelihood 10^-40 are more probable than origination events with likelihood 10^-400.

**interstice**on This Can't Go On · 2021-09-20T04:46:56.462Z · LW · GW

There's some discussion of this in a followup post.

**interstice**on Player vs. Character: A Two-Level Model of Ethics · 2021-09-12T07:39:42.394Z · LW · GW

What seems off to me is the idea that the 'player' is some sort of super-powerful incomprehensible Lovecraftian optimizer. I think it's more apt to think of it as like a monkey, but a monkey which happens to share your body and have write access to the deepest patterns of your thought and feeling (see Steven Byrnes' posts for the best existing articulation of this view). It's just a monkey; its desires aren't totally alien, and I think it's quite possible for one's conscious mind to develop a reasonably good idea of what it wants. That the OP prefers to push the 'alien/Lovecraftian' framing is interesting and perhaps indicates that they find what their monkey (and/or other people's monkeys) wants repulsive in some way.

**interstice**on I read “White Fragility” so you don’t have to (but maybe you should) · 2021-09-07T17:41:58.375Z · LW · GW

In rationalist circles, you might find out that you're being instrumentally or epistemically irrational in the course of a debate -- the norms of such a debate encourage you to rebut your opponent's points if you think they are being unfair. In contrast, the central thesis of this book is that white people disputing their racism is a mechanism for protecting white supremacy and needs to be unlearned, along with other cornerstones of collective epistemology such as the notion of objective knowledge. So under the epistemic conditions promoted by this book, I expect "found about being racist" to roughly translate to "was told you were racist".

**interstice**on The Codex Skeptic FAQ · 2021-08-25T20:38:12.073Z · LW · GW

I think those advancements could be evidence for both, depending on the details of how the nootropics work, etc. But it still seems worth distinguishing the two things conceptually. My objection in both cases is that only a small part of the evidence for the first comes from the *causal* impact of the second: i.e. if Codex gave crazy huge productivity improvements, I would consider that evidence for full code automation coming soon, but that's mostly because it suggests that Codex can likely be improved to the point of FCA, *not* because it will make OpenAI's programmers more productive.

**interstice**on The Codex Skeptic FAQ · 2021-08-25T00:51:50.450Z · LW · GW

Regarding your first point, I think when people say that language models "don't bring us closer to full code automation" they mean there's no way of improving/upgrading language models such that they implement full code automation. I think it would be better to argue against that claim directly instead of bringing up language models' productivity-boosting effects. There are many things that could potentially boost programmers' productivity -- better nootropics, say -- but it seems overly broad to say that they all "bring us closer to full code automation", even if it might be causally true that they reduce the time to automation in expectation.

**interstice**on Why did no LessWrong discourse on gain of function research develop in 2013/2014? · 2021-06-19T15:54:32.790Z · LW · GW

I actually agree with you there, there was always discussion of GCR along with extinction risks(though I think Eliezer in particular was more focused on extinction risks). However, they're still distinct categories: even the deadliest of pandemics is unlikely to cause extinction.

**interstice**on Why did no LessWrong discourse on gain of function research develop in 2013/2014? · 2021-06-19T15:36:07.712Z · LW · GW

Killing 90% of the human population would not be enough to cause extinction. That would put us at a population of 800 million, higher than the population in 1700.
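The arithmetic behind that figure can be spelled out in a few lines. This is only a sketch: the 8 billion starting population is an assumption implied by the 800 million figure, and the 1700 population is a rough historical estimate rather than a number from the comment.

```python
# Assumed inputs (not stated in the comment itself).
WORLD_POPULATION = 8_000_000_000   # assumption: roughly 8 billion people today
POPULATION_1700 = 600_000_000      # assumption: rough estimate for the year 1700
FRACTION_KILLED_PERCENT = 90

# Survivors after a catastrophe that kills 90% of people.
survivors = WORLD_POPULATION * (100 - FRACTION_KILLED_PERCENT) // 100

print(f"survivors: {survivors:,}")
print(f"more than the assumed 1700 population: {survivors > POPULATION_1700}")
```

Under these assumptions the surviving population is still larger than the assumed population of 1700, which is the comparison the comment is drawing.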

**interstice**on Reply to Nate Soares on Dolphins · 2021-06-10T23:00:22.624Z · LW · GW

It could be considered an essence, but physical rather than metaphysical.

**interstice**on The Homunculus Problem · 2021-05-27T21:35:01.458Z · LW · GW

This feels related to metaphilosophy to me, in the sense that one of the core difficulties of metaphilosophy seems to be that, in coming up with a 'model' agent, you need to create an agent that is not only capable of thinking about its own structure, but capable of being *confused* about what that structure is (and presumably, of becoming un-confused). Bayesian etc. approaches can model agents being confused about object-level things, but it's hard to even imagine what a model of an agent confused about ontology would look like.

**interstice**on SGD's Bias · 2021-05-19T19:52:43.229Z · LW · GW

Another example of this sort of thing: least-rattling feedback in driven systems.

**interstice**on Parsing Chris Mingard on Neural Networks · 2021-05-15T21:13:44.031Z · LW · GW

Perhaps this is a physicist vs mathematician type of thinking though

Good guess ;)

This is not the same as saying that an extremely wide trained-by-random-sampling neural network would not learn features—there is a possibility that the first time you reach 100% training accuracy corresponds to effectively randomly initialised initial layers + trained last layer, but in expectation all the layers should be distinct from an entirely random initialisation.

I see -- so you're saying that even though the distribution of *output* functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of *intermediate* functions might change. If true, this would mean that feature learning and inductive bias are 'uncoupled' for infinite-width randomly-sampled nets. I think this is false, however -- that is, I think it's provable that the distribution of intermediate functions does *not* change in the infinite-width limit when you condition on the training data, even when conditioning over all layers. I can't find a reference offhand though, I'll report back if I find anything resolving this one way or another.

**interstice**on Parsing Chris Mingard on Neural Networks · 2021-05-14T03:14:12.170Z · LW · GW

The claim I am making is that the reason why feature learning is good is not because it improves inductive bias—it is because it allows the network to be compressed. That is probably at the core of our disagreement.

Yes, I think so. Let's go over the 'thin network' example -- we want to learn some function which can be represented by a thin network. But let's say a randomly-initialized thin network's intermediate functions won't be able to fit the function -- that is (with high probability over the random initialization) we won't be able to fit the function just by changing the parameters of the last layer. It seems there are a few ways we can alter the network to make fitting possible:

(A) Expand the network's width until (with high probability) it's possible to fit the function by only altering the last layer

(B) Keeping the width the same, re-sample the parameters in all layers until we find a setting that can fit the function

(C) Keeping the width the same, train the network with SGD

By hypothesis, all three methods will let us fit the target function. You seem to be saying (I think; correct me if I'm wrong) that all three methods should have the same inductive bias as well. I just don't see any reason this should be the case -- on the face of it, I would guess that all three have different inductive biases (though A and B might be similar). They're clearly different in some respects -- (C) can do transfer learning but (A) cannot (B is unclear).
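To make option (B) concrete, here is a minimal sketch of "re-sample the parameters until we find a setting that can fit the function". It uses a single threshold unit on a 2-input AND target as a stand-in for a thin network; the target, the Gaussian prior, and the names `predict`, `fits`, and `resample_until_fit` are all hypothetical choices for illustration, not anything from the papers under discussion.

```python
import random

# Hypothetical toy target: the 2-input AND function.
TRAIN = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def predict(params, x):
    """Single threshold unit: outputs 1 when w1*x1 + w2*x2 + b > 0."""
    w1, w2, b = params
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0

def fits(params, data):
    """True when the unit reproduces every training label."""
    return all(predict(params, x) == y for x, y in data)

def resample_until_fit(data, rng, max_tries=100_000):
    """Option (B): redraw all parameters from the prior until the data is fit."""
    for tries in range(1, max_tries + 1):
        params = tuple(rng.gauss(0.0, 1.0) for _ in range(3))
        if fits(params, data):
            return params, tries
    raise RuntimeError("no fitting parameters found within max_tries")

rng = random.Random(0)
params, tries = resample_until_fit(TRAIN, rng)
print(f"fit found after {tries} draws: {params}")
```

The solution returned this way is a draw from the prior conditioned on fitting the data, which is why (B) is the natural point of comparison for the "SGD as a random sampler" picture; nothing in the sketch says whether (C) would land on the same distribution.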

What do we know about SGD-trained nets that suggests this?

My intuition here is that SGD-trained nets can learn functions non-linearly while NTK/GP can only do so linearly. So in the car detector example, SGD is able to develop a neuron detecting cars through some as-yet unclear 'feature learning' mechanism. The NTK/GP can do so as well, sort of, since they're universal function approximators. However, the way they do this is by taking a giant linear combination of random functions which is able to function identically to a car detector on the data points given. It seems like this might be more fragile/generalize worse than the neurons produced by SGD. Though that is admittedly somewhat conjectural at this stage, since we don't really have a great understanding of how feature learning in SGD works.

I’ve read the new feature learning paper! We’re big fans of his work, although again I don’t think it contradicts anything I’ve just said.

ETA: Let me elaborate upon what I see as the significance of the 'feature learning in infinite nets' paper. We know that NNGP/NTK models can't learn features, but SGD can: I think this provides strong evidence that they are learning using different mechanisms, and likely have substantially different inductive biases. The question is whether randomly sampled *finite* nets can learn features as well. Since they are equivalent to NNGP/NTK at infinite width, any feature learning they do *can only come from finiteness*. In contrast, in the case of SGD, it's possible to do feature learning *even in the infinite-width limit*. This suggests that even if randomly-sampled finite nets can do feature learning, the mechanism by which they do so is different from SGD, and hence their inductive bias is likely to be different as well.

**interstice**on Parsing Chris Mingard on Neural Networks · 2021-05-12T20:51:47.251Z · LW · GW

First thank you for your comments and observations—it’s always interesting to read pushback

And thanks for engaging with my random blog comments! TBC, I think you guys are definitely on the right track in trying to relate SGD to function simplicity, and the empirical work you've done fleshing out that picture is great. I just think it could be even *better* if it was based around a better SGD scaling limit ;)

Therefore, if an optimiser samples functions proportional to their volume, you won’t get any difference in performance if you learn features (optimise the whole network) or do not learn features (randomly initialise and freeze all but the last layer and then train just the last).

Right, this is an even better argument that NNGPs/random-sampled nets don't learn features.

Given therefore that the posteriors are the same, it implies that feature learning is not aiding inductive bias—rather, feature learning is important for expressivity reasons

I think this only applies to NNGP/random-sampled nets, not SGD-trained nets. To apply to SGD-trained nets, you'd need to show that the new features learned by SGD have the same distribution as the features found in an infinitely-wide random net, but I don't think this is the case. As an illustration, some SGD-trained nets can develop expressive neurons like 'car detector', enabling them to fit the data with a relatively small number of such neurons. If you used an NNGP to learn the same thing, you wouldn't get a single 'car detector' neuron, but rather some huge linear combination of high-frequency features that can approximate the cars seen in the dataset. I think this would probably generalize worse than the network with an actual 'car detector' (this isn't empirical evidence of course, but I think what we know about SGD-trained nets and the NNGP strongly suggests a picture like this).

Furthermore (and on a slightly different note), it is known that infinitesimal GD converges to the Boltzmann distribution for any DNN (very similar to random sampling)

Interesting, haven't seen this before. Just skimming the paper, it sounds like the very small learning rate + added white noise might result in different limiting behavior from usual SGD. Generally it seems that there are a lot of different possible limits one can take; empirically SGD-trained nets do seem to have 'feature learning', so I'm skeptical of limits that don't have that. (I assume they don't have it for theoretical reasons, anyway. It would be interesting to actually examine the features found in networks trained like this, and to see if they can do transfer learning at all.) Re: 'colored noise', I'm not sure to what extent this matters. I think a more likely source of discrepancy is the *lack* of white noise in normal training (I guess this counts as 'colored noise' in a sense) and the larger learning rate.

if anyone can point out why this line of argument is not correct, or can steelman a case for SGD inductive bias appearing at larger scales, I would be very interested to hear it.

Not to be a broken record, but I strongly recommend checking out Greg Yang's work. He clearly shows that there exist infinite-width limits of SGD that can do feature/transfer learning.

**interstice**on Parsing Chris Mingard on Neural Networks · 2021-05-12T05:32:34.359Z · LW · GW

I think we basically agree on the state of the empirical evidence -- the question is just whether NTK/GP/random-sampling methods will continue to match the performance of SGD-trained nets on more complex problems, or if they'll break down, ultimately being a first-order approximation to some more complex dynamics. I think the latter is more likely, mostly based on the lack of feature learning in NTK/GP/random limits.

re: the architecture being the source of inductive bias -- I certainly think this is true in the sense that architecture choice will have a bigger effect on generalization than hyperparameters, or the choice of which local optimizer to use. But I do think that using a local optimizer at all, as opposed to randomly sampling parameters, is likely to have a large effect.

**interstice**on Parsing Chris Mingard on Neural Networks · 2021-05-12T05:02:50.779Z · LW · GW

Yeah, I didn't mean to imply that you guys said 'simple --> large volume' anywhere. I just think it's a point worth emphasizing, especially around here where I think people will imagine "Solomonoff Induction-like" when they hear about a "bias towards simple functions"

Also, very briefly on your comment on feature learning—the GP limit is used to calculate the volume of functions locally to the initialization. The fact that kernel methods do not learn features should not be relevant given this interpretation

But in the infinite-width setting, Bayesian inference in general is given by a GP limit, right? Initialization doesn't matter. This means that the arguments for lack of feature learning still go through. It's technically possible that there could be feature learning in *finite*-width randomly-sampled networks, but it seems strange that finiteness would help here(and any such learning would be experimentally inaccessible). This is a major reason that I'm skeptical of the "SGD as a random sampler" picture.

**interstice**on Challenge: know everything that the best go bot knows about go · 2021-05-12T03:25:34.048Z · LW · GW

If your goal is to play as well as the best go bot and/or write a program that plays equally well from scratch, it seems like it's probably impossible. A lot of the go bot's 'knowledge' could well be things like "here's a linear combination of 20000 features of the board predictive of winning". There's no reason for the coefficients of that linear combination to be compressible in any way; it's just a mathematical fact that these particular coefficients happen to be the best at predicting winning. If you accepted "here the model is taking a giant linear combination of features" as "understanding" it might be more doable.

**interstice**on Parsing Chris Mingard on Neural Networks · 2021-05-07T20:11:06.486Z · LW · GW

Is that the empirical evidence attempts to demonstrate simple --> large volume but is inconclusive, or is it that the empirical evidence does not even attempt to demonstrate simple --> large volume?

They don't really try to show simple --> large volume. What they show is that there is substantial 'clustering', so *some* simple functions have large volume. I like nostalgebraist's remarks on their clustering measures.

so it seems a little unfair to say that the evidence is that the performance is similar, since that would suggest that they were just comparing max performance by SGD to max performance by NNGP.

Fair point, they do compare the distributions as well. I don't think it's too surprising that they're similar since they compare them on the test points of the distribution which they were trained to fit.

It sounds like you do think there is some chance that neural network generalization is due to an architectural bias towards simplicity

I do, although I'm not sure if I would say 'architectural bias' since I think SGD might play an important role. Unfortunately I don't really have too much substantial to say about this -- Mingard is the only researcher I'm aware of explicitly trying to link networks to simplicity priors. I think the most promising way to make progress here is likely to be analyzing neural nets in some non-kernel limit like Greg Yang's work or this paper.

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-05-07T18:52:38.735Z · LW · GW

They would exist in a *sufficiently* big random NN, but their density would be extremely low I think. Like, if you train a normal neural net with 15000 neurons and then there's a car detector, the density of car detectors is now 1/15000. Whereas I think the density at initialization is probably more like 1/2^50 or something like that (numbers completely made up), so they'd have a negligible effect on the NTK's learning ability ('slight tweaks' can't happen in the NTK regime since no intermediate functions change by definition).

A difference with the pruning case is that the number of possible prunings increases exponentially with the number of neurons, while the number of neurons itself grows only linearly. My take on the LTH is that pruning is basically just a weird way of doing optimization, so it's not that surprising you can get good performance.
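The orders of magnitude in this comment can be checked with a short calculation. The 15000 and 1/2^50 figures are the comment's own (the latter explicitly made up); the helper `num_prunings` is a hypothetical name, and counting one keep/drop choice per neuron is a simplifying assumption about what "a pruning" means.

```python
from fractions import Fraction

TRAINED_WIDTH = 15_000               # neurons in the trained net (figure from the comment)
INIT_DENSITY = Fraction(1, 2**50)    # made-up density of car detectors at initialization

trained_density = Fraction(1, TRAINED_WIDTH)
# Expected number of car detectors in a random net of the same width.
expected_at_init = TRAINED_WIDTH * INIT_DENSITY
# Width at which a random net contains one car detector on average.
width_for_one = 1 / INIT_DENSITY

def num_prunings(n_neurons: int) -> int:
    """Number of keep/drop masks over n neurons (assumes one binary choice per neuron)."""
    return 2 ** n_neurons

print(f"density ratio (trained / init): {float(trained_density / INIT_DENSITY):.3e}")
print(f"expected car detectors at init: {float(expected_at_init):.3e}")
print(f"width needed for one on average: {width_for_one}")
print(f"prunings of a 50-neuron layer: {num_prunings(50)}")
```

With these made-up numbers, a random net would need on the order of 2^50 neurons to contain a single car detector, whereas a layer of only 50 neurons already has 2^50 possible prunings, which is the asymmetry the comment points at.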

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-05-07T17:54:54.393Z · LW · GW

Yeah, that summary sounds right.

I'd say (b) -- it seems quite unlikely to me that the NTK/GP are universally data-efficient, while neural nets might be(although that's mostly speculation on my part). I think the lack of feature learning is a stronger argument that NTK/GP don't characterize neural nets well.

**interstice**on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-05-07T17:45:53.888Z · LW · GW

Yeah, exactly -- the problem is that there are some small-volume functions which are actually simple. The argument for small-volume --> complex doesn't go through since there could be other ways of specifying the function.

Other senses of simplicity include various circuit complexities and Levin complexity. There's no argument that parameter-space volume corresponds to either of them AFAIK(you might think parameter-space volume would correspond to "neural net complexity", the number of neurons in a minimal-size neural net needed to compute the function, but I don't think this is true either -- every parameter is Gaussian so it's unlikely for most to be zero)
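A toy enumeration shows how a function can be simple yet have tiny (here, zero) parameter-space volume. This is a sketch under loud assumptions: the "network" is a single threshold unit on two binary inputs, the parameter space is a small integer grid rather than Gaussian weights, and "simple" is meant only informally (XOR has a very short description). It illustrates the logical gap, not the behaviour of real deep networks.

```python
from collections import Counter
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
WEIGHT_GRID = [-2, -1, 0, 1, 2]   # hypothetical discrete parameter space

def truth_table(w1, w2, b):
    """Function computed by one threshold unit, as a tuple of outputs on INPUTS."""
    return tuple(1 if w1 * x1 + w2 * x2 + b > 0 else 0 for x1, x2 in INPUTS)

# "Volume" of a function = number of parameter settings that map to it.
volume = Counter(
    truth_table(w1, w2, b) for w1, w2, b in product(WEIGHT_GRID, repeat=3)
)

XOR = (0, 1, 1, 0)
for table, count in volume.most_common():
    print(table, count)
print("XOR volume:", volume[XOR])
```

The volumes are far from uniform, so a few functions dominate, but XOR never appears at all because it is not linearly separable. Large volume can therefore go along with simplicity in this map without every simple function getting large volume.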

**interstice**on Parsing Chris Mingard on Neural Networks · 2021-05-07T03:03:14.564Z · LW · GW

For reasons elaborated upon in this post and its comments, I'm kinda skeptical of these results. Basically the claims made are

(A) That the parameter->function map is "biased towards simple functions". It's important to distinguish simple --> large volume and large volume --> simple. Simple --> large volume is the property that Solomonoff induction has and what makes it universal, but large volume --> simple is what is proven in these papers (plus some empirical evidence of unclear import).

(B) That SGD is equivalent to random selection. The evidence is the empirical performance of Gaussian processes being similar to neural nets on simple tasks. But this may break down on more difficult problems (link is about the NTK, not GP, but they tend to perform similarly; indeed NTK usually performs better than GP).

Overall I think it's likely we'll need to actually analyze SGD in a non-kernel limit to get a satisfactory understanding of "what's really going on" with neural nets.

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-05-07T00:24:10.619Z · LW · GW

There's an important distinction[1] to be made between these two claims:

A) Every function with large volume in parameter-space is simple

B) Every simple function has a large volume in parameter space

For a method of inference to qualify as a 'simplicity prior', you want both claims to hold. This is what lets us derive bounds like 'Solomonoff induction matches the performance of any computable predictor', since all of the simple, computable predictors have relatively large volume in the Solomonoff measure, so they'll be picked out after boundedly many mistakes. In particular, you want there to be an implication like, if a function has complexity $C$, it will have parameter-volume at least $\exp(-\beta C)$ for some constant $\beta > 0$.
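The "boundedly many mistakes" step can be sketched with the standard Bayesian mixture bound. The notation below ($M$, $w_j$, $P_j$, $C_j$, $\beta$) is introduced here for illustration and is an assumption about how one might formalize the comment, not a quotation from it. Let $M$ be a prior-weighted mixture over predictors $P_i$ with weights (volumes) $w_i$, and suppose predictor $P_j$ has complexity $C_j$ and volume $w_j \ge e^{-\beta C_j}$.

```latex
\begin{align*}
M(x_{1:n}) &= \sum_i w_i \, P_i(x_{1:n}) \;\ge\; w_j \, P_j(x_{1:n}) \\
\Longrightarrow\quad
-\ln M(x_{1:n}) &\le -\ln P_j(x_{1:n}) + \ln\frac{1}{w_j} \\
&\le -\ln P_j(x_{1:n}) + \beta\, C_j .
\end{align*}
```

The excess log-loss of the mixture over $P_j$ is at most $\beta C_j$ nats no matter how long the sequence is. The argument needs a lower bound on $w_j$ for every simple $P_j$, which is exactly claim (B); claim (A) alone gives no such bound.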

Now, the Mingard results, at least the ones that have mathematical proof, rely on the Levin bound. This only shows (A), which is the direction that is much easier to prove -- it automatically holds for any mapping from parameter-space to functions with bounded complexity. They also have some empirical results that show there is substantial 'clustering', that is, there are *some* simple functions that have large volumes. But this still doesn't show that all of them do, and indeed is compatible with the learnable function class being extremely limited. For instance, this could easily be the case even if NTK/GP was only able to learn linear functions. In reality the NTK/GP is capable of approximating arbitrary functions on finite-dimensional inputs but, as I argued in another comment, this is not the right notion of 'universality' for classification problems. I strongly suspect^{[2]} that the NTK/GP can be shown to not be 'universally data-efficient' as I outlined there, but as far as I'm aware no one's looked into the issue formally yet. Empirically, I think the results we have so far suggest that the NTK/GP is a decent first-order approximation for simple tasks that tends to perform worse on the more difficult problems that require non-trivial feature learning/efficiency.

[1] I actually posted basically the same thing underneath another one of your comments a few weeks ago, but maybe you didn't see it because it was only posted on LW, not the Alignment Forum. ↩︎

[2] Basically, because in the NTK/GP limit the functions for all the neurons in a given layer are sampled from a single computable distribution, I think you can show that the embedding is 'effectively finite' in some sense (although note it *is* a universal approximator for fixed input dimension). ↩︎

**interstice**on AMA: Paul Christiano, alignment researcher · 2021-04-30T17:11:10.317Z · LW · GW

Have you read much philosophy? If so, what are your favorite books/articles?

**interstice**on AMA: Paul Christiano, alignment researcher · 2021-04-30T01:16:58.979Z · LW · GW

Any thoughts on the Neural Tangent Kernel/Gaussian Process line of research? Or attempts to understand neural network training at a theoretical level more generally?

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-24T21:20:15.826Z · LW · GW

By universal approximation, these features will be sufficient for any downstream learning task

Right, but trying to fit an unknown function with linear combinations of those features might be *extremely data-inefficient*, such that it is basically unusable for difficult tasks. Of course you could do better if you're not restricted to linear combinations -- for instance, if the map is injective you could invert back to the original space and apply whatever algorithm you wanted. But at that point you're not really using the Fourier features at all. In particular, the NTK always learns a linear combination of its features, so it's the efficiency of linear combinations that's relevant here.

I agree that there is no learning taking place and that such a method may be inefficient. However, that goes beyond my original objection.

You originally said that the NTK doesn't learn features because its feature class already has a good representation at initialization. What I was trying to convey (rather unclearly, admittedly) in response is:

A) There exist learning algorithms that have universal-approximating embeddings at initialization yet learn features. If we have an example of P and !Q, P-->Q cannot hold in general, so I don't think it's right to say that the NTK's lack of feature learning is due to its universal-approximating property.

B) Although the NTK's representation may be capable of approximating arbitrary functions, it will probably be very *slow* at learning some of them, perhaps so slow that using it is infeasible. So I would dispute that it already has 'good' representations. While it's universal in one sense, there might be some other sense of 'universal efficiency' in which it's lacking, and where feature-learning algorithms can outperform it.

This is not a trivial question

I agree that in practice there's likely to be some relationship between universal approximation and efficiency, I just think it's worth distinguishing them conceptually. Thanks for the paper link BTW, it looks interesting.

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-24T02:23:24.218Z · LW · GW

Ah, rereading your original comment more carefully I see that you indeed didn't say anything about 'universal learning'. You're quite right that the NTK is a universal function approximator. My apologies.

However, I still disagree that this is the reason that the NTK doesn't learn features. I think that 'universal function approximation' and 'feature learning' are basically unrelated dimensions along which a learning algorithm can vary. That is, it's quite possible to imagine a learning algorithm which constructs a sequence of different embeddings, *all* of which are universal approximators. The paper by Greg Yang I linked gives an example of such an algorithm (I don't think he explicitly proves this, but I'm pretty sure it's true).

What I was trying to get at with the 'universal learning' remarks is that, although the NTK does indeed contain all finite embeddings, I believe that it does not do so in a very efficient way -- it might require disproportionately many training points to pick out what are, intuitively, fairly simple embeddings. I believe this is what is behind the poor performance of empirical NTKs compared to SGD-trained nets, as I brought up in this comment, and ultimately explains why algorithms that do 'feature learning' can outperform those that don't -- the feature learning algorithms are able to find more efficient embeddings for a given set of inputs (of course, it's possible to imagine a *fixed* embedding that's 'optimally efficient' in some way, but as far as I'm aware the NTK has no such property). This issue of 'embedding efficiency' seems only loosely related to the universal approximation property. To formalize this, it would be nice to develop a theory of universal inference in the setting of classification problems akin to Solomonoff induction. To effectively model this in an asymptotic theory, I think it might be necessary to increase the dimension of the model input along with the number of data points, since otherwise all universal approximators for a given dimension will have asymptotically the same performance. Everything in this paragraph is just my personal speculation though; as far as I'm aware there's no existing theory of universal inference in classification problems, so if you found my remarks confusing that's pretty understandable :)

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T21:48:16.174Z · LW · GW

I think nets are usually increased in depth as well as width when they are 'scaled up', so the NTK limit doesn't apply -- the convergence to NTK is controlled by the ratio of depth to width, only approaching a deterministic kernel if this ratio approaches 0.

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T21:29:13.916Z · LW · GW

There's a big difference between 'universal learner' and 'fits any smooth function on a fixed input space'. The 'universal learner' property is about data efficiency: do you have bounded regret compared to any learning algorithm in some wide class? Solomonoff induction has this property with respect to computable predictors on binary strings, for instance. There are lots of learning algorithms able to fit any finite binary sequence but which are not universal. I haven't seen a good formalism for this in the neural net case, but I think it would involve letting the input dimension increase with the number of data points, and comparing the asymptotic performance of various algorithms.

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T20:25:30.733Z · LW · GW

So I read through the Maddox et al. study, and it definitely does *not* show that the NTK can do transfer learning. They pre-train *using SGD* on a single task, then use the NTK computed on the trained network to do Bayesian inference on some other tasks. They say in a footnote on page 9, "Note that in theory, there is no need to train the network at all. We found that it is practically useful to train the network to learn good representations." This makes me suspect that they tried using the NTK to learn the transfer parameters but it didn't work.

Regarding the empirical results about the NTK explaining the performance of neural nets, I found this study interesting. They computed the 'empirical NTK' on some finite-width networks and compared the performance of the solution found by SGD to that found by solving the NTK. For standard widths, the NTK solution performed substantially worse (up to a 20% drop in accuracy). The gap closed to some extent, but not completely, upon making the network much wider. The size of the gap also correlated with the complexity of the task (0.5% gap for MNIST, 13% for CIFAR, 18% for a subset of ImageNet). The trajectory of the weights also diverged substantially from the NTK prediction, even on MNIST. All of this seems consistent with the NTK being a decent first-order approximation that breaks down on the really hard tasks that require the networks to do non-trivial feature learning.

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T17:12:46.002Z · LW · GW

Yes, and indeed in the NTK limit we can model ordinary training that way.

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T17:11:34.522Z · LW · GW

NTK doesn’t learn features because the feature class at initialization is a universal class

I've never heard of any result suggesting this; what's your argument? I suspect the opposite -- by the central limit theorem, the partial derivatives and activations at each layer tend toward samples from a fixed distribution (differing per layer but fixed across neurons). I think this means that the NTK embedding is 'essentially finite' and actually not universal (though I'm not sure). Note that to show universality it's not enough to show that all embeddings can be found; you'll also need an argument showing that their density in the NTK embedding is bounded above zero.

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T05:33:18.015Z · LW · GW

Hmm, so regarding the linear combinations, it's true that there are *some* linear combinations that will change by $\Theta(1)$ in the large-width limit -- just use the vector of partial derivatives of the output at some particular input; this sum will change by the amount that the output function moves during the regression. Indeed, I suspect (but don't have a proof) that these particular combinations will span the space of linear combinations that change non-trivially during training. I would dispute "we expect most linear combinations to change" though -- the CLT argument implies that we should expect almost all combinations to *not* appreciably change. I'm not sure what effect this would have on the PCA, and I still think it's plausible that it doesn't change at all (actually, I think Greg Yang states that it doesn't change in section 9 of his paper; I haven't read that part super carefully though).

And the tangent kernel not changing does not imply that transfer learning won’t work

So I think I was a bit careless in saying that the NTK can't do transfer learning at all -- a more exact statement might be "the NTK does the minimal amount of transfer learning possible". What I mean by this is, *any* learning algorithm can do transfer learning if the task we are 'transferring' to is sufficiently similar to the original task -- for instance, if it's just the exact same task but with a different data sample. I claim that the 'transfer learning' the NTK does is of this sort. As you say, since the tangent kernel doesn't change at all, the net effect is to move where the network starts in the tangent space. Disregarding convergence speed, the impact this has on generalization is determined by the values set by the old function on axes of the NTK outside of the span of the partial derivatives at the new function's data points. This means that, for the NTK to transfer anything from one task to another, it's not enough for the tasks to both feature, for instance, eyes. It's that the eyes have to correlate with the output in the *exact same way* in both tasks. Indeed, the transfer learning could actually hurt the generalization. Nor is its effect invariant under simple transformations like flipping the sign of the target function (this would change beneficial transfer to harmful). By default, for functions that aren't simple multiples, I expect the linear correlation between values on different axes to be about 0, even if the functions share many meaningful features. So while the NTK can do 'transfer learning' in a sense, it's about as weak as possible, and I strongly doubt that this sort of transfer is sufficient to explain transfer learning's successes in practice (but I don't have empirical proof).
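The "moving the starting point" picture can be written down explicitly. As a sketch, assume squared loss, gradient flow run to convergence in the linearized (fixed-kernel) regime, and an invertible kernel matrix on the new task's training inputs. The symbols here are chosen for illustration: $f_0$ is the function the network computes after the old task, $X$ and $y$ are the new task's training inputs and targets, and $\Theta$ is the unchanged tangent kernel.

$$f_\infty(x) \;=\; f_0(x) \;+\; \Theta(x, X)\,\Theta(X, X)^{-1}\,\bigl(y - f_0(X)\bigr)$$

The old task enters only through $f_0$: additively at the test point, and through the residual $y - f_0(X)$ at the training points. The kernel $\Theta$, which determines how information spreads from training points to test points, is the same as it would be with no pre-training at all. Replacing $f_0$ with $-f_0$ flips the sign of both contributions, which is the sign-flip sensitivity described above.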

I do think the empirical results pretty strongly suggest that the NTK/GP model captures everything important about neural nets, at least in terms of their performance on the original task.

It's true that NTK/GP perform pretty closely to finite nets on the tasks we've tried them on so far, but those tasks are pretty simple and we already had decent non-NN solutions. Generally the pattern is "GP matches NNs on really simple tasks, NTK on somewhat harder ones". I think the data we have is consistent with this breaking down as we move to the harder problems that have no good non-NN solutions. I would be very interested in seeing an experiment with NTK on, say, ImageNet for this reason, but as far as I know no one's done so because of the prohibitive computational cost.

I only found one directly-relevant study, which is on way too small and simple a system for me to draw much of a conclusion from it, but it does seem to have worked.

Thanks for the link -- will read this tomorrow.

BTW, thanks for humoring me throughout this thread. This is really useful, and my understanding is updating considerably.

And thank you for engaging in detail -- I have also found this very helpful in forcing me to clarify (partially to myself) what my actual beliefs are.

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-22T18:46:20.922Z · LW · GW

I don't think taking linear combinations will help, because adding terms to the linear combination will also increase the magnitude of the original activation vector -- e.g. if you add together $N$ units, the magnitude of the sum of their original activations will with high probability be $\Theta(\sqrt{N})$, dwarfing the $O(1)$ change due to change in the activations. But regardless, it can't help with transfer learning at all, since the tangent kernel (which determines learning in this regime) doesn't change by definition.
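A quick numerical sanity check of the scaling claim, as a toy sketch rather than a model of an actual network. It assumes the original activations are i.i.d. standard normal and that each unit moves by an independent, mean-zero amount of scale 1/√N; both are simplifying assumptions made only for this illustration, and `sum_vs_change` is a hypothetical helper name.

```python
import numpy as np

def sum_vs_change(n_units, n_trials=200, seed=0):
    """RMS size of a sum of N activations vs. RMS size of the change in that sum."""
    rng = np.random.default_rng(seed)
    # Assumption: original activations are i.i.d. with mean 0 and scale 1.
    acts = rng.standard_normal((n_trials, n_units))
    # Assumption: each unit moves by an independent amount of scale 1/sqrt(N).
    deltas = rng.standard_normal((n_trials, n_units)) / np.sqrt(n_units)
    rms_sum = np.sqrt(np.mean(acts.sum(axis=1) ** 2))
    rms_change = np.sqrt(np.mean(deltas.sum(axis=1) ** 2))
    return rms_sum, rms_change

for n in (100, 10_000):
    s, c = sum_vs_change(n)
    print(f"N={n:6d}  |sum of activations| ~ {s:8.2f}   |change in sum| ~ {c:5.2f}")
```

Under these assumptions the sum grows like √N while its change stays of order 1, so the relative change in the summed unit shrinks as the width grows.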

What empirical results do you think are being contradicted? As far as I can tell, the empirical results we have are 'NTK/GP have similar performance to neural nets on some, but not all, tasks'. I don't think transfer/feature learning is addressed at all. You might say these results are suggestive evidence that NTK/GP captures everything important about neural nets, but this is precisely what is being disputed with the transfer learning arguments.

I can imagine doing an experiment where we find the 'empirical tangent kernel' of some finite neural net at initialization, solve the linear system, and then analyze the activations of the resulting network. But it's worth noting that this is not what is usually meant by 'NTK' -- that usually includes taking the infinite-width limit at the same time. And to the extent that we expect the activations to change at all, we no longer have reason to think that this linear system is a good approximation of SGD. That's what the above mathematical results mean -- the same mathematical analysis that implies that network training is like solving a linear system, *also* implies that the activations don't change at all.
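A minimal sketch of what that experiment could look like, under loud assumptions: a two-layer tanh network in NTK-style parameterization, a hand-written Jacobian, squared loss, and tiny random data. All function names, the architecture, and the sizes are hypothetical choices for illustration. This is the finite-width 'empirical' kernel, not the infinite-width NTK.

```python
import numpy as np

def hidden(params, X):
    W, _ = params
    return np.tanh(X @ W.T / np.sqrt(W.shape[1]))        # (n, width)

def net(params, X):
    W, a = params
    return hidden(params, X) @ a / np.sqrt(W.shape[0])   # (n,)

def jacobian(params, X):
    """Derivative of each output w.r.t. all parameters, flattened as [W, a]."""
    W, a = params
    width, dim = W.shape
    z = hidden(params, X)
    # d f / d W_ij = a_i * (1 - z_i^2) * x_j / sqrt(width * dim)
    d_W = (a * (1 - z**2))[:, :, None] * X[:, None, :] / np.sqrt(width * dim)
    d_a = z / np.sqrt(width)
    return np.concatenate([d_W.reshape(len(X), -1), d_a], axis=1)

def solve_tangent_system(params, X_train, y_train):
    J = jacobian(params, X_train)
    K = J @ J.T                                           # empirical tangent kernel
    alpha = np.linalg.solve(K, y_train - net(params, X_train))
    return J, alpha

def linearized_predict(params, X_train, y_train, X_test):
    J, alpha = solve_tangent_system(params, X_train, y_train)
    return net(params, X_test) + jacobian(params, X_test) @ J.T @ alpha

def updated_params(params, X_train, y_train):
    """Minimum-norm parameter change that solves the linear system."""
    W, a = params
    J, alpha = solve_tangent_system(params, X_train, y_train)
    delta = J.T @ alpha
    return W + delta[: W.size].reshape(W.shape), a + delta[W.size:]

rng = np.random.default_rng(0)
params = (rng.standard_normal((512, 5)), rng.standard_normal(512))
X_train, y_train = rng.standard_normal((5, 5)), rng.standard_normal(5)
new_params = updated_params(params, X_train, y_train)
drift = hidden(new_params, X_train) - hidden(params, X_train)
print("RMS change in hidden activations:", np.sqrt(np.mean(drift**2)))
```

The last lines are the "analyze the activations" step: plug the solved parameter change back into the finite network and measure how far the hidden activations moved. Repeating this at several widths would show how quickly that movement shrinks.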

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-22T17:03:29.303Z · LW · GW

The result that NTK does not learn features in the large N limit is not in dispute at all -- it's right there on page 15 of the original NTK paper, and indeed holds after arbitrarily many steps of backprop. I don't think that there's really much room for loopholes in the math here. See Greg Yang's paper for a lengthy proof that this holds for all architectures. Also worth noting that when people 'take the NTK limit' they often don't initialize an actual net at all, they instead use analytical expressions for what the inner product of the gradients would be at N=infinity to compute the kernel directly.

**interstice**on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-22T05:52:53.947Z · LW · GW

The asymmetry between the output function and the intermediate neuron functions comes from backprop -- from the fact that the gradients are backprop-ed through weight matrices with entries of magnitude $O(1/\sqrt{N})$. So the gradient of the output w.r.t. itself is obviously 1, then the gradient of the output w.r.t. each neuron in the preceding layer is $O(1/\sqrt{N})$, since you're just multiplying by a vector with those entries. Then by induction all other preceding layers' gradients are the sum of $N$ random things of size $O(1/N)$, and so are of size $O(1/\sqrt{N})$ again. So taking a step of backprop will change the output function by $O(1)$ but the intermediate functions by $O(1/\sqrt{N})$, vanishing in the large-width limit.

(This is kind of an oversimplification since it is possible to have changing intermediate functions while doing backprop, as mentioned in the linked paper. But this is the essence of why it's possible in *some* limits to move around using backprop without changing the intermediate neurons)
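The scaling argument can be checked numerically on a toy case. This is a sketch under stated assumptions: a two-layer tanh network in NTK-style parameterization, a single input, a single gradient step on squared loss with learning rate 1, and a target chosen so the residual is exactly -1. The function name and all sizes are hypothetical, chosen only for illustration.

```python
import numpy as np

def one_step_changes(width, dim=10, lr=1.0, seed=0):
    """Change in the output and RMS change per hidden unit after one gradient step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    W = rng.standard_normal((width, dim))
    a = rng.standard_normal(width)

    def forward(W, a):
        z = np.tanh(W @ x / np.sqrt(dim))      # hidden activations
        return z, a @ z / np.sqrt(width)       # output uses 1/sqrt(width) scaling

    z, f = forward(W, a)
    resid = -1.0                               # assumption: target is f + 1
    grad_a = resid * z / np.sqrt(width)
    grad_W = resid * np.outer(a * (1 - z**2), x) / np.sqrt(width * dim)
    z_new, f_new = forward(W - lr * grad_W, a - lr * grad_a)

    output_change = abs(f_new - f)
    hidden_change = np.sqrt(np.mean((z_new - z) ** 2))
    return output_change, hidden_change

for n in (100, 10_000):
    out, hid = one_step_changes(n)
    print(f"width={n:6d}  output change={out:.3f}  RMS hidden change={hid:.5f}")
```

In this setup the output moves by an amount that stays of order 1 as the width grows, while the per-neuron change shrinks roughly like 1/√N.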