## Comments

**interstice** on Why did no LessWrong discourse on gain of function research develop in 2013/2014? · 2021-06-19T15:54:32.790Z · LW · GW

I actually agree with you there; there was always discussion of GCR along with extinction risks (though I think Eliezer in particular was more focused on extinction risks). However, they're still distinct categories: even the deadliest of pandemics is unlikely to cause extinction.

**interstice** on Why did no LessWrong discourse on gain of function research develop in 2013/2014? · 2021-06-19T15:36:07.712Z · LW · GW

Killing 90% of the human population would not be enough to cause extinction. That would leave a population of 800 million, higher than the world population in 1700.

**interstice** on Reply to Nate Soares on Dolphins · 2021-06-10T23:00:22.624Z · LW · GW

It could be considered an essence, but physical rather than metaphysical.

**interstice** on The Homunculus Problem · 2021-05-27T21:35:01.458Z · LW · GW

This feels related to metaphilosophy to me. In the sense that it seems (to me) that one of the core difficulties of metaphilosophy is that, in coming up with a 'model' agent, you need to create an agent that is not only capable of thinking about its own structure, but capable of being *confused* about what that structure is (and presumably, of becoming un-confused). Bayesian etc. approaches can model agents being confused about object-level things, but it's hard to even imagine what a model of an agent confused about ontology would look like.

**interstice** on SGD's Bias · 2021-05-19T19:52:43.229Z · LW · GW

Another example of this sort of thing: least-rattling feedback in driven systems.

**interstice** on Parsing Chris Mingard on Neural Networks · 2021-05-15T21:13:44.031Z · LW · GW

> Perhaps this is a physicist vs mathematician type of thinking though

Good guess ;)

> This is not the same as saying that an extremely wide trained-by-random-sampling neural network would not learn features—there is a possibility that the first time you reach 100% training accuracy corresponds to effectively randomly initialised initial layers + trained last layer, but in expectation all the layers should be distinct from an entirely random initialisation.

I see -- so you're saying that even though the distribution of *output* functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of *intermediate* functions might change. If true, this would mean that feature learning and inductive bias are 'uncoupled' for infinite-width randomly-sampled nets. I think this is false, however -- that is, I think it's provable that the distribution of intermediate functions does *not* change in the infinite-width limit when you condition on the training data, even when conditioning over all layers. I can't find a reference offhand, though -- I'll report back if I find anything resolving this one way or another.

**interstice** on Parsing Chris Mingard on Neural Networks · 2021-05-14T03:14:12.170Z · LW · GW

> The claim I am making is that the reason why feature learning is good is not because it improves inductive bias—it is because it allows the network to be compressed. That is probably at the core of our disagreement.

Yes, I think so. Let's go over the 'thin network' example -- we want to learn some function which can be represented by a thin network. But let's say a randomly-initialized thin network's intermediate functions won't be able to fit the function -- that is (with high probability over the random initialization) we won't be able to fit the function just by changing the parameters of the last layer. It seems there are a few ways we can alter the network to make fitting possible:

(A) Expand the network's width until (with high probability) it's possible to fit the function by only altering the last layer

(B) Keeping the width the same, re-sample the parameters in all layers until we find a setting that can fit the function

(C) Keeping the width the same, train the network with SGD

By hypothesis, all three methods will let us fit the target function. You seem to be saying [I think -- correct me if I'm wrong] that all three methods should have the same inductive bias as well. I just don't see any reason this should be the case -- on the face of it, I would guess that all three have different inductive biases (though A and B might be similar). They're clearly different in some respects -- (C) can do transfer learning but (A) cannot (B is unclear).
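To make the comparison concrete, here's a toy numpy sketch of the three procedures on a target representable by a thin net. All details (widths, sample counts, learning rate) are made up for illustration, and "re-sample all layers" in (B) is approximated by re-sampling the hidden weights and least-squares fitting the readout:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: a function representable by a small ("thin") ReLU net.
X = rng.normal(size=(20, 2))                   # training inputs
y = np.maximum(0, X @ np.array([1.0, -1.0]))   # target values

def forward(X, W, v):
    # One-hidden-layer ReLU net: output = relu(X W) v
    return np.maximum(0, X @ W) @ v

def fit_last_layer(X, y, W):
    # Least-squares fit of the readout v with hidden weights W frozen.
    H = np.maximum(0, X @ W)
    v, *_ = np.linalg.lstsq(H, y, rcond=None)
    return v

# (A) Widen until training only the last layer fits the data.
W_wide = rng.normal(size=(2, 500))
loss_A = np.mean((forward(X, W_wide, fit_last_layer(X, y, W_wide)) - y) ** 2)

# (B) Keep the net thin; re-sample hidden weights repeatedly and keep the
# best sample (a stand-in for "re-sample all layers until something fits").
best_loss_B = np.inf
for _ in range(2000):
    W = rng.normal(size=(2, 8))
    loss = np.mean((forward(X, W, fit_last_layer(X, y, W)) - y) ** 2)
    best_loss_B = min(best_loss_B, loss)

# (C) Keep the net thin; train all layers with full-batch gradient descent.
W, v = rng.normal(size=(2, 8)), rng.normal(size=8) * 0.1
for _ in range(3000):
    H = np.maximum(0, X @ W)
    err = H @ v - y
    grad_v = H.T @ err / len(y)
    grad_W = X.T @ ((err[:, None] * v) * (H > 0)) / len(y)
    v -= 0.1 * grad_v
    W -= 0.1 * grad_W
loss_C = np.mean((forward(X, W, v) - y) ** 2)

print(loss_A, best_loss_B, loss_C)
```

All three fit the training data here; the question at issue is whether they generalize the same way off it, which this toy setup doesn't settle.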

> What do we know about SGD-trained nets that suggests this?

My intuition here is that SGD-trained nets can learn functions non-linearly while NTK/GP can only do so linearly. So in the car detector example, SGD is able to develop a neuron detecting cars through some as-yet unclear 'feature learning' mechanism. The NTK/GP can do so as well, sort of, since they're universal function approximators. However, the way they do this is by taking a giant linear combination of random functions which is able to function identically to a car detector on the data points given. It seems like this might be more fragile/generalize worse than the neurons produced by SGD. Though that is admittedly somewhat conjectural at this stage, since we don't really have a great understanding of how feature learning in SGD works.

> I’ve read the new feature learning paper! We’re big fans of his work, although again I don’t think it contradicts anything I’ve just said.

ETA: Let me elaborate upon what I see as the significance of the 'feature learning in infinite nets' paper. We know that NNGP/NTK models can't learn features, but SGD can: I think this provides strong evidence that they are learning using different mechanisms, and likely have substantially different inductive biases. The question is whether randomly sampled *finite* nets can learn features as well. Since they are equivalent to NNGP/NTK at infinite width, any feature learning they do *can only come from finiteness*. In contrast, in the case of SGD, it's possible to do feature learning *even in the infinite-width limit*. This suggests that even if randomly-sampled finite nets can do feature learning, the mechanism by which they do so is different from SGD, and hence their inductive bias is likely to be different as well.

**interstice** on Parsing Chris Mingard on Neural Networks · 2021-05-12T20:51:47.251Z · LW · GW

> First thank you for your comments and observations—it’s always interesting to read pushback

And thanks for engaging with my random blog comments! TBC, I think you guys are definitely on the right track in trying to relate SGD to function simplicity, and the empirical work you've done fleshing out that picture is great. I just think it could be even *better* if it was based around a better SGD scaling limit ;)

> Therefore, if an optimiser samples functions proportional to their volume, you won’t get any difference in performance if you learn features (optimise the whole network) or do not learn features (randomly initialise and freeze all but the last layer and then train just the last).

Right, this is an even better argument that NNGPs/random-sampled nets don't learn features.

> Given therefore that the posteriors are the same, it implies that feature learning is not aiding inductive bias—rather, feature learning is important for expressivity reasons

I think this only applies to NNGP/random-sampled nets, not SGD-trained nets. To apply to SGD-trained nets, you'd need to show that the new features learned by SGD have the same distribution as the features found in an infinitely-wide random net, but I don't think this is the case. By way of illustration, some SGD-trained nets can develop expressive neurons like 'car detector', enabling them to fit the data with a relatively small number of such neurons. If you used an NNGP to learn the same thing, you wouldn't get a single 'car detector' neuron, but rather some huge linear combination of high-frequency features that can approximate the cars seen in the dataset. I think this would probably generalize worse than the network with an actual 'car detector' (this isn't empirical evidence, of course, but I think what we know about SGD-trained nets and the NNGP strongly suggests a picture like this).

> Furthermore (and on a slightly different note), it is known that infinitesimal GD converges to the Boltzmann distribution for any DNN (very similar to random sampling)

Interesting -- I hadn't seen this before. Just skimming the paper, it sounds like the very small learning rate + added white noise might result in different limiting behavior from usual SGD. Generally it seems that there are a lot of different possible limits one can take; empirically, SGD-trained nets do seem to do 'feature learning', so I'm skeptical of limits that don't have that (I assume these limits don't, for theoretical reasons, anyway. It would be interesting to actually examine the features found in networks trained like this, and to see if they can do transfer learning at all).

re: 'colored noise' -- I'm not sure to what extent this matters. I think a more likely source of discrepancy is the *lack* of white noise in normal training (I guess this counts as 'colored noise' in a sense) and the larger learning rate.

> if anyone can point out why this line of argument is not correct, or can steelman a case for SGD inductive bias appearing at larger scales, I would be very interested to hear it.

Not to be a broken record, but I strongly recommend checking out Greg Yang's work. He clearly shows that there exist infinite-width limits of SGD that can do feature/transfer learning.

**interstice** on Parsing Chris Mingard on Neural Networks · 2021-05-12T05:32:34.359Z · LW · GW

I think we basically agree on the state of the empirical evidence -- the question is just whether NTK/GP/random-sampling methods will continue to match the performance of SGD-trained nets on more complex problems, or if they'll break down, ultimately being a first-order approximation to some more complex dynamics. I think the latter is more likely, mostly based on the lack of feature learning in NTK/GP/random limits.

re: the architecture being the source of inductive bias -- I certainly think this is true in the sense that architecture choice will have a bigger effect on generalization than hyperparameters, or the choice of which local optimizer to use. But I do think that using a local optimizer at all, as opposed to randomly sampling parameters, is likely to have a large effect.

**interstice** on Parsing Chris Mingard on Neural Networks · 2021-05-12T05:02:50.779Z · LW · GW

Yeah, I didn't mean to imply that you guys said 'simple --> large volume' anywhere. I just think it's a point worth emphasizing, especially around here, where I think people will imagine "Solomonoff Induction-like" when they hear about a "bias towards simple functions".

> Also, very briefly on your comment on feature learning—the GP limit is used to calculate the volume of functions locally to the initialization. The fact that kernel methods do not learn features should not be relevant given this interpretation

But in the infinite-width setting, Bayesian inference in general is given by a GP limit, right? Initialization doesn't matter. This means that the arguments for lack of feature learning still go through. It's technically possible that there could be feature learning in *finite*-width randomly-sampled networks, but it seems strange that finiteness would help here (and any such learning would be experimentally inaccessible). This is a major reason that I'm skeptical of the "SGD as a random sampler" picture.

**interstice** on Challenge: know everything that the best go bot knows about go · 2021-05-12T03:25:34.048Z · LW · GW

If your goal is to play as well as the best go bot and/or write a program that plays equally well from scratch, it seems like it's probably impossible. A lot of the go bot's 'knowledge' could well be things like "here's a linear combination of 20000 features of the board predictive of winning". There's no reason for the coefficients of that linear combination to be compressible in any way; it's just a mathematical fact that these particular coefficients happen to be the best at predicting winning. If you accepted "here the model is taking a giant linear combination of features" as "understanding", it might be more doable.

**interstice** on Parsing Chris Mingard on Neural Networks · 2021-05-07T20:11:06.486Z · LW · GW

> Is that the empirical evidence attempts to demonstrate simple --> large volume but is inconclusive, or is it that the empirical evidence does not even attempt to demonstrate simple --> large volume?

They don't really try to show simple --> large volume. What they show is that there is substantial 'clustering', so *some* simple functions have large volume. I like nostalgebraist's remarks on their clustering measures.

> so it seems a little unfair to say that the evidence is that the performance is similar, since that would suggest that they were just comparing max performance by SGD to max performance by NNGP.

Fair point -- they do compare the distributions as well. I don't think it's too surprising that they're similar, since they compare them on test points from the distribution the networks were trained to fit.

> It sounds like you do think there is some chance that neural network generalization is due to an architectural bias towards simplicity

I do, although I'm not sure if I would say 'architectural bias', since I think SGD might play an important role. Unfortunately I don't have much of substance to say about this -- Mingard's is the only research program I'm aware of explicitly trying to link neural networks to simplicity priors. I think the most promising way to make progress here is likely to be analyzing neural nets in some non-kernel limit, like Greg Yang's work or this paper.

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-05-07T18:52:38.735Z · LW · GW

They would exist in a *sufficiently* big random NN, but their density would be extremely low, I think. Like, if you train a normal neural net with 15000 neurons and there's a car detector, the density of car detectors is now 1/15000. Whereas I think the density at initialization is probably more like 1/2^50 or something like that (numbers completely made up), so they'd have a negligible effect on the NTK's learning ability ('slight tweaks' can't happen in the NTK regime, since no intermediate functions change by definition).

A difference with the pruning case is that the number of possible prunings increases exponentially with the number of neurons, while the number of neurons is only linear. My take on the LTH is that pruning is basically just a weird way of doing optimization, so it's not that surprising that you can get good performance.

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-05-07T17:54:54.393Z · LW · GW

Yeah, that summary sounds right.

I'd say (b) -- it seems quite unlikely to me that the NTK/GP are universally data-efficient, while neural nets might be (although that's mostly speculation on my part). I think the lack of feature learning is a stronger argument that the NTK/GP don't characterize neural nets well.

**interstice** on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-05-07T17:45:53.888Z · LW · GW

Yeah, exactly -- the problem is that there are some small-volume functions which are actually simple. The argument for small-volume --> complex doesn't go through since there could be other ways of specifying the function.

Other senses of simplicity include various circuit complexities and Levin complexity. There's no argument that parameter-space volume corresponds to either of them AFAIK (you might think parameter-space volume would correspond to "neural net complexity", the number of neurons in a minimal-size neural net needed to compute the function, but I don't think this is true either -- every parameter is Gaussian, so it's unlikely for most of them to be zero).

**interstice** on Parsing Chris Mingard on Neural Networks · 2021-05-07T03:03:14.564Z · LW · GW

For reasons elaborated upon in this post and its comments, I'm kinda skeptical of these results. Basically, the claims made are:

(A) That the parameter->function map is "biased towards simple functions". It's important to distinguish simple --> large volume from large volume --> simple. Simple --> large volume is the property that Solomonoff induction has and what makes it universal, but large volume --> simple is what is proven in these papers (plus some empirical evidence of unclear import)

(B) SGD being equivalent to random selection. The evidence is the empirical performance of Gaussian processes being similar to neural nets on simple tasks. But this may break down on more difficult problems (the link is about the NTK, not the GP, but they tend to perform similarly; indeed the NTK usually performs better than the GP)

Overall I think it's likely we'll need to actually analyze SGD in a non-kernel limit to get a satisfactory understanding of "what's really going on" with neural nets.

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-05-07T00:24:10.619Z · LW · GW

There's an important distinction[^1] to be made between these two claims:

A) Every function with large volume in parameter-space is simple

B) Every simple function has a large volume in parameter space

For a method of inference to qualify as a 'simplicity prior', you want both claims to hold. This is what lets us derive bounds like 'Solomonoff induction matches the performance of any computable predictor', since all of the simple, computable predictors have relatively large volume in the Solomonoff measure, so they'll be picked out after boundedly many mistakes. In particular, you want there to be an implication like: if a function has complexity $K$, it will have parameter-volume at least (roughly) $2^{-K}$.
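For concreteness, here is the standard Solomonoff-style version of that implication (a sketch of the textbook statements, not anything specific to the Mingard results): a computable predictor $q$ gets prior weight at least $2^{-K(q)}$ in the universal mixture $M$, which immediately bounds the mixture's regret against $q$:

```latex
% Simple => large volume: prior weight of a computable predictor q
M(q) \;\ge\; 2^{-K(q)}

% ...which yields the bounded-regret property: the cumulative log-loss of
% the universal mixture M exceeds that of q by at most K(q) ln 2
\sum_{t=1}^{T} -\log M(x_t \mid x_{<t})
  \;\le\; \sum_{t=1}^{T} -\log q(x_t \mid x_{<t}) \;+\; K(q)\,\ln 2
```

The point of the distinction above is that (A) alone gives you neither line.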

Now, the Mingard results, at least the ones that have mathematical proof, rely on the Levin bound. This only shows (A), which is the direction that is much easier to prove -- it automatically holds for any mapping from parameter-space to functions with bounded complexity. They also have some empirical results that show there is substantial 'clustering', that is, there are *some* simple functions that have large volumes. But this still doesn't show that all of them do, and indeed it is compatible with the learnable function class being extremely limited. For instance, this could easily be the case even if the NTK/GP were only able to learn linear functions. In reality the NTK/GP is capable of approximating arbitrary functions on finite-dimensional inputs but, as I argued in another comment, this is not the right notion of 'universality' for classification problems. I strongly suspect[^2] that the NTK/GP can be shown to not be 'universally data-efficient' as I outlined there, but as far as I'm aware no one's looked into the issue formally yet. Empirically, I think the results we have so far suggest that the NTK/GP is a decent first-order approximation for simple tasks that tends to perform worse on the more difficult problems that require non-trivial feature learning/efficiency.

[^1]: I actually posted basically the same thing underneath another one of your comments a few weeks ago, but maybe you didn't see it because it was only posted on LW, not the Alignment Forum.

[^2]: Basically, because in the NTK/GP limit the functions for all the neurons in a given layer are sampled from a single computable distribution, so I think you can show that the embedding is 'effectively finite' in some sense (although note it *is* a universal approximator for fixed input dimension).

**interstice** on AMA: Paul Christiano, alignment researcher · 2021-04-30T17:11:10.317Z · LW · GW

Have you read much philosophy? If so, what are your favorite books/articles?

**interstice** on AMA: Paul Christiano, alignment researcher · 2021-04-30T01:16:58.979Z · LW · GW

Any thoughts on the Neural Tangent Kernel/Gaussian Process line of research? Or attempts to understand neural network training at a theoretical level more generally?

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-24T21:20:15.826Z · LW · GW

> By universal approximation, these features will be sufficient for any downstream learning task

Right, but trying to fit an unknown function with linear combinations of those features might be *extremely data-inefficient*, such that it is basically unusable for difficult tasks. Of course you could do better if you're not restricted to linear combinations -- for instance, if the map is injective you could invert back to the original space and apply whatever algorithm you wanted. But at that point you're not really using the Fourier features at all. In particular, the NTK always learns a linear combination of its features, so it's the efficiency of linear combinations that's relevant here.
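As a toy illustration of that data-inefficiency (my own numpy sketch, with a made-up frequency scale, not taken from the post under discussion): fit a simple step function with the minimum-norm linear combination of fixed random Fourier features and look at how it behaves away from the training points:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: a "simple" step function on [-1, 1].
f = lambda x: np.sign(x)

# Random Fourier features, fixed at "initialization": phi_j(x) = cos(w_j x + b_j).
n_feat = 2000
w = rng.normal(scale=20.0, size=n_feat)        # assumed frequency scale (made up)
b = rng.uniform(0, 2 * np.pi, size=n_feat)
phi = lambda x: np.cos(np.outer(x, w) + b)

x_tr = rng.uniform(-1, 1, size=15)             # only a handful of training points
y_tr = f(x_tr)

# Minimum-norm linear combination of the fixed features interpolating the data.
coef, *_ = np.linalg.lstsq(phi(x_tr), y_tr, rcond=None)

x_te = np.linspace(-0.95, 0.95, 400)
train_err = np.max(np.abs(phi(x_tr) @ coef - y_tr))
test_err = np.mean((phi(x_te) @ coef - f(x_te)) ** 2)
print(train_err, test_err)
```

The linear combination interpolates the 15 training points, but with these (assumed) high-frequency features the fit decays between them, so off-sample error stays large: a crude stand-in for "functions identically on the data points given, but fragile off them."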

> I agree that there is no learning taking place and that such a method may be inefficient. However, that goes beyond my original objection.

You originally said that the NTK doesn't learn features because its feature class already has a good representation at initialization. What I was trying to convey (rather unclearly, admittedly) in response is:

A) There exist learning algorithms that have universal-approximating embeddings at initialization yet learn features. If we have an example of P and !Q, P --> Q cannot hold in general, so I don't think it's right to say that the NTK's lack of feature learning is due to its universal-approximation property.

B) Although the NTK's representation may be capable of approximating arbitrary functions, it will probably be very *slow* at learning some of them, perhaps so slow that using it is infeasible. So I would dispute that it already has 'good' representations. While it's universal in one sense, there might be some other sense of 'universal efficiency' in which it's lacking, and where feature-learning algorithms can outperform it.

> This is not a trivial question

I agree that in practice there's likely to be some relationship between universal approximation and efficiency, I just think it's worth distinguishing them conceptually. Thanks for the paper link BTW, it looks interesting.

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-24T02:23:24.218Z · LW · GW

Ah, rereading your original comment more carefully I see that you indeed didn't say anything about 'universal learning'. You're quite right that the NTK is a universal function approximator. My apologies.

However, I still disagree that this is the reason the NTK doesn't learn features. I think that 'universal function approximation' and 'feature learning' are basically unrelated dimensions along which a learning algorithm can vary. That is, it's quite possible to imagine a learning algorithm which constructs a sequence of different embeddings, *all* of which are universal approximators. The paper by Greg Yang I linked gives an example of such an algorithm (I don't think he explicitly proves this, but I'm pretty sure it's true).

What I was trying to get at with the 'universal learning' remarks is that, although the NTK does indeed contain all finite embeddings, I believe that it does not do so in a very efficient way -- it might require disproportionately many training points to pick out what are, intuitively, fairly simple embeddings. I believe this is what is behind the poor performance of empirical NTKs compared to SGD-trained nets, as I brought up in this comment, and it ultimately explains why algorithms that do 'feature learning' can outperform those that don't -- the feature learning algorithms are able to find more efficient embeddings for a given set of inputs (of course, it's possible to imagine a *fixed* embedding that's 'optimally efficient' in some way, but as far as I'm aware the NTK has no such property). This issue of 'embedding efficiency' seems only loosely related to the universal approximation property.

To formalize this, it would be nice to develop a theory of universal inference in the setting of classification problems akin to Solomonoff induction. To effectively model this in an asymptotic theory, I think it might be necessary to increase the dimension of the model input along with the number of data points, since otherwise all universal approximators for a given dimension will have asymptotically the same performance. Everything in this paragraph is just my personal speculation though; as far as I'm aware there's no existing theory of universal inference in classification problems, so if you found my remarks confusing that's pretty understandable :)

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T21:48:16.174Z · LW · GW

I think nets are usually increased in depth as well as width when they are 'scaled up', so the NTK limit doesn't apply -- the convergence to NTK is controlled by the ratio of depth to width, only approaching a deterministic kernel if this ratio approaches 0.

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T21:29:13.916Z · LW · GW

There's a big difference between 'universal learner' and 'fits any smooth function on a fixed input space'. The 'universal learner' property is about data efficiency: do you have bounded regret compared to any learning algorithm in some wide class? Solomonoff induction has this property with respect to computable predictors on binary strings, for instance. There are lots of learning algorithms able to fit any finite binary sequence but which are not universal. I haven't seen a good formalism for this in the neural net case, but I think it would involve letting the input dimension increase with the number of data points, and comparing the asymptotic performance of various algorithms.

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T20:25:30.733Z · LW · GW

So I read through the Maddox et al. study, and it definitely does *not* show that the NTK can do transfer learning. They pre-train *using SGD* on a single task, then use the NTK computed on the trained network to do Bayesian inference on some other tasks. They say in a footnote on page 9, "Note that in theory, there is no need to train the network at all. We found that it is practically useful to train the network to learn good representations." This makes me suspect that they tried using the NTK to learn the transfer parameters but it didn't work.

Regarding the empirical results about the NTK explaining the performance of neural nets, I found this study interesting. They computed the 'empirical NTK' of some finite-width networks and compared the performance of the solution found by SGD to that found by solving the NTK. For standard widths, the NTK solution performed substantially worse (up to a 20% drop in accuracy). The gap closed to some extent, but not completely, upon making the network much wider. The size of the gap also correlated with the complexity of the task (0.5% gap for MNIST, 13% for CIFAR, 18% for a subset of ImageNet). The trajectory of the weights also diverged substantially from the NTK prediction, even on MNIST. All of this seems consistent with the NTK being a decent first-order approximation that breaks down on the really hard tasks that require the networks to do non-trivial feature learning.
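For anyone who wants to poke at this themselves, here's a minimal sketch of the 'empirical NTK' procedure on a toy regression problem (tiny net, finite-difference gradients, every detail made up; the studies above are far larger and use classification):

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny one-hidden-layer tanh net whose tangent kernel we compute numerically.
d, h = 2, 64
params = rng.normal(size=d * h + h) / np.sqrt(h)

def net(params, X):
    W = params[:d * h].reshape(d, h)
    v = params[d * h:]
    return np.tanh(X @ W) @ v

def jacobian(params, X, eps=1e-5):
    # J[i, j] = d f(x_i) / d theta_j, via central differences.
    J = np.empty((len(X), len(params)))
    for j in range(len(params)):
        p_hi, p_lo = params.copy(), params.copy()
        p_hi[j] += eps
        p_lo[j] -= eps
        J[:, j] = (net(p_hi, X) - net(p_lo, X)) / (2 * eps)
    return J

X_tr = rng.normal(size=(30, d))
y_tr = np.sin(X_tr[:, 0])                      # toy regression target
X_te = rng.normal(size=(100, d))

J_tr, J_te = jacobian(params, X_tr), jacobian(params, X_te)
K_tr = J_tr @ J_tr.T                           # empirical NTK on training points
K_te = J_te @ J_tr.T
f0_tr, f0_te = net(params, X_tr), net(params, X_te)

# Kernel regression in the tangent space around the initial function.
alpha, *_ = np.linalg.lstsq(K_tr, y_tr - f0_tr, rcond=None)
train_mse = np.mean((f0_tr + K_tr @ alpha - y_tr) ** 2)
test_mse = np.mean((f0_te + K_te @ alpha - np.sin(X_te[:, 0])) ** 2)
print(train_mse, test_mse)
```

The interesting experiment is then comparing `test_mse` here against actually training the same net with SGD, which is exactly the comparison the study runs at scale.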

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T17:12:46.002Z · LW · GW

Yes, and indeed in the NTK limit we can model ordinary training that way.

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T17:11:34.522Z · LW · GW

> NTK doesn’t learn features because the feature class at initialization is a universal class

I've never heard of any result suggesting this -- what's your argument? I suspect the opposite: by the central limit theorem, the partial derivatives and activations at each layer tend toward samples from a fixed distribution (differing per layer but fixed across neurons). I think this means that the NTK embedding is 'essentially finite' and actually not universal (though I'm not sure). Note that to show universality it's not enough to show that all embeddings can be found; you also need an argument showing that their density in the NTK embedding is bounded above zero.
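A quick numpy check of the CLT intuition (my own toy sketch, with made-up sizes): under standard 1/sqrt(fan-in) scaling, the per-neuron distribution of second-layer preactivations stabilizes as width grows, consistent with neurons behaving like i.i.d. samples from a fixed, width-independent distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10)  # one fixed input, dimension 10

def layer2_preacts(width, n_nets=400):
    # One second-layer preactivation from each of n_nets random two-layer
    # ReLU nets, under standard 1/sqrt(fan-in) scaling.
    out = []
    for _ in range(n_nets):
        W1 = rng.normal(size=(10, width)) / np.sqrt(10)
        w2 = rng.normal(size=width) / np.sqrt(width)
        h = np.maximum(0, x @ W1)   # first-layer ReLU activations
        out.append(h @ w2)
    return np.array(out)

# The per-neuron spread is roughly the same at every width:
stds = {w: layer2_preacts(w).std() for w in (10, 100, 1000)}
print(stds)
```

Of course this only illustrates the fixed-distribution claim, not the stronger (and, as noted, uncertain) claim that the resulting embedding fails to be universal in the density sense.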

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-23T05:33:18.015Z · LW · GW

Hmm, so regarding the linear combinations: it's true that there are *some* linear combinations that will change by $O(1)$ in the large-width limit -- just use the vector of partial derivatives of the output at some particular input; this sum will change by the amount that the output function moves during the regression. Indeed, I suspect (but don't have a proof) that these particular combinations will span the space of linear combinations that change non-trivially during training. I would dispute "we expect most linear combinations to change", though -- the CLT argument implies that we should expect almost all combinations to *not* appreciably change. Not sure what effect this would have on the PCA, and I still think it's plausible that it doesn't change at all (actually, I think Greg Yang states that it doesn't change in section 9 of his paper; haven't read that part super carefully though).

> And the tangent kernel not changing does not imply that transfer learning won’t work

So I think I was a bit careless in saying that the NTK can't do transfer learning at all -- a more exact statement might be "the NTK does the minimal amount of transfer learning possible". What I mean by this is, *any* learning algorithm can do transfer learning if the task we are 'transferring' to is sufficiently similar to the original task -- for instance, if it's just the exact same task but with a different data sample. I claim that the 'transfer learning' the NTK does is of this sort. As you say, since the tangent kernel doesn't change at all, the net effect is to move where the network starts in the tangent space. Disregarding convergence speed, the impact this has on generalization is determined by the values set by the old function on axes of the NTK outside the span of the partial derivatives at the new function's data points. This means that, for the NTK to transfer anything from one task to another, it's not enough for both tasks to feature, for instance, eyes. The eyes have to correlate with the output in the *exact same way* in both tasks. Indeed, the transfer learning could actually hurt generalization. Nor is its effect invariant under simple transformations like flipping the sign of the target function (this would change beneficial transfer to harmful). By default, for functions that aren't simple multiples, I expect the linear correlation between values on different axes to be about 0, even if the functions share many meaningful features. So while the NTK can do 'transfer learning' in a sense, it's about as weak as possible, and I strongly doubt that this sort of transfer is sufficient to explain transfer learning's successes in practice (though I don't have empirical proof).

> I do think the empirical results pretty strongly suggest that the NTK/GP model captures everything important about neural nets, at least in terms of their performance on the original task.

It's true that NTK/GP perform pretty closely to finite nets on the tasks we've tried them on so far, but those tasks are pretty simple and we already had decent non-NN solutions. Generally the pattern is "GP matches NNs on really simple tasks, NTK on somewhat harder ones". I think the data we have is consistent with this breaking down as we move to the harder problems that have no good non-NN solutions. I would be very interested in seeing an experiment with NTK on, say, ImageNet for this reason, but as far as I know no one's done so because of the prohibitive computational cost.

I only found one directly-relevant study, which is on way too small and simple a system for me to draw much of a conclusion, but it does seem to have worked.

Thanks for the link -- will read this tomorrow.

BTW, thanks for humoring me throughout this thread. This is really useful, and my understanding is updating considerably.

And thank you for engaging in detail -- I have also found this very helpful in forcing me to clarify(partially to myself) what my actual beliefs are.

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-22T18:46:20.922Z · LW · GW

I don't think taking linear combinations will help, because adding terms to the linear combination will also increase the magnitude of the original activation vector -- e.g. if you add together N units, the magnitude of the sum of their original activations will with high probability be O(√N), dwarfing the O(1) change due to change in the activations. But regardless, it can't help with transfer learning at all, since the tangent kernel(which determines learning in this regime) doesn't change by definition.
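
A quick numerical illustration of the magnitude argument (a generic sketch of my own, not tied to any particular net): the sum of N independent O(1) quantities with random signs has typical size O(√N), so it swamps an O(1) perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
# 1000 trials of summing N independent activations, each of typical size O(1):
sums = rng.normal(0, 1, size=(1000, N)).sum(axis=1)
# The typical magnitude of the sum grows like sqrt(N), not N:
print(np.std(sums) / np.sqrt(N))  # close to 1
```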

What empirical results do you think are being contradicted? As far as I can tell, the empirical results we have are 'NTK/GP have similar performance to neural nets on some, but not all, tasks'. I don't think transfer/feature learning is addressed at all. You might say these results are suggestive evidence that NTK/GP captures everything important about neural nets, but this is precisely what is being disputed with the transfer learning arguments.

I can imagine doing an experiment where we find the 'empirical tangent kernel' of some finite neural net at initialization, solve the linear system, and then analyze the activations of the resulting network. But it's worth noting that this is not what is usually meant by 'NTK' -- that usually includes taking the infinite-width limit at the same time. And to the extent that we expect the activations to change at all, we no longer have reason to think that this linear system is a good approximation of SGD. That's what the above mathematical results mean -- the same mathematical analysis that implies that network training is like solving a linear system, *also* implies that the activations don't change at all.
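
For concreteness, the 'empirical tangent kernel' experiment I have in mind would look roughly like this (a hypothetical minimal sketch with a one-hidden-layer ReLU net; all sizes and names are my own choices, not from any paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 2, 200                        # input dim, width
W = rng.normal(0, 1 / np.sqrt(d), (N, d))
a = rng.normal(0, 1 / np.sqrt(N), N)

def f(x):                            # one-hidden-layer ReLU net
    return a @ np.maximum(W @ x, 0)

def grad(x):                         # gradient of f w.r.t. all parameters
    h = W @ x
    relu, drelu = np.maximum(h, 0), (h > 0).astype(float)
    gW = np.outer(a * drelu, x)      # df/dW
    ga = relu                        # df/da
    return np.concatenate([gW.ravel(), ga])

X = rng.normal(size=(10, d))         # tiny training set
y = np.sin(X[:, 0])                  # target values
G = np.stack([grad(x) for x in X])
K = G @ G.T                          # empirical NTK Gram matrix at init
f0 = np.array([f(x) for x in X])
alpha = np.linalg.solve(K + 1e-8 * np.eye(len(X)), y - f0)

# Linearized ("lazy") prediction at a new point:
x_new = rng.normal(size=d)
pred = f(x_new) + G @ grad(x_new) @ alpha
```

Analyzing how the actual activations move away from this linearization during SGD is then the interesting part -- the linear system itself, by construction, says nothing about that.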

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-22T17:03:29.303Z · LW · GW

The result that NTK does not learn features in the large N limit is not in dispute at all -- it's right there on page 15 of the original NTK paper, and indeed holds after arbitrarily many steps of backprop. I don't think that there's really much room for loopholes in the math here. See Greg Yang's paper for a lengthy proof that this holds for all architectures. Also worth noting that when people 'take the NTK limit' they often don't initialize an actual net at all, they instead use analytical expressions for what the inner product of the gradients would be at N=infinity to compute the kernel directly.

**interstice** on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-22T05:52:53.947Z · LW · GW

The asymmetry between the output function and the intermediate neuron functions comes from backprop -- from the fact that the gradients are backprop-ed through weight matrices with entries of magnitude O(1/√N). So the gradient of the output w.r.t. itself is obviously 1, then the gradient of the output w.r.t. each neuron in the preceding layer is O(1/√N), since you're just multiplying by a vector with those entries. Then by induction all other preceding layers' gradients are the sum of N random things of size O(1/N), and so are of size O(1/√N) again. So taking a step of backprop will change the output function by O(1) but the intermediate functions by O(1/√N), vanishing in the large-width limit.

(This is kind of an oversimplification since it is possible to have changing intermediate functions while doing backprop, as mentioned in the linked paper. But this is the essence of why it's possible in *some* limits to move around using backprop without changing the intermediate neurons)
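
To sanity-check the scaling argument numerically, here's a toy sketch of my own (a depth-3 linear net is enough to see the effect, with weight entries of magnitude O(1/√N)):

```python
import numpy as np

def hidden_grad_rms(N, rng):
    """RMS gradient of the output w.r.t. first-hidden-layer neurons,
    for weight matrices with entries of magnitude O(1/sqrt(N))."""
    W2 = rng.normal(0, 1 / np.sqrt(N), size=(N, N))  # hidden1 -> hidden2
    w3 = rng.normal(0, 1 / np.sqrt(N), size=N)       # hidden2 -> output
    g1 = W2.T @ w3   # each entry: sum of N random things of size O(1/N)
    return np.sqrt(np.mean(g1 ** 2))

rng = np.random.default_rng(0)
for N in (100, 400, 1600):
    # Rescaled by sqrt(N) the RMS gradient is roughly constant,
    # i.e. the raw per-neuron gradient shrinks like 1/sqrt(N):
    print(N, hidden_grad_rms(N, rng) * np.sqrt(N))
```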

**interstice** on Updating the Lottery Ticket Hypothesis · 2021-04-21T18:50:14.361Z · LW · GW

Yup, people have done this(taking the infinite-width limit at the same time): see here, here. Generally the kernels do worse than the original networks, but not by a lot. On the other hand, they're usually applied to problems that aren't super-hard, where non-neural-net classifiers already worked pretty well. And these models definitely can't explain feature learning, since the functions computed by individual neurons don't change at all during training.

**interstice** on Updating the Lottery Ticket Hypothesis · 2021-04-20T22:40:19.221Z · LW · GW

The tangent-space hypothesis implies something close to this but not quite -- instead of 'dog-detecting neuron', it's 'parameter such that the partial derivative of the output with respect to that parameter, as a function of the input, implements a dog-detector'. This would include (the partial derivative w.r.t.) neurons via their bias.

**interstice** on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-04-18T22:50:25.366Z · LW · GW

Sounds like you might already know this, but it's important to note that the Levin bound only implies that functions with large volumes will be simple -- it doesn't imply that simple functions will have large volume. It's the latter claim that's difficult to show, but also more important for understanding the power of neural nets, since it would imply that they are Solomonoff-induction-like, able to learn arbitrary functions given enough training data. In particular you'd want some sort of exponential *lower* bound on the volume of parameter-space occupied by each function of given complexity. The statement derived from the Levin bound is pretty trivial and would hold even if the network could only represent linear functions, constant functions, or some other quite limited class.
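
A toy numerical illustration of the volume picture (my own sketch, not from the linked paper): sample random parameters of a tiny ReLU net on 3-bit inputs and count which boolean function gets implemented. The distribution over functions is heavily skewed, with a handful of functions occupying most of parameter space -- which is the direction the Levin bound speaks to, not the converse.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
# All 8 three-bit inputs:
X = np.array([[int(b) for b in f"{i:03b}"] for i in range(8)])

counts = Counter()
for _ in range(20_000):
    W = rng.normal(size=(4, 3))                       # random hidden weights
    v = rng.normal(size=4)                            # random output weights
    out = (np.maximum(X @ W.T, 0) @ v > 0).astype(int)  # thresholded net output
    counts[tuple(out)] += 1                           # which boolean function?

# There are 256 boolean functions on 3 bits (this net always sends 000 to 0,
# so at most 128 are reachable), yet a few functions dominate the counts:
print(len(counts), counts.most_common(3))
```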

**interstice** on Against "Context-Free Integrity" · 2021-04-15T20:35:29.674Z · LW · GW

Indeed. *Quis custodiet ipsos custodes?*

**interstice** on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-09T18:21:01.864Z · LW · GW

Is the following scenario a good example of the sort of problem you have in mind? Say you have two advanced ML systems with values that are partially, but not entirely, aligned with humanity: their utility function is 0.9 * (human values) + 0.1 * (control of resources). These two ML systems have been trained with advanced RL, in such a fashion that, when interacting with other powerful systems, they learn to play Nash equilibria. The only Nash equilibrium of their interaction is one where they ruthlessly compete for resources, making the Earth uninhabitable in the process. So both systems are "pretty much aligned", but their joint interaction is radically unaligned. If this seems like a reasonable example, two thoughts:

A) I think other people in this discussion might be envisioning 'aligned AI' as looking more like an approval-directed agent, rather than a system trained with RL on a proxy for the human utility function. Crucially, in this paradigm the system's long-term planning and bargaining are emergent consequences of what it predicts an (amplified) human would evaluate highly, they're not baked into the RL algorithm itself. This means it would only try to play a Nash equilibrium if it thinks humans would value that highly, which, in this scenario, they would not. In approval-directed AI systems, or more generally systems where strategic behavior is an emergent consequence of some other algorithm, bargaining ability should rise in tandem with general capability, making it unlikely that very powerful systems would have 'obvious' bargaining failures.

B) It seems that systems that are bad at bargaining would also be worse at acquiring resources. For instance, maybe the Nash equilibrium of the above interaction of two RL agents would actually be more like 'try to coordinate a military strike against the other AI as soon as possible', leaving both systems crippled, or leading to a unipolar scenario(which would be OK given the systems' mostly-aligned utility functions). The scenarios in the post seem to envision systems with some ability to bargain with others, but only for certain parts of their utility function, maybe those that are simple to measure. I think it might be worth emphasizing that more, or describing what kind of RL algorithms would give rise to bargaining abilities that look like that.
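
The scenario above can be made concrete with a toy payoff matrix (the numbers are hypothetical, chosen by me for illustration): each unilateral 'grab resources' move only slightly damages the shared human-values term while yielding private resources, so mutual defection is the unique Nash equilibrium even though both agents put 0.9 weight on human values.

```python
import itertools

# Shared human-values payoff H, keyed by (my action, their action), and
# my private resource payoff R. 'C' = cooperate, 'D' = grab resources.
H = {('C', 'C'): 1.0, ('C', 'D'): 0.95, ('D', 'C'): 0.95, ('D', 'D'): 0.9}
R = {('C', 'C'): 0.0, ('C', 'D'): 0.0, ('D', 'C'): 1.0, ('D', 'D'): 0.5}

def u(me, other):
    """Each agent's utility: 0.9 * (human values) + 0.1 * (own resources)."""
    return 0.9 * H[(me, other)] + 0.1 * R[(me, other)]

def is_nash(a, b):
    return all(u(a, b) >= u(a2, b) for a2 in 'CD') and \
           all(u(b, a) >= u(b2, a) for b2 in 'CD')

equilibria = [(a, b) for a, b in itertools.product('CD', repeat=2)
              if is_nash(a, b)]
print(equilibria)                 # only mutual defection
print(u('C', 'C'), u('D', 'D'))   # yet mutual cooperation is better for both
```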

**interstice** on Predictive Coding has been Unified with Backpropagation · 2021-04-04T19:58:30.804Z · LW · GW

That's assuming that the brain is using predictive coding to implement backprop, whereas it might instead be doing something that is more computationally efficient given its hardware limitations. (Indeed, the fact that this scheme is so inefficient should make you update towards the brain not doing it.)

**interstice** on samshap's Shortform · 2021-03-15T21:26:12.433Z · LW · GW

I mean, there's nothing *wrong* with using the log scoring rule. But since the implied probabilities will change depending on how you aggregate the utilities, it doesn't seem to me that it gets us any closer to a truly objective, consequence-free answer -- 'objective probability' is still meaningless here, it all depends on the bet structure.

**interstice** on Predictions for future dispositions toward Twitter · 2021-03-15T00:56:16.455Z · LW · GW

I'll make a contrary prediction: I think that twitter, used correctly, can be an extremely powerful tool for learning new things & meeting new people, themselves some of the most important things to get right in life. In the future it will be seen as flawed in some ways, but overall a clear positive, and hopefully the progenitor of even better ways of connecting information and people.

**interstice** on samshap's Shortform · 2021-03-14T23:02:47.195Z · LW · GW

Isn't this kind of circular? The justification for the logarithmic scoring rule is that it gets agents to report their true beliefs, in contexts where such beliefs clearly make sense(no anthropic weirdness, in particular), and where agents have utilities linear in money. Extending this as a definition to situations where such beliefs don't make sense seems arbitrary.
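
For reference, the property being appealed to, sketched numerically (this is the standard propriety fact, not anything specific to this thread): with utility linear in money, the expected log score is maximized by reporting one's true probability.

```python
import numpy as np

q = 0.7                                  # true probability of the event
p = np.linspace(0.01, 0.99, 981)         # candidate reported probabilities
# Expected log score of reporting p when the event has probability q:
expected_score = q * np.log(p) + (1 - q) * np.log(1 - p)
best = p[np.argmax(expected_score)]
print(best)  # the maximizer is the true belief q
```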

**interstice** on Strong Evidence is Common · 2021-03-14T00:22:37.068Z · LW · GW

I think the point is that you could become confident of this *before* spending a lot of time/money on the market, if there are other strong forms of evidence correlated with market success.

**interstice** on Above the Narrative · 2021-03-09T20:47:27.443Z · LW · GW

Thanks. Sounds like he had a lot of the pieces beforehand, although I didn't find strong evidence that Satoshi got them from him. Could be an independent re-invention.

**interstice** on Why Hasn't Effective Altruism Grown Since 2015? · 2021-03-09T03:54:20.259Z · LW · GW

Data point: I was definitely in the ''rationalism seemed innately obvious" camp. re: the broader reach of EA, can't confirm or deny either way myself, but here's an alternative perspective.

Maybe the "1% of the 1%" he mentions are the people who naturally take to EA? I also suspect that the undergrads he talks to are far from a random sample of elite-college undergrads. I think the analogy with communism is important -- like any intellectual movement, most of the expected impact of EA probably comes from in its potential to influence key decision-makers at some future pivotal moment.

**interstice** on Above the Narrative · 2021-03-09T02:42:59.589Z · LW · GW

Agree with most of what you say, but isn't the blockchain the one invention where you can really say that it definitely *wasn't* created at a centralized institution? Or are you saying that it's likely that Satoshi was an academic, or got most of his ideas from the academy?

**interstice** on interstice's Shortform · 2021-03-08T21:14:11.413Z · LW · GW

It recently occurred to me that humans' apparent tendency to seek status could emerge without any optimization for such, conscious or subconscious, being built into the brain at all. Instead, it could be an emergent consequence of our tendency to preferentially attend to and imitate certain people over others. According to The Secret of Our Success, such imitation can extend down to very low-level patterns of behavior, such as what foods we enjoy eating. So you could imagine peoples' behavior and personalities being determined by a sort of 'attentional darwinism': patterns of behavior that tend to get paid attention to and imitated will become common in the population, while those that do not will dwindle. The end result of this will be that an average person's personality will look approximately like an imitation-optimizer -- aka status-seeker -- just like an average organism will look approximately like a fitness-optimizer. This would make humans doubly mesa-optimizers, products of both status-evolution and gene-evolution. This suggests that extracting a CEV of all humanity might be hard, since many of our terminal values could be local to our particular culture's status-evolution.

**interstice** on Economic Class · 2021-03-06T21:30:19.405Z · LW · GW

Interesting(and funny!). I would appreciate more posts on this topic or other "gears-y rundown from a lawyer" type posts.

**interstice** on I'm still mystified by the Born rule · 2021-03-04T04:21:09.032Z · LW · GW

My own most recent pet theory is that the process of branching is deeply linked to thermalization, so to find model systems we should look to things modeling the flow of heat/entropy -- e.g. a system coupled to two heat baths at different temperatures.

**interstice** on I'm still mystified by the Born rule · 2021-03-04T03:49:41.238Z · LW · GW

It’s easy enough to get a single sensory datum — sample a classical state according to the Born probabilities, sample some coordinates, pretend that there’s an eyeball at those coordinates, record what it sees. But once we’ve done that, how do we get our next sense datum?

This doesn't seem like it should be too hard -- if you have some degrees of freedom which you take as representing your 'eyeball', and a preferred basis of 'measurement states' for that eyeball, repeatedly projecting onto that measurement basis will give sensible results for a sequence of measurements. Key here is that you don't have to project e.g. all the electrons in the universe onto their position basis -- just the eyeball DOF onto their preferred 'measurement basis'(which won't look like projecting the electrons onto their position basis either), and then the relevant entangled DOF in the rest of the universe will automatically get projected onto a sensible 'classical-like' state. The key property about the universe's evolution that would make this procedure sensible is non-interference between the 'branches' produced by successive measurements. i.e. if you project onto two different eyeball states at time 1, then at time 2, those states will be approximately non-interfering in the eyeball basis. This is formalized in the consistent histories approach to QM.

What's somewhat trickier is identifying the DOF that make a good 'eyeball' in the first place, and what the preferred basis should be. More broadly it's not even known what quantum theories will give rise to 'classical-like' states at all. The place to look to make progress here is probably the decoherence literature, also quantum darwinism and Jess Riedel's work.

**interstice** on Are the Born probabilities really that mysterious? · 2021-03-02T07:03:58.721Z · LW · GW

If you view the laws of physics as the minimal program capable of generating our observations, the Born rule is no more problematic than any other part of the laws of physics. If our universe was sampled according to a different rule, it would look completely different, just the same as if the terms in the Lagrangian were changed.

**interstice** on Deepmind has made a general inductor ("Making sense of sensory input") · 2021-02-09T05:35:25.450Z · LW · GW

If a thing has two main distinct parts, it seems reasonable to say that the thing is half part-1 and half part-2. This does not necessarily imply that the parts are equally difficult to create, although that would be a reasonable prior if you didn't know much about how the parts worked.