Understanding “Deep Double Descent”

evhub

Understanding “Deep Double Descent”

post by evhub · 2019-12-06T00:00:10.180Z · LW · GW · 51 comments

  Prior work
  “Deep Double Descent”
  Final comments
None
51 comments

If you're not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum et al. released a new paper, “Deep Double Descent,” which I think is a big further advancement in our understanding of this phenomenon. I'd highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts.

Prior work

The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would claim that “bigger models are always better” despite standard statistical machine learning theory predicting that bigger models should be more prone to overfitting. Belkin et al. discovered that the standard bias-variance tradeoff picture actually breaks down once you hit approximately zero training error—what Belkin et al. call the “interpolation threshold.” Before the interpolation threshold, the bias-variance tradeoff holds and increasing model complexity leads to overfitting, increasing test error. After the interpolation threshold, however, they found that test error actually starts to go down as you keep increasing model complexity! Belkin et al. demonstrated this phenomenon in simple ML methods such as decision trees as well as simple neural networks trained on MNIST. Here's the diagram that Belkin et al. use in their paper to describe this phenomenon:

Belkin et al. describe their hypothesis for what's happening as follows:

All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some—in particular, those from richer functions classes—have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. [The inductive bias] is a form of Occam’s razor: the simplest explanation compatible with the observations should be preferred. By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that [are] “simpler”. Thus increasing function class capacity improves performance of classifiers.

I think that what this is saying is pretty magical: in the case of neural nets, it's saying that SGD just so happens to have the right inductive biases that letting SGD choose which model it wants the most out of a large class of models with the same training performance yields significantly better test performance. If you're right on the interpolation threshold, you're effectively “forcing” SGD to choose from a very small set of models with perfect training accuracy (maybe only one realistic option), thus ignoring SGD's inductive biases completely—whereas if you're past the interpolation threshold, you're letting SGD choose which of many models with perfect training accuracy it prefers, thus allowing SGD's inductive bias to shine through.

I think this is strong evidence for the critical importance of implicit simplicity and speed priors in making modern ML work. However, such biases also produce strong incentives for mesa-optimization [? · GW] (since optimizers are simple, compressed policies) and pseudo-alignment [? · GW] (since simplicity and speed penalties will favor simpler, faster proxies). Furthermore, the arguments for the universal prior and minimal circuits [AF · GW] being malign suggest that such strong simplicity and speed priors could also produce an incentive for deceptive alignment [AF · GW].

“Deep Double Descent”

Now we get to Preetum et al.'s new paper, “Deep Double Descent.” Here are just some of the things that Preetum et al. demonstrate in “Deep Double Descent:”

double descent occurs across a wide variety of different model classes, including ResNets, standard CNNs, and Transformers, as well as a wide variety of different tasks, including image classification and language translation,
double descent occurs not just as a function of model size, but also as a function of training time and dataset size, and
since double descent can happen as a function of dataset size, more data can lead to worse test performance!

Crazy stuff. Let's try to walk through each of these results in detail and understand what's happening.

First, double descent is a highly universal phenomenon in modern deep learning. Here is double descent happening for ResNet18 on CIFAR-10 and CIFAR-100:

And again for a Transformer model on German-to-English and English-to-French translation:

All of these graphs, however, are just showcasing the standard Belkin et al.-style double descent over model size (what Preetum et al. call “model-wise double descent”). What's really interesting about “Deep Double Descent,” however, is that Preetum et al. also demonstrate that the same thing can happen for training time (“epoch-wise double descent”) and a similar thing for dataset size (“sample-wise non-monotonicity”).

First, let's look at epoch-wise double descent. Take a look at these graphs for ResNet18 on CIFAR-10:

There's a bunch of crazy things happening here which are worth pointing out. First, the obvious: epoch-wise double descent is definitely a thing—holding model size fixed and training for longer exhibits the standard double descent behavior. Furthermore, the peak happens right at the interpolation threshold where you hit zero training error. Second, notice where you don't get epoch-wise double descent: if your model is too small to ever hit the interpolation threshold—like was the case in ye olden days of ML—you never get epoch-wise double descent. Third, notice the log scale on the y axis: you have to train for quite a while to start seeing this phenomenon.

Finally, sample-wise non-monotonicity—Preetum et al. find a regime where increasing the amount of training data by four and a half times actually increases test loss (!):

What's happening here is that more data increases the amount of model capacity/number of training epochs necessary to reach zero training error, which pushes out the interpolation threshold such that you can regress from the modern (interpolation) regime back into the classical (bias-variance tradeoff) regime, decreasing performance.

Additionally, another thing which Preetum et al. point out which I think is worth talking about here is the impact of label noise. Preetum et al. find that increasing label noise significantly exaggerates the test error peak around the interpolation threshold. Why might this be the case? Well, if we think about the inductive biases story from earlier, greater label noise means that near the interpolation threshold SGD is forced to find the one model which fits all of the noise—which is likely to be pretty bad since it has to model a bunch of noise. After the interpolation threshold, however, SGD is able to pick between many models which fit the noise and select one that does so in the simplest way such that you get good test performance.

Final comments

I'm quite excited about “Deep Double Descent,” but it still leaves what is in my opinion the most important question unanswered, which is: what exactly are the magical inductive biases of modern ML that make interpolation work so well?

One proposal I am aware of is the work of Keskar et al., who argue that SGD gets its good generalization properties from the fact that it finds “shallow” as opposed to “sharp” minima. The basic insight is that SGD tends to jump out of minima without broad basins around them and only really settle into minima with large attractors, which tend to be the exact sort of minima that generalize. Keskar et al. use the following diagram to explain this phenomena:

The more recent work of Dinh et al. in “Sharp Minima Can Generalize For Deep Nets,” however, calls the whole shallow vs. sharp minima hypothesis into question, arguing that deep networks have really weird geometry that doesn't necessarily work the way Keskar et al. want it to. (EDIT: Maybe not. See this comment [AF(p) · GW(p)] for an explanation of why Dinh et al. doesn't necessarily rule out the shallow vs. sharp minima hypothesis.)

Another idea that might help here is Frankle and Carbin's “Lottery Ticket Hypothesis,” which postulates that large neural networks work well because they are likely to contain random subnetworks at initialization (what they call “winning tickets”) which are already quite close to the final policy (at least in terms of being highly amenable to particularly effective training). My guess as to how double descent works if the Lottery Tickets Hypothesis is true is that in the interpolation regime SGD gets to just focus on the wining tickets and ignore the others—since it doesn't have to use the full model capacity—whereas on the interpolation threshold SGD is forced to make use of the full network (to get the full model capacity), not just the winning tickets, which hurts generalization.

That's just speculation on my part, however—we still don't really understand the inductive biases of our models, despite the fact that, as double descent shows, inductive biases are the reason that modern ML (that is, the interpolation regime) works as well as it does. Furthermore, as I noted previously, inductive biases are highly relevant to the likelihood of possible dangerous phenomenon such as mesa-optimization [? · GW] and pseudo-alignment [? · GW]. Thus, it seems quite important to me to do further work in this area and really understand our models' inductive biases, and I applaud Preetum et al. for their exciting work here.

EDIT: I have now written a follow-up to this post talking more about why I think double descent is important titled “Inductive biases stick around [AF · GW].”

51 comments

Comments sorted by top scores.

comment by John_Maxwell (John_Maxwell_IV) · 2019-12-07T08:27:21.049Z · LW(p) · GW(p)

The decision tree result seemed counterintuitive to me, so I took a look at that section of the paper. I wasn't impressed. In order to create a double descent curve for decision trees, they change their notion of "complexity" (midway through the graph... check Figure 5) from the number of leaves in a tree to the number of trees in a forest. Turns out that right after they change their notion of "complexity" from number of leaves to number of trees, generalization starts to improve :)

I don't see this as evidence for double descent per se. Just that ensembling improves generalization. Which is something we've known for a long time. (And this fact doesn't seem like a big mystery to me. From a Bayesian perspective, ensembling is like using the posterior predictive distribution instead of the MAP estimate. BTW I think there's also a Bayesian story for why flat minima generalize better -- the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occam's Razor works...)

Anyway, the authors' rationale seems to be that once your decision tree has memorized the training set, the only way to increase the complexity of your hypothesis class is by adding trees to the forest. I'd rather they had kept the number of trees constant and only modulated the number of leaves.

However, the decision tree discussion does point to a possible explanation of the double descent phenomenon for neural networks. Maybe once you've got enough complexity to memorize the training set, adding more complexity allows for a kind of "implicit ensembling" which leads to memorizing the training set in many different ways and averaging the results together like an ensemble does.

It's suspicious to me that every neural network case study in the paper modulates layer width. There's no discussion of modulating depth. My guess is they tried modulating depth but didn't get the double descent phenomenon and decided to leave those experiments out.

I think increased layer width fits pretty nicely with my implicit ensembling story. Taking a Bayesian perspective on the output neuron: After there are enough neurons to memorize the training set, adding more leads to more pieces of evidence regarding the final output, making estimates more robust. Which is more or less why ensembles work IMO.

Replies from: johnswentworth, evhub

↑ comment by johnswentworth · 2019-12-09T18:48:39.584Z · LW(p) · GW(p)

the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occam's Razor works...

That's exactly correct. You can prove it via the Laplace approximation [LW · GW]: the "width" of the peak in each principal direction is the inverse of an eigenvalue of the Hessian, and each eigenvalue $λ_{i}$ contributes $- \frac{1}{2} l o g (λ_{i})$ to the marginal log likelihood $l o g P [d a t a | m o d e l]$ . So, if a peak is twice as wide in one direction, its marginal log likelihood is higher by $\frac{1}{2} l o g (2)$ , or half a bit. For models in which the number of free parameters is large relative to the number of data points (i.e. the interesting part of the double-descent curve), this is the main term of interest in the marginal log likelihood.

In Jaynes' Logic of Science book, in the chapter on model comparison, I believe he directly claims that this is how/why Occam's Razor works.

That said, it's not clear that this particular proof would generalize properly to systems which perfectly fit the data. In that case, there's an entire surface on which $P [d a t a | θ]$ is constant, so Laplace needs some tweaking.

↑ comment by evhub · 2019-12-07T21:05:52.434Z · LW(p) · GW(p)

Note that double descent also happens with polynomial regression—see here for an example.

Replies from: John_Maxwell_IV

↑ comment by John_Maxwell (John_Maxwell_IV) · 2019-12-08T22:59:49.034Z · LW(p) · GW(p)

I took a look at the Colab notebook linked from that blog post, and there are some subtleties the blog post doesn't discuss. First, the blog post says they're using gradient descent, but if you look at the notebook, they're actually computing a pseudoinverse. [EDIT: They state a justification in their notebook which I missed at first. See discussion below.] Second, the blog post talks about polynomial regression, but doesn't mention that the notebook uses Legendre polynomials. I'm pretty sure high degree Legendre polynomials look like a squiggle which closely follows the x-axis on [-1, 1] (the interval used in their demo). If you fork the notebook and switch from Legendre polynomials to classic polynomial regression (1, x, x^2, x^3, etc.), the degree 100,000 fit appears to be worse than the degree 20 fit. I searched on Google and Google Scholar, and use of Legendre polynomials doesn't seem to be common practice in ML.

Replies from: preetum-nakkiran

↑ comment by Preetum Nakkiran (preetum-nakkiran) · 2019-12-10T00:54:21.681Z · LW(p) · GW(p)

Hi, I'm one of the authors:

Regarding gradient descent vs. pseudoinverse:

As we mention in the note at the top of the Colab,for the linear regression objective, running gradient descent to convergence (from 0 initialization) is *equivalent* to applying the pseudoinverse -- they both find the minimum-norm interpolating solution.

So, we use the pseudoinverse just for computational efficiency; you would get equivalent results by actually running GD.

You're also right that the choice of basis matters. The choice of basis changes the "implicit bias"
of GD in linear regression, analogous to how the choice of architecture changes the implicit bias of GD on neural nets.

-- Preetum

Replies from: John_Maxwell_IV

↑ comment by John_Maxwell (John_Maxwell_IV) · 2019-12-10T06:24:47.178Z · LW(p) · GW(p)

Thanks Preetum. You're right, I missed that note the first time -- I edited my comment a bit.

It might be illuminating to say "the polynomial found by iterating weights starting at 0" instead of "the polynomial found with gradient descent", since in this case, the inductive bias comes from the initialization point, not necessarily gradient descent per se. Neural nets can't learn if all the weights are initialized to 0 at the start, of course :)

BTW, I tried switching from pseudoinverse to regularized linear regression, and the super high degree polynomials seemed more overfit to me.

comment by Zvi · 2021-01-13T20:33:21.460Z · LW(p) · GW(p)

I've stepped back from thinking about ML and alignment the last few years, so I don't know how this fits into the discourse about it, but I felt like I got important insight here and I'd be excited to include this. The key concept that bigger models can be simpler seems very important.

In my words, I'd say that when you don't have enough knobs, you're forced to find ways for each knob to serve multiple purposes slash combine multiple things, which is messy and complex and can be highly arbitrary, whereas with lots of knobs you can do 'the thing you naturally actually want to do.' And once you get sufficiently powerful, the overfitting danger isn't getting any worse with the extra knobs, so sure, why not?

I also strongly agree with orthonormal that including the follow-up as an addendum adds a lot to this post. If it's worth including this, it's worth including both, even if the follow-up wasn't also nominated.

comment by Gurkenglas · 2019-12-07T02:19:42.726Z · LW(p) · GW(p)

The bottom left picture on page 21 in the paper shows that this is not just regularization coming through only after the error on the training set is ironed out: 0 regularization (1/lambda=inf) still shows the effect.

Can we switch to the interpolation regime early if we, before reaching the peak, tell it to keep the loss constant? Aka we are at loss l* and replace the loss function l(theta) with |l(theta)-l*| or (l(theta)-l*)^2.

Replies from: Isnasene

↑ comment by Isnasene · 2019-12-10T04:50:35.502Z · LW(p) · GW(p)

Can we switch to the interpolation regime early if we, before reaching the peak, tell it to keep the loss constant? Aka we are at loss l* and replace the loss function l(theta) with |l(theta)-l*| or (l(theta)-l*)^2.

Interesting! Given that stochastic gradient descent (SGD) does provide an inductive bias towards models that generalize better, it does seem like changing the loss function in this way could enhance generalization performance. Broadly speaking, SGD's bias only provides a benefit when it is searching over many possible models: it performs badly at the interpolation threshold because the lowish complexity limits convergence to a small number of overfitted models. Creating a loss function that allows SGD more reign over the model it selects could therefore improve generalization.

#1 SGD is inductively biased to more generalizeable models in general

#2 an $(l (θ) - l^{*})^{2}$ loss-function gives all models with near $l^{*}$ a wider local minimum

#3 there are many different models where $l (θ) \approx l^{*}$ at a given level of complexity as long as $l^{*} > 0$

then it's plausible that changing the loss-function in this way will help emphasize SGD's bias towards models that generalize better. Point #1 is an explanation for double-descent. Point #2 seems intuitive to me (it makes the loss-function more convex and flatter when models are better performing) and Point #3 does too: there are many different sets of prediction that will all partially fit the training-dataset and yield the same loss function value of $l^{*}$ , which implies that there are also many different predictive models that yield such a loss function.

To illustrate point #3 above, imagine we're trying to fit the set of training observations ${{\to x}_{1}, {\to x}_{2}, {\to x}_{3}, . . ., {\to x}_{i}, . . . {\to x}_{n}}$ . Fully overfitting this set (getting $l (θ) \approx 0$ ) requires us to get all ${\to x}_{i}$ from $1$ to $n$ correct. However, we can partially overfit this set (getting $l (θ) = l^{*}$ ) in a variety of different ways. For instance, if we get all ${\to x}_{i}$ correct except for ${\to x}_{j}$ , we may have roughly $n$ different ways we can pick ${\to x}_{j}$ that could yield the same $l (θ))$ .[1] Consequently, our stochastic gradient descent process is free to apply its inductive bias to a broad set of models that have similar performances but make different predictions.

[1] This isn't exactly true because getting only one ${\to x}_{j}$ wrong without changing the predictions for other ${\to x}_{i}$ might only be achievable by increasing complexity since some predictions may be correlated with each other but it demonstrates the basic idea

comment by TurnTrout · 2019-12-06T00:25:13.879Z · LW(p) · GW(p)

Great walkthrough! One nitpick:

find a regime where increasing the amount of training data by four and a half times actually decreases test loss (!):

Did you mean to say increases test loss?

Replies from: evhub

↑ comment by evhub · 2019-12-06T02:23:03.652Z · LW(p) · GW(p)

Thanks! And good catch—should be fixed now.

comment by orthonormal · 2021-01-13T04:18:38.664Z · LW(p) · GW(p)

If this post is selected, I'd like to see the followup [LW · GW] made into an addendum—I think it adds a very important piece, and it should have been nominated itself.

Replies from: habryka4

↑ comment by habryka (habryka4) · 2021-01-13T06:06:53.801Z · LW(p) · GW(p)

I agree with this, and was indeed kind of thinking of them as one post together.

comment by Rohin Shah (rohinmshah) · 2019-12-17T09:32:07.672Z · LW(p) · GW(p)

For the newsletter:

OpenAI blog post summary:

This blog post provides empirical evidence for the existence of the _double descent_ phenomenon, proposed in an earlier paper summarized below. Define the _effective model complexity_ (EMC) of a training procedure and a dataset to be the maximum size of training set such that the training procedure achieves a _train_ error of at most ε (they use ε = 0.1). Let's suppose you start with a small, underparameterized model with low EMC. Then initially, as you increase the EMC, the model will achieve a better fit to the data, leading to lower test error. However, once the EMC is approximately equal to the size of the actual training set, then the model can "just barely" fit the training set, and the test error can increase or decrease. Finally, as you increase the EMC even further, so that the training procedure can easily fit the training set, the test error will once again _decrease_, causing a second descent in test error. This unifies the perspectives of statistics, where larger models are predicted to overfit, leading to increasing test error with higher EMC, and modern machine learning, where the common empirical wisdom is to make models as big as possible and test error will continue decreasing.

They show that this pattern arises in a variety of simple settings. As you increase the width of a ResNet up to 64, you can observe double descent in the final test error of the trained model. In addition, if you fix a large overparameterized model and change the number of epochs for which it is trained, you see another double descent curve, which means that simply training longer can actually _correct overfitting_. Finally, if you fix a training procedure and change the size of the dataset, you can see a double descent curve as the size of the dataset decreases. This actually implies that there are points in which _more data is worse_, because the training procedure is in the critical interpolation region where test error can increase. Note that most of these results only occur when there is _label noise_ present, that is, some proportion of the training set (usually 10-20%) is given random incorrect labels. Some results still occur without label noise, but the resulting double descent peak is quite small. The authors hypothesize that label noise leads to the effect because double descent occurs when the model is misspecified, though it is not clear to me what it means for a model to be misspecified in this context.

Opinion:

While I previously didn't think that double descent was a real phenomenon (see summaries later in this email for details), these experiments convinced me that I was wrong and in fact there is something real going on. Note that the settings studied in this work are still not fully representative of typical use of neural nets today; the label noise is the most obvious difference, but also e.g. ResNets are usually trained with higher widths than studied in this paper. So the phenomenon might not generalize to neural nets as used in practice, but nonetheless, there's _some_ real phenomenon here, which flies in the face of all of my intuitions.

The authors don't really suggest an explanation; the closest they come is speculating that at the interpolation threshold there's only ~one model that can fit the data, which may be overfit, but then as you increase further the training procedure can "choose" from the various models that all fit the data, and that "choice" leads to better generalization. But this doesn't make sense to me, because whatever is being used to "choose" the better model applies throughout training, and so even at the interpolation threshold the model should have been selected throughout training to be the type of model that generalized well. (For example, if you think that regularization is providing a simplicity bias that leads to better generalization, the regularization should also help models at the interpolation threshold, since you always regularize throughout training.)

Perhaps one explanation could be that in order for the regularization to work, there needs to be a "direction" in the space of model parameters that doesn't lead to increased training error, so that the model can move along that direction towards a simpler model. Each training data point defines a particular direction in which training error will increase. So, when the number of training points is equal to the number of parameters, the training points just barely cover all of the directions, and then as you increase the number of parameters further, that starts creating new directions that are not constrained by the training points, allowing the regularization to work much better. (In fact, the original paper, summarized below, _defined_ the interpolation threshold as the point where number of parameters equals the size of the training dataset.) However, while this could explain model-wise double descent and training-set-size double descent, it's not a great explanation for epoch-wise double descent.

Original paper (Reconciling modern machine learning practice and the bias-variance trade-off) summary:

This paper first proposed double descent as a general phenomenon, and demonstrated it in three machine learning models: linear predictors over Random Fourier Features, fully connected neural networks with one hidden layer, and forests of decision trees. Note that they define the interpolation threshold as the point where the number of parameters equals the number of training points, rather than using something like effective model complexiy.

For linear predictors over Random Fourier Features, their procedure is as follows: they generate a set of random features, and then find the linear predictor that minimizes the squared loss incurred. If there are multiple predictors that achieve zero squared loss, then they choose the one with the minimum L2 norm. The double descent curve for a subset of MNIST is very pronounced and has a huge peak at the point where the number of features equals the number of training points.

For the fully connected neural networks on MNIST, they make a significant change to normal training: prior to the interpolation threshold, rather than training the networks from scratch, they train them from the final solution found for the previous (smaller) network, but after the interpolation threshold they train from scratch as normal. With this change, you see a very pronounced and clear double descent curve. However, if you always train from scratch, then it's less clear -- there's a small peak, which the authors describe as "clearly discernible", but to me it looks like it could be noise.

For decision trees, if the dataset has n training points, they learn decision trees of size up to n leaves, and then at that point (the interpolation threshold) they switch to having ensembles of decision trees (called forests) to get more expressive function classes. Once again, you can see a clear, pronounced double descent curve.

Opinion:

I read this paper back when summarizing <@Are Deep Neural Networks Dramatically Overfitted?@> and found it uncompelling, and I'm really curious how the ML community correctly seized upon this idea as deserving of further investigation while I incorrectly dismissed it. None of the experimental results in this paper are particularly surprising to me, whereas double descent itself is quite surprising.

In the random Fourier features and decision trees experiments, there is a qualitative difference in the _learning algorithm_ before and after the interpolation threshold, that suffices to explain the curve. With the random Fourier features, we only start regularizing the model after the interpolation threshold; it is not surprising that adding regularization helps reduce test loss. With the decision trees, after the interpolation threshold, we start using ensembles; it is again not at all surprising that ensembles help reduce test error. (See also this comment. [AF(p) · GW(p)]) So yeah, if you start regularizing (via L2 norm or ensembles) after the interpolation threshold, that will help your test error, but in practice we regularize throughout the training process, so this should not occur with neural nets.

The neural net experiments also have a similar flavor -- the nets before the interpolation threshold are required to reuse weights from the previous run, while the ones after the interpolation threshold do not have any such requirement. When this is removed, the results are much more muted. The authors claim that this is necessary to have clear graphs (where training risk monotonically decreases), but it's almost certainly biasing the results -- at the interpolation threshold, with weight reuse, the test squared loss is ~0.55 and test accuracy is ~80%, while without weight reuse, test squared loss is ~0.35 and test accuracy is ~85%, a _massive_ difference and probably not within the error bars.

Some speculation on what's happening here: neural net losses are nonconvex and training can get stuck in local optima. A pretty good way to get stuck in a local optimum is to initialize half your parameters to do something that does quite well while the other half are initialized randomly. So with weight reuse we might expect getting stuck in worse local optima. However, it looks like the training losses are comparable between the methods. Maybe what's happening is that with weight reuse, the half of parameters that are initialized randomly memorize the training points that the good half of the parameters can't predict, which doesn't generalize well but does get low training error. Meanwhile, without weight reuse, all of the parameters end up finding a good model that does generalize well, for whatever reason it is that neural nets do work well.

But again, note that the authors were right about double descent being a real phenomenon, while I was wrong, so take all this speculation with many grains of salt.

Summary of this post:

This post explains deep double descent (in more detail than my summaries), and speculates on its relevance to AI safety. In particular, Evan believes that deep double descent shows that neural nets are providing strong inductive biases that are crucial to their performance -- even _after_ getting to ~zero training loss, the inductive biases _continue_ to do work for us, and find better models that lead to lower test loss. As a result, it seems quite important to understand the inductive biases that neural nets use, which seems particularly relevant for e.g. <@mesa optimization and pseudo alignment@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@).

Opinion:

I certainly agree that neural nets have strong inductive biases that help with their generalization; a clear example of this is that neural nets can learn randomly labeled data (which can never generalize to the test set), but nonetheless when trained on correctly labeled data such nets do generalize to test data. Perhaps more surprising here is that the inductive biases help even _after_ fully capturing the data (achieving zero training loss) -- you might have thought that the data would swamp the inductive biases. This might suggest that powerful AI systems will become simpler over time (assuming an inductive bias towards simplicity). However, this is happening in the regime where the neural nets are overparameterized, so it makes sense that inductive biases would still play a large role. I expect that in contrast, powerful AI systems will be severely underparameterized, simply because of _how much data_ there is (for example, the largest GPT-2 model still underfits the data).

Replies from: evhub, rohinmshah

↑ comment by evhub · 2019-12-17T19:58:27.583Z · LW(p) · GW(p)

But this doesn't make sense to me, because whatever is being used to "choose" the better model applies throughout training, and so even at the interpolation threshold the model should have been selected throughout training to be the type of model that generalized well. (For example, if you think that regularization is providing a simplicity bias that leads to better generalization, the regularization should also help models at the interpolation threshold, since you always regularize throughout training.)

The idea—at least as I see it—is that the set of possible models that you can choose between increases with training. That is, there are many more models reachable within $n + 1$ steps of training than there are models reachable within $n$ steps of training. The interpolation threshold is the point at which there are the fewest reachable models with zero training error, so your inductive biases have the fewest choices—past that point, there are many more reachable models with zero training error, which lets the inductive biases be much more pronounced. One way in which I've been thinking about this is that ML models overweight the likelihood and underweight the prior, since we train exclusively on loss and effectively only use our inductive biases as a tiebreaker. Thus, when there aren't many ties to break—that is, at the interpolation threshold—you get worse performance.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-12-18T03:36:13.250Z · LW(p) · GW(p)

since we train exclusively on loss and effectively only use our inductive biases as a tiebreaker

If that were true, I'd buy the story presented in double descent. But we don't do that; we regularize throughout training! The loss usually includes an explicit term that penalizes the L2 norm of the weights, and that loss is evaluated and trained against throughout training, and across models, and regardless of dataset size.

It might be that the inductive biases are coming from some other method besides regularization (especially since some of the experiments are done without regularization iirc). But even then, to be convinced of this story, I'd want to see some explanation of how in terms of the training dynamics the inductive biases act as a tiebreaker, and why that explanation doesn't do anything before the interpolation threshold.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-12-18T07:04:13.167Z · LW(p) · GW(p)

Reading your comment again, the first three sentences seem different from the last two sentences. My response above is responding to the last two sentences; I'm not sure if you mean something different by the first three sentences.

↑ comment by Rohin Shah (rohinmshah) · 2019-12-18T06:57:44.943Z · LW(p) · GW(p)

I ended up reading another paper on double descent:

More Data Can Hurt for Linear Regression: Sample-wise Double Descent (Preetum Nakkiran) (summarized by Rohin): This paper demonstrates the presence of double descent (in the size of the dataset) for unregularized linear regression. In particular, we assume that each data point x is a vector in independent samples from N(0, σ^2), and the output is y = βx + ε. Given a dataset of (x, y) pairs, we would like to estimate the unknown β, under the mean squared error loss, with no regularization.

In this setting, when the dimensionality d of the space (and thus number of parameters in β) is equal to the number of training points n, the training data points are linearly independent almost always / with probability 1, and so there will be exactly one β that solves the n linearly independent equalities of the form βx = y. However, such a β must also be fitting the noise variables ε, which means that it could be drastically overfitted, with very high norm. For example, imagine β = [1, 1], and the data points are (-1, 3): 3 and (0, 1): 0 (the data points had errors of +1 and -1 respectively). The estimate will be β = [-3, 0], which is going to generalize very poorly.

As we decrease the number of training points n, so that d > n, there are infinitely many settings of the d parameters of β that satisfy the n linearly independent equalities, and gradient descent naturally chooses the one with minimum norm (even without regularization). This limits how bad the test error can be. Similarly, as we increase the number of training points, so that d < n, there are too many constraints for β to satisfy, and so it ends up primarily modeling the signal rather than the noise, and so generalizing well.

Rohin's opinion: Basically what's happening here is that at the interpolation threshold, the model is forced to memorize noise, and it has only one way of doing so, which need not generalize well. However, past the interpolation threshold, when the model is overparameterized, there are many models that successfully memorize noise, and gradient descent "correctly" chooses one with minimum norm. This fits into the broader story being told in other papers that what's happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn't generalize, and after the interpolation threshold it fits the noise in a way that does generalize. Here that's happening because gradient descent chooses the minimum norm estimator that fits the noise; perhaps something similar is happening with neural nets.

This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don't see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training "fixes" the weights to memorize noise in a different way that generalizes better. While I can't rule it out, this seems rather implausible to me. (Note that regularization is not such an explanation, because regularization applies throughout training, and doesn't "come into effect" after the interpolation threshold.)

comment by interstice · 2019-12-06T02:11:39.094Z · LW(p) · GW(p)

Nice survey. The result about double descent even occurring in dataset size is especially surprising.

Regarding the 'sharp minima can generalize' paper, they show that there exist sharp minima with good generalization, not flat minima with poor generalization, so they don't rule out flatness as an explanation for the success of SGD. The sharp minima they construct with this property are also rather unnatural: essentially they multiply the weights of layer 1 by a constant $α$ and divide the weights of layer 2 by the same constant. The piecewise linearity of ReLU means the output function is unchanged. For large $α$ , the network is now highly sensitive to perturbations in layer 2. These solutions don't seem like they would be found by SGD, so it could still be that, for solutions found by SGD, flatness and generalization are correlated.

Replies from: evhub, DavidHolmes

↑ comment by evhub · 2019-12-06T06:00:35.921Z · LW(p) · GW(p)

Ah—thanks for the summary. I hadn't fully read that paper yet, though I knew it existed and so I figured I would link it, but that makes sense. Seems like in that case the flat vs. sharp minima hypothesis still has a lot going for it—not sure how that interacts with the lottery tickets hypothesis, though.

↑ comment by DavidHolmes · 2022-02-15T09:35:55.895Z · LW(p) · GW(p)

I'm not sure I agree with interstice's reading of the 'sharp minima' paper. As I understand it, they show that a given function can be made into a sharp or flat minimum by finding a suitable point in the parameter space mapping to the function. So if one has a sharp minmum that does not generalise (which I think we will agree exists) then one can make the same function into a flat minimum, which will still not generalise as it is the same function! Sorry I'm 2 years late to the party...

Replies from: interstice

↑ comment by interstice · 2022-02-15T20:49:27.524Z · LW(p) · GW(p)

After doing a quick re-skim I still think my reading is correct. Where in the paper do you think they say that, given a sharp minimum, you can produce a flatter one? The theorems in section 4 are all of the form "given a network, we can find an observationally equivalent one with an arbitrarily large spectral norm", a.k.a. given any network, it's always possible to find a sharper equivalent network.

Replies from: DavidHolmes

↑ comment by DavidHolmes · 2022-02-21T21:02:11.364Z · LW(p) · GW(p)

Thank you for the quick reply! I’m thinking about section 5.1 on reparametrising the model, where they write:

every minimum is observationally equivalent to an infinitely sharp minimum and to an infinitely flat min- imum when considering nonzero eigenvalues of the Hessian;

If we stick to section 4 (and so don’t allow reparametrisation) I agree there seems to be something more tricky going on. I initially assumed that I could e.g. modify the proof of Theorem 4 to make a sharp minimum flat by taking alpha to be big, but it doesn’t work like that (basically we’re looking at alpha + 1/alpha, which can easily be made big, but not very small). So maybe you are right that we can only make flat minimal sharp and not conversely. I’d like to understand this better!

Replies from: interstice

↑ comment by interstice · 2022-02-22T02:44:21.943Z · LW(p) · GW(p)

Yes, they do say they can make the minima flatter in section 5.1 -- but, as you say, only by re-parameterizing. This makes the result totally vacuous! Like yeah, obviously if you're allowed to arbitrarily re-define the mapping from parameters to functions you can make the minima as "sharp" or as "flat" as you want. For the more substantive results in section 4, I do believe the direction is always flat --> sharp.

I've been thinking about the flatness results from a compression perspective. Being at a flat minimum reduces your description length since you only have to specify the coordinates of the basin you're in, and flatter minima correspond to wider basins which can be specified with lower-precision coordinates. Re-parameterizing breaks this by introducing extra complexity into the parameter-function map. The ability to always obtain sharper but not flatter minima is kind of analogous to how, given a program, there are infinitely many longer programs replicating its behavior, but finitely many or none shorter such programs.

Replies from: DavidHolmes

↑ comment by DavidHolmes · 2022-02-22T14:37:32.196Z · LW(p) · GW(p)

Vacuous sure, but still true, and seems relevant to me. You initially wrote:

Regarding the 'sharp minima can generalize' paper, they show that there exist sharp minima with good generalization, not flat minima with poor generalization, so they don't rule out flatness as an explanation for the success of SGD.

But, allowing reparametrisation, this seems false? I don't understand the step in your argument where you 'rule out reparametrisation', nor do I really understand what this would mean.

Your comment relating description length to flatness seems nice. To talk about flatness at all (in the definitions I have seen) requires a choice of parametrisation. And I guess your coordinate description is also using a fixed parametrisation, so this seems reasonable. A change of parametrisation will then change both flatness and description length (i.e. 'required coordinate precision').

Replies from: interstice, DavidHolmes

↑ comment by interstice · 2022-02-22T18:01:00.827Z · LW(p) · GW(p)

When I said they didn't produce flat minima with poor generalization, I meant "they didn't produce flat minima with poor generalization in the normal parameter space of a neural network". This is what is relevant to the "flatness as an explanation for generalization" hypothesis, since that hypothesis is about how flatness correlates with generalization in neural networks in practice. That there exist other parameter-function mappings where this is not the case does not refute this. Of course, if you wished to produce a mathematical proof that flat minima generalize this would be an important example to keep in mind -- but this post was about high-level scientific hypotheses about neural network generalization, not mathematical proofs. In that context I think it's correct to say that the paper does not provide a meaningful counterexample.

↑ comment by DavidHolmes · 2022-02-22T14:42:06.033Z · LW(p) · GW(p)

p.s.

For the more substantive results in section 4, I do believe the direction is always flat --> sharp.

I agree with this (with 'sharp' replaced by 'generalise', as I think you intend). It seems to me potentially interesting to ask whether this is necessarily the case.

Replies from: interstice

↑ comment by interstice · 2022-02-22T18:20:38.326Z · LW(p) · GW(p)

When I said "the direction is always flat-->sharp", I meant that their theorems showed you could produce a sharp minimum given a flat one, but not the other way around, sorry if I was unclear.

Definitely agreed that "under which conditions does flatness imply generalization" is a very interesting question. I think this paper has a reasonably satisfying analysis, although I also have some reservations [LW(p) · GW(p)] about "SGD as Bayesian sampler" picture.

comment by abramdemski · 2020-12-13T00:40:30.321Z · LW(p) · GW(p)

This is perhaps the most striking fundamental discoveries of machine learning in the past 20 years, and Evan's post is well-deserving of a nomination for explaining it to LW.

comment by johnswentworth · 2019-12-06T00:39:37.060Z · LW(p) · GW(p)

Off the top of your head, do you know anything about/have any hypotheses about how double descent interacts with the gaussian processes interpretation [LW · GW] of deep nets? It seems like the sort of theory which could potentially quantify the inductive bias of SGD.

Replies from: interstice

↑ comment by interstice · 2019-12-06T02:15:32.628Z · LW(p) · GW(p)

The neural tangent kernel guys have a paper where they give a heuristic argument explaining the double descent curve(in number of parameters) using the NTK.

comment by Ben Pace (Benito) · 2019-12-09T19:57:00.542Z · LW(p) · GW(p)

This post was on Hacker News for a while.

comment by DanielFilan · 2021-01-09T01:32:03.937Z · LW(p) · GW(p)

I think this paper does a good job at collecting papers about double descent into one place where they can be contrasted and discussed.
I am not convinced that deep double descent is a pervasive phenomenon in practically-used neural networks, for reasons described in Rohin’s opinion about Preetum et. al. [LW(p) · GW(p)]. This wouldn’t be so bad, except the limitations of the evidence (smaller ResNets than usual, basically goes away without label noise in image classification, some sketchy choices made in the Belkin et al experiments) are not really addressed or highlighted, which I think has a real prospect of misleading the reader. For instance, someone reading the post could be forgiven for not realizing that the colour plots of double descent in ResNet-18s only hold for 15% label noise.
Related to the above, the comments on this post seem pretty valuable to me, in terms of explaining non-obvious aspects and implications of the discussed paper. The speculation about the lottery ticket hypothesis is interesting but not obviously true. Papers that I have found useful for understanding this phenomenon include Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask and Linear Mode Connectivity and the Lottery Ticket Hypothesis.
Overall, my guess is that it is a mistake to think of double descent as a fact about modern machine learning, rather than a plausible hypothesis.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2021-01-13T18:09:31.763Z · LW(p) · GW(p)

Fwiw, I really liked Rethinking Bias-Variance Trade-off for Generalization of Neural Networks (summarized in AN #129), and I think I'm now at "double descent is real and occurs when (empirical) bias is high but later overshadowed by (empirical) variance". (Part of it is that it explains a lot of existing evidence, but another part is that my prior on an explanation like that being true is much higher than almost anything else that's been proposed.)

I was pretty uncertain about the arguments in this post and the followup [LW · GW] when they first came out. (More precisely, for any underlying claim; my opinions about the truth value of the claim seemed nearly independent of my beliefs in double descent.) I'd be interested in seeing a rewrite based on the bias-variance trade-off explanation; my current guess is that they won't hold up, but I haven't thought about it much.

Replies from: DanielFilan

↑ comment by DanielFilan · 2021-02-22T23:57:47.959Z · LW(p) · GW(p)

Finally got around to that one, and am also pretty into that explanation for the cases of double descent we observe. It also tentatively makes me want to say that the decrease in variance with model size is the 'real story'/primary thing we should think about.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-12-02T14:14:50.264Z · LW(p) · GW(p)

I'm no ML expert, but thanks to this post I feel like I have a basic grasp of some important ML theory. (It's clearly written and has great graphs.) This is a big deal because this understanding of deep double descent has shaped my AI timelines to a noticeable degree.

comment by [deleted] · 2020-08-29T16:48:59.338Z · LW(p) · GW(p)

I want to point out some recent work by Andrew Gordon Wilson's group - https://cims.nyu.edu/~andrewgw/#papers.

Particularly, https://arxiv.org/abs/2003.02139 takes a look a double descent from the perspective where they argue that parameters are a bad proxy of model complexity/capacity. Rather, effective dimensionality is what we should be plotting against and double descent effectively vanishes (https://arxiv.org/abs/2002.08791) when we use Bayesian model averaging instead of point estimates.

comment by Mark Xu (mark-xu) · 2020-12-18T15:39:36.570Z · LW(p) · GW(p)

This post gave a slightly better understanding of the dynamics happening inside SGD. I think deep double descent is strong evidence that something like a simplicity prior exists in SGG, which might have actively bad generalization properties, e.g. by incentivizing deceptive alignment. I remain cautiously optimistic that approaches like Learning the Prior can get circumnavigate this problem.

comment by johnswentworth · 2020-12-02T19:42:23.056Z · LW(p) · GW(p)

I found this post interesting and helpful, and have used it as a mental hook on which to hang other things. Interpreting what's going on with double descent, and what it implies, is tricky, and I'll probably write a proper review at some point talking about that.

comment by ESRogs · 2019-12-06T21:00:12.181Z · LW(p) · GW(p)

Before the interpolation threshold, the bias-variance tradeoff holds and increasing model complexity leads to overfitting, reducing test error.

Should that read, "... increasing test error"?

Replies from: evhub

↑ comment by evhub · 2019-12-06T23:34:32.033Z · LW(p) · GW(p)

Yep; good catch!

comment by Peter Wildeford (peter_hurford) · 2019-12-06T14:58:49.376Z · LW(p) · GW(p)

This paper also seems helpful: https://arxiv.org/pdf/1812.11118.pdf

comment by J L (jefferson-lee) · 2022-03-16T07:28:02.243Z · LW(p) · GW(p)

Apologies if it's obvious, but why the focus on SGD? I'm assuming it's not meant as shorthand for other types of optimization algorithms given the emphasis on SGD's specific inductive bias, and the Deep Double Descent paper mentions that the phenomena hold across most natural choices in optimizers.

Replies from: evhub

↑ comment by evhub · 2022-03-16T20:58:38.542Z · LW(p) · GW(p)

SGD is meant as a shorthand that includes other similar optimizers like Adam.

comment by Mark Xu (mark-xu) · 2020-09-21T05:34:37.003Z · LW(p) · GW(p)

https://arxiv.org/abs/1806.00952 gives a theoretical argument that suggests SGD will converge to a point that is very close in L2 norm to the initialization. Since NNs are often initialized with extremely small weights, this amounts to implicit L2 regularization.

comment by FactorialCode · 2019-12-06T18:42:53.042Z · LW(p) · GW(p)

I wonder if this is a neural network thing, an SGD thing, or a both thing? I would love to see what happens when you swap out SGD for something like HMC, NUTS or ATMC if we're resource constrained. If we still see the same effects then that tells us that this is because of the distribution of functions that neural networks represent, since we're effectively drawing samples from an approximation to the posterior. Otherwise, it would mean that SGD is plays a role.

what exactly are the magical inductive biases of modern ML that make interpolation work so well?

Are you aware of this work and the papers they cite?

From the abstract:

We prove that the binary classifiers of bit strings generated by random wide deep neural networks with ReLU activation function are biased towards simple functions. The simplicity is captured by the following two properties. For any given input bit string, the average Hamming distance of the closest input bit string with a different classification is at least sqrt(n / (2{\pi} log n)), where n is the length of the string. Moreover, if the bits of the initial string are flipped randomly, the average number of flips required to change the classification grows linearly with n. These results are confirmed by numerical experiments on deep neural networks with two hidden layers, and settle the conjecture stating that random deep neural networks are biased towards simple functions. This conjecture was proposed and numerically explored in [Valle Pérez et al., ICLR 2019] to explain the unreasonably good generalization properties of deep learning algorithms. The probability distribution of the functions generated by random deep neural networks is a good choice for the prior probability distribution in the PAC-Bayesian generalization bounds. Our results constitute a fundamental step forward in the characterization of this distribution, therefore contributing to the understanding of the generalization properties of deep learning algorithms.

I would field the hypothesis that large volumes of neural network space are devoted to functions that are similar to functions with low K-complexity, and small volumes of NN-space are devoted to functions that are similar to high K-complexity functions. Leading to a Solomonoff-like prior over functions.

Replies from: evhub

↑ comment by evhub · 2019-12-06T19:37:44.059Z · LW(p) · GW(p)

I wonder if this is a neural network thing, an SGD thing, or a both thing?

Neither, actually—it's more general than that. Belkin et al. show that it happens even for simple models like decision trees. Also see here for an example with polynomial regression.

Are you aware of this work and the papers they cite?

Yeah, I am. I definitely think that stuff is good, though ideally I want something more than just “approximately K-complexity.”

comment by avturchin · 2019-12-06T10:20:54.671Z · LW(p) · GW(p)

I observed and read about that it also happens with human learning. On the third lesson of X, I reached perfomance that I was not able reach again until 30th lesson.

Replies from: Pattern

↑ comment by Pattern · 2019-12-07T18:24:30.241Z · LW(p) · GW(p)

Are there papers on this? (Or did you find it somewhere else?)

Replies from: avturchin

↑ comment by avturchin · 2019-12-08T11:10:00.944Z · LW(p) · GW(p)

I read it somewhere around 10 years ago and don't remember the source. However, I remember an explanation they provided: that "correct answers" propagate quicker through brain's neural net, but later they become silenced by errors which arrive through longer trajectories. Eventually the correct answer is reinforced by learning and becomes strong again.

comment by Azade Rezaeezade (azade-rezaeezade) · 2022-01-03T15:00:37.155Z · LW(p) · GW(p)

Hi all, I am a bit new in deep learning filed and actually I use CNN and MLP to do side channel attacks. To do this attack I train models with power or electromagnetic signals leaking from small devices. It means that the training data I use are too noisy. The observation that brought me here was that when I increased the training sample size (from 50000, to 450000 with 50000 step size) my model performance did not increase or even decreased(we define model performance with a different metric, not the accuracy or other common ML performance metrics). So in the first glance I thought it could be a result of "sample-wise non-monotonicity" and it means that if I choose a deep model that is big enough for my sample size then I can run away from this non- monotonicity. Unfortunately it was not the case and even when I chose a model that has 5 times more parameters than my biggest training samples it still did not show increase in performance after increasing the training sample size. Now the point that comes to my mind is that maybe it is because my data is too noisy. In the Nakkiran et. al paper I saw more or less the impact of lable noise. But I am intrested to hear your oppinion about data noise.

Understanding “Deep Double Descent”

Contents

Prior work

“Deep Double Descent”

Final comments

51 comments