# [AN #125]: Neural network scaling laws across multiple modalities

post by rohinmshah · 2020-11-11T18:20:04.504Z · LW · GW · 7 comments

## Contents

SECTIONS
HIGHLIGHTS
TECHNICAL AI ALIGNMENT
MESA OPTIMIZATION
FORECASTING
OTHER PROGRESS IN AI
REINFORCEMENT LEARNING
HIGHLIGHTS
TECHNICAL AI ALIGNMENT
OPTIMIZATION

OTHER PROGRESS IN AI
LEARNING
FEEDBACK
PODCAST
None

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

# SECTIONS

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

MESA OPTIMIZATION

FORECASTING

OTHER PROGRESS IN AI

REINFORCEMENT LEARNING

# HIGHLIGHTS

Scaling Laws for Autoregressive Generative Modeling (Tom Henighan, Jared Kaplan, Mor Katz et al) (summarized by Asya): This paper looks at scaling laws for generative Transformer models of images (predicting pixels or parts of image encodings), videos (predicting frames of image encodings), multimodal image <-> text (predicting captions based on images or images based on captions), and mathematical problem solving (predicting answers to auto-generated questions about algebra, arithmetic, calculus, comparisons, integer properties, measurement, polynomials, and probability). The authors find that:

- Cross-entropy loss as a function of compute follows a power law + constant in all these data modalities (just as it does in language (AN #87)). Information theoretically, this can be interpreted as scaling a 'reducible loss' which estimates the KL divergence between the true and model distributions, and an 'irreducible loss' which estimates the entropy of the true data distribution.

- Performance on ImageNet classification fine-tuned from their generative image model also follows such a power law, whereas ImageNet classification trained from scratch actually gets worse with sufficiently large model sizes. Interestingly, this classification power law continues even past model sizes where the generative cross-entropy loss starts bending as a result of irreducible loss. The authors conclude that approaching the irreducible loss for some dataset does not necessarily indicate diminishing returns for representation quality or semantic content.

- Optimal model size as a function of compute follows a power law with an exponent very close to ~0.7 for all data modalities they've studied so far. This implies that in the current compute regime, as compute budgets grow, it's best to devote a majority of compute towards making models bigger and a minority towards training on more data.

- Larger models perform better on extrapolating to math problems more difficult than those seen in training, but only insofar as they do better on the training distribution (no benefits to 'strong generalization').

- Larger models are able to take advantage of more multimodal information, but the scaling is extremely slow-- a 1-billion-parameter model uses 10% of the information in a caption to define an image, while using 20% of the information would require a 3-trillion-parameter model.

As in the language models paper (AN #87), extrapolating the steep power laws found for optimally-used compute seems to eventually paradoxically result in loss lower than the bound given by shallower power laws for optimally-used training data. The authors offer a potential hypothesis for resolving this inconsistency-- in the regime of less compute and smaller model sizes, increasing model size effectively increases the amount of information you extract from each data point you train on, resulting in the steepness of the current compute law. As compute increases past a certain point, however, the amount of information extracted per data point approaches the maximum amount possible, so the curve switches to a shallower regime and marginal compute should be used increasingly on dataset increases rather than model size increases. If this hypothesis is true, we should eventually expect the scaling laws for compute to bend towards laws set by dataset size, and perhaps should think they will ultimately be set by trends for overfitting (see this post [AF · GW] for another explanation of this).

Read more: the scaling “inconsistency”: openAI’s new insight [AF · GW]

Asya's opinion: I would also recommend listening to Jared Kaplan's talk on this.

I was really excited to learn about more empirical work here. These results suggest that scaling behavior predictable with smooth power-laws is likely a feature of most generative models, not just text. I found it surprising that optimal model size given a compute budget scales the same way across data modalities-- it does seem to suggest that there's something more fundamental going on here that I don't understand (but which may be explained in this theory paper that I haven't read). It's also interesting that pretraining on a generative model (rather than training from scratch) seems to confer real benefits to scaling behavior for image classification-- this lends some support to the view that a lot of the learning that needs to happen will come from unsupervised settings.

A lot of the most salient questions around current scaling laws for me still lie in the translation between cross-entropy loss in these domains and performance on downstream tasks we care about. I feel very unsure about whether any of the fine-tuned generative models we (currently) have the data to train are likely to have transformative performance within even the next 5 orders of magnitude of compute scaling.

Rohin's opinion: In addition to the points Asya made above, I wanted to speculate on the implications of these scaling laws for AGI. I was particularly struck by how well these scaling laws seem to fit the data. This was also true in the case of mathematics problems, at least for the models we have so far, even though intuitively math requires “reasoning”. This suggests to me that even for tasks that require reasoning, capability will increase smoothly along a spectrum, and the term “reasoning” is simply a descriptor of a particular capability level. (An alternative position is that “reasoning” happens only to the extent that the neural net is implementing an algorithm that can justifiably be known to always output the right answer, but this sort of definition usually implies that humans are not doing reasoning, which seems like a deal-breaker.)

Note however that we haven't gotten to the level of performance that would be associated with "reasoning", so it is still possible that the trends stop holding and reasoning then leads to some sort of discontinuous increase in performance. I just wouldn't bet on it.

# TECHNICAL AI ALIGNMENT

## MESA OPTIMIZATION

Confucianism in AI Alignment [AF · GW] (John Wentworth) (summarized by Rohin): Suppose we trained our agent to behave well on some set of training tasks. Mesa optimization (AN #58) suggests that we may still have a problem: the agent might perform poorly during deployment, because it ends up optimizing for some misaligned mesa objective that only agrees with the base objective on the training distribution.

This post suggests that in any training setup in which mesa optimizers would normally be incentivized, it is not sufficient to just prevent mesa optimization from happening. The fact that mesa optimizers could have arisen means that the incentives were bad. If you somehow removed mesa optimizers from the search space, there would still be a selection pressure for agents that without any malicious intent end up using heuristics that exploit the bad incentives. As a result, we should focus on fixing the incentives, rather than on excluding mesa optimizers from the search space.

Clarifying inner alignment terminology [AF · GW] (Evan Hubinger) (summarized by Rohin): This post clarifies the author’s definitions of various terms around inner alignment. Alignment is split into intent alignment and capability robustness, and then intent alignment is further subdivided into outer alignment and objective robustness. Inner alignment is one way of achieving objective robustness, in the specific case that you have a mesa optimizer. See the post for more details on the definitions.

Rohin's opinion: I’m glad that definitions are being made clear, especially since I usually use these terms differently than the author. In particular, as mentioned in my opinion on the highlighted paper, I expect performance to smoothly go up with additional compute, data, and model capacity, and there won’t be a clear divide between capability robustness and objective robustness. As a result, I prefer not to divide these as much as is done in this post.

## FORECASTING

Measuring Progress in Deep Reinforcement Learning Sample Efficiency (Anonymous) (summarized by Asya) (H/T Carl Shulman): This paper measures historic increases in sample efficiency by looking at the number of samples needed to reach some fixed performance level on Atari games and virtual continuous control tasks. The authors find exponential progress in sample efficiency, with estimated doubling times of 10 to 18 months on Atari, 5 to 24 months on state-based continuous control, and 4 to 9 months on pixel-based continuous control, depending on the specific task and performance level. They find that these gains were mainly driven by improvements in off-policy and model-based deep RL learning approaches, as well as the use of auxiliary learning objectives to speed up representation learning, and not by model size improvements. The authors stress that their study is limited in studying only the published training curves for only three tasks, not accounting for the extent to which hyperparameter tuning may have been responsible for historic gains.

Asya's opinion: Following in the footsteps of AI and Efficiency (AN #99), here we have a paper showing exponential gains in sample efficiency in particular. I'm really glad someone did this analysis-- I think I'm surprised by how fast progress is, though as the paper notes it's unclear exactly how to relate historic improvements on fixed task performance to a sense of overall improvement in continuous control (though several of the main contributors listed in the appendix seem fairly general). I also really appreciate how thorough the full paper is in listing limitations to this work.

Since these papers are coming up in the same newsletter, I'll note the contrast between the data-unlimited domains explored in the scaling laws paper and the severely data-limited domain of real-world robotics emphasized in this paper. In robotics, it seems we are definitely still constrained by algorithmic progress that lets us train on fewer samples (or do better transfer from simulations (AN #72)). Of course, maybe progress in data-unlimited domains will ultimately result in AIs that make algorithmic progress in data-limited domains faster than humans ever could.

# OTHER PROGRESS IN AI

## REINFORCEMENT LEARNING

DeepSpeed: Extreme-scale model training for everyone (DeepSpeed Team et al) (summarized by Asya): In this post, Microsoft announces updates to DeepSpeed, its open-source deep learning training optimization library. The new updates include:

- '3D parallelism', a scheme for carefully optimizing how training runs are split across machines. Training runs that use 3D parallelism demonstrate linear scaling of GPU memory and compute efficiency, enabling the theoretical training of extremely large models of over a trillion parameters on as few as 800 NVIDIA V100 GPUs.

- 'ZeRO-Offload', which allows CPU memory to be used during training runs, enabling running models of up to 13 billion parameters on a single NVIDIA V100 GPU.

- 'DeepSpeed Sparse Attention', an instrumental technology that reduces the compute and memory requirements of attention computations used in models like Transformers. Compared to models that use densely computed attention, this enables models that pay attention to sequences that are 10x longer and can be trained up to 6.3x faster.

- '1-bit Adam', a scheme for compressing the communication requirements between machines doing training runs that use the Adam gradient descent optimizer. 1-bit Adam enables up to 5x less communication and up to 3.5x faster training runs.

Fast reinforcement learning through the composition of behaviours (André Barreto et al) (summarized by Flo): While model-based RL agents can easily adapt their policy to changed rewards on the same environment, planning is expensive and learning good models can be challenging for many tasks. On the other hand, it is challenging to get model-free agents to adapt their policy to a new reward without extensive retraining. An intermediate solution is to use so-called successor features: Instead of a value function V(π,s) representing the expected discounted reward for a policy π starting in state s, successor features are a vector-valued value function ψ(π,s) representing an expected discounted feature vector ϕ. If our reward equals r = w ⋅ ϕ for some weight vector w, we can easily obtain the original value function by taking the scalar product of the successor features and the weight vector: V(π,s) = w ⋅ ψ(π,s). Successor features thus allow us to evaluate a fixed policy π for all rewards that are linear in ϕ, which is called generalized policy evaluation.

Now that we can evaluate policies for different preferences, we would like to efficiently find a good policy for a given novel preference. Inspired by human learning that often combines previously learned skills, we employ generalized policy improvement. In vanilla policy improvement, we improve upon a policy π we can evaluate by choosing the action that maximizes the immediate reward plus the discounted value V(π,s') of following π starting in the next state s'. In generalized policy improvement, we have multiple policies and choose the action that maximizes the reward plus the discounted value of following the best of these policies starting in the next state s'. To obtain a policy for the new preference, we "stitch together" all policies we learnt for previous preferences and the resulting policy performs at least as good as all of the old policies with respect to the new preference. As generalized policy improvement does not require any additional environment samples, it enables zero-shot transfer to new preferences. Empirically, even if the weight vector w has to be learnt from reward signals, generalized policy improvement is very sample efficient. Additional samples can then be used to further improve the policy using standard RL.

Flo's opinion: I really like the idea of successor features. Similar to model-based systems, they allow us to evaluate policies for many different rewards, which can be useful for anticipating problematic behaviour before deploying a system. However, note that we still need to execute the policy we obtained by generalized policy improvement to evaluate it for different rewards: The only guarantees we have is that it is better than the previous policies for the reward for which the improvement step was carried out (and potentially some weaker bounds based on the similarity of different rewards).

γ-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction (Michael Janner et al) (summarized by Flo): Long planning horizons are often necessary for competitive performance of model-based agents, but single-step models get less and less accurate with longer planning horizons as errors accumulate. Model-free algorithms don't have this problem but are usually reward- and policy-specific, such that transfer to other tasks can be hard. The paper proposes policy-specific γ-models as an intermediate solution: instead of learning the distribution of the next state given a state-action pair (s,a), or the final state of an n-step rollout given (s,a) and a policy π, it learns the distribution of a rollout with a stochastic, geometrically distributed length. Unlike for n-step models with n>1, the distribution follows a Bellman-style decomposition into the single-step distribution and the discounted distribution for the next state s', which allows for off-policy training of the model by bootstrapping the target distribution.

Now, if rewards are consequentialist in the sense that they only depend on the state, the expected reward under this distribution is equal to 1-γ times the Q-value for π of (s,a) such that we can use the model for policy evaluation given arbitrary consequentialist rewards. Similar to how single-step models (0-models) can be rolled out to obtain (less accurate) multi-step models, sequential rollouts of a γ-model can be reweighed to obtain a γ-model with larger γ. While this introduces some error, it reduces the bootstrap error during training, which grows with γ. Being able to interpolate between rollouts of single-step models that accumulate error during testing and models with large γ that accumulate error during training allows us to find a sweet spot between the two extremes.

In practice, single-step models are often used for model-based value expansion (MVE), where only N steps are rolled out and a value function is used for evaluating longer-term consequences. The authors' algorithm, γ-MVE instead uses N rollouts of the γ-model and adjusts the weighing of the value function accordingly. γ-MVE performs strongly both in terms of sample efficiency and final performance on a set of low-dimensional continuous control tasks.

Flo's opinion: I am a bit surprised that this works so well, as both bootstrapping and learning generative models for distributions can be unstable and the method combines both. On the other hand, there is a long tradition of continuous interpolations between different RL algorithms and their performance at the sweet spot is often significantly stronger than at the extremes.

#### FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

#### PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

comment by adamShimi · 2020-11-12T00:29:36.711Z · LW(p) · GW(p)

As always, thanks to everyone involved for the newsletter! I'm usually particularly interested in the other RL/Deep Learning papers, as those are the ones I have less chance to find on my own.

On this newsletter, I especially enjoyed the summaries and opinions about the two scaling papers, and the comparison between the two.

comment by Charlie Steiner · 2020-11-11T23:54:41.120Z · LW(p) · GW(p)

Cool papers, Flo!

comment by adamShimi · 2020-11-11T18:34:43.123Z · LW(p) · GW(p)

I don't know what happened, but for me the visuals are a mess. Each part of the newsletter is separated by many screens of blank space, the comment section appears just after the "podcast" section, which makes sense, except that I see everything else below the "podcast" section, even if it should be the last section.

Replies from: rohinmshah
comment by rohinmshah · 2020-11-11T19:11:05.182Z · LW(p) · GW(p)

Known problem [LW(p) · GW(p)], should be fixed in the next few hours.