Densing Law of LLMs
post by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-12-08T19:35:09.244Z · LW · GW · 2 comments
This is a link post for https://arxiv.org/abs/2412.04315
Authors: Chaojun Xiao, Jie Cai, Weilin Zhao, Guoyang Zeng, Xu Han, Zhiyuan Liu, Maosong Sun.
Abstract (bolding mine):
Large Language Models (LLMs) have emerged as a milestone in artificial intelligence, and their performance can improve as the model size increases. However, this scaling brings great challenges to training and inference efficiency, particularly for deploying LLMs in resource-constrained environments, and the scaling trend is becoming increasingly unsustainable. This paper introduces the concept of "capacity density" as a new metric to evaluate the quality of the LLMs across different scales and describes the trend of LLMs in terms of both effectiveness and efficiency. To calculate the capacity density of a given target LLM, we first introduce a set of reference models and develop a scaling law to predict the downstream performance of these reference models based on their parameter sizes. We then define the effective parameter size of the target LLM as the parameter size required by a reference model to achieve equivalent performance, and formalize the capacity density as the ratio of the effective parameter size to the actual parameter size of the target LLM. Capacity density provides a unified framework for assessing both model effectiveness and efficiency. Our further analysis of recent open-source base LLMs reveals an empirical law (the densing law) that the capacity density of LLMs grows exponentially over time. More specifically, using some widely used benchmarks for evaluation, the capacity density of LLMs doubles approximately every three months. The law provides new perspectives to guide future LLM development, emphasizing the importance of improving capacity density to achieve optimal results with minimal computational overhead.
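For a concrete sense of the procedure the abstract describes, here is a minimal sketch in Python. The saturating functional form, the reference-model numbers, and the helper names are all illustrative assumptions, not the paper's actual scaling-law fit or data:

```python
# Minimal sketch of the capacity-density calculation; the functional form,
# reference-model numbers, and names below are illustrative assumptions,
# not the paper's actual fit.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical reference models: parameter counts (billions) and benchmark scores.
ref_params = np.array([0.5, 1.0, 3.0, 7.0, 13.0])
ref_scores = np.array([0.35, 0.42, 0.51, 0.58, 0.63])

def perf(n_params, a, b, c):
    """Assumed saturating relation between parameter size and downstream score."""
    return c - a * n_params ** (-b)

# Fit the scaling curve on the reference models.
(a, b, c), _ = curve_fit(perf, ref_params, ref_scores, p0=[0.3, 0.3, 0.8])

def effective_params(score):
    """Invert the fitted curve: the reference-model size needed to reach `score`
    (only meaningful for score < c)."""
    return ((c - score) / a) ** (-1.0 / b)

# Capacity density of a hypothetical 3B target model scoring 0.60 on the benchmark.
target_params, target_score = 3.0, 0.60
density = effective_params(target_score) / target_params
print(f"effective size ≈ {effective_params(target_score):.1f}B, density ≈ {density:.2f}")
```

The key step is the inversion: the target model's score is mapped back through the reference-model scaling curve to an "effective" parameter count, and density is that count divided by the target's actual parameter count.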
Seems like bad news when it comes to proliferation, but good news with respect to weak-forward-passes [LW(p) · GW(p)] and (especially latent) scheming.
2 comments
comment by Vladimir_Nesov · 2024-12-09T03:18:08.370Z · LW(p) · GW(p)
Effective parameter size is defined as the size a reference model would need to match the target performance when trained for 1T tokens (Section 2.4). It's hard to match the performance of models trained for 18T tokens by training a much larger model for only 1T tokens (though their theoretical assumptions imply this remains possible). When it's 2026-2027 and models are trained for 250T tokens (possibly by repeating the data [LW(p) · GW(p)]), it's going to take very large reference models indeed to match their performance by training for only 1T tokens.
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-12-11T12:59:31.157Z · LW(p) · GW(p)
'That means, around three months, it is possible to achieve performance comparable to current state-of-the-art LLMs using a model with half the parameter size.'
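A back-of-the-envelope restatement of that quoted rate (my own gloss, with the ~3-month doubling period taken from the quote rather than from any formula in the paper):

$$N_{\text{needed}}(t) \approx N_{\text{needed}}(t_0) \cdot 2^{-(t - t_0)/T}, \qquad T \approx 3\ \text{months},$$

so, holding benchmark performance fixed, a year of this trend would nominally shrink the required parameter count by a factor of about $2^{12/3} = 16$.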
If this trend continues, combined with (better / more extensible) inference scaling laws, it could soon make LM agents much more competitive on many AI R&D capabilities, including at much longer-horizon tasks.
E.g., figure 11 from RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts.
Also related: Before smart AI, there will be many mediocre or specialized AIs [LW · GW].