After 13 months of silence, Lilian Weng, former Vice President of Safety Research at OpenAI and co-founder of Thinking Machines Lab, published a long technical article titled "Scaling Laws, Carefully" on her personal blog Lil'Log, calling it "over three years late." The article re-examines the scaling laws that have underpinned hundreds of billions of dollars of investment in the large-model industry, and its core conclusion has unsettled many practitioners: the balance between parameters and training data that models have been built on may have been off from the beginning.
From Kaplan to Chinchilla: A Reversed Industry Consensus
The story begins in 2020, when Jared Kaplan and colleagues at OpenAI published a paper showing that training loss falls as a clean power law (a straight line on log-log axes) in parameter count, dataset size, and compute. Their conclusion was that model size should grow faster than data. GPT-3 was the product of that conclusion: 175 billion parameters trained on only 300 billion tokens.
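To make the "straight line on log-log axes" idea concrete, here is a minimal sketch that generates synthetic losses from a power law and recovers the exponent with an ordinary least-squares line in log space. The constants `ALPHA` and `N_C` and the model sizes are hypothetical values chosen for illustration, not figures from any paper.

```python
import math

def fit_power_law(sizes, losses):
    """Fit L = prefactor * N**(-exponent) by least squares on (log N, log L)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    # A power law has a negative slope in log space; report the exponent as positive.
    return -slope, math.exp(intercept)

# Hypothetical constants for a noise-free synthetic curve L(N) = (N_C / N) ** ALPHA.
ALPHA = 0.08
N_C = 1e13
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [(N_C / n) ** ALPHA for n in sizes]

alpha_hat, prefactor = fit_power_law(sizes, losses)
print(f"recovered exponent: {alpha_hat:.4f}")
```

With noise-free data the fit recovers the exponent essentially exactly; real training runs add noise, which is where the fitting choices discussed in the article start to matter.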
Two years later, a DeepMind team ran a larger-scale experiment that overturned this conclusion. They compared Gopher, with 280 billion parameters, against Chinchilla, with 70 billion parameters, under the same compute budget. Chinchilla had only a quarter of Gopher's parameters but roughly four times the training data, and it outperformed Gopher on nearly all evaluations. Chinchilla showed that parameters and data should grow in proportion, at a ratio of about 1:20 (roughly 20 training tokens per parameter), rather than the parameter explosion and slow data growth that Kaplan's analysis implied. This also helps explain why later models such as Llama and DeepSeek could have fewer parameters than GPT-3 yet perform far better.
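The proportional-growth rule can be turned into simple arithmetic. The sketch below assumes the widely used approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D), which is not stated in the article, and combines it with the 20-tokens-per-parameter rule of thumb. The function name and the compute budgets are illustrative.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumed approximation: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # rule-of-thumb ratio of about 1:20

def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Return (parameters, tokens) that spend the budget at a fixed tokens/param ratio."""
    # Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

small_params, small_tokens = compute_optimal_split(1e21)
large_params, large_tokens = compute_optimal_split(4e21)

print(f"1e21 FLOPs: {small_params:.3e} params, {small_tokens:.3e} tokens")
print(f"4e21 FLOPs: {large_params:.3e} params, {large_tokens:.3e} tokens")

# For comparison, GPT-3's reported figures give under 2 tokens per parameter.
gpt3_tokens_per_param = 300e9 / 175e9
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is what "growing in proportion" means in practice.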
Weng traces Kaplan's bias to two causes. First, the largest model in his experiments had only 1.5 billion parameters, so fitting differences within that small-scale range were extrapolated all the way to the trillion-parameter level, producing systematic error. Second, Kaplan left embedding-layer parameters out of the parameter count, a choice that matters a great deal for small models. More surprisingly, when the Epoch AI team replicated the Chinchilla fit in 2024, they found two problems: the loss was averaged instead of summed, which caused the optimizer to misjudge convergence, and the core power-law exponents were reported rounded to two digits, discarding precision that the extrapolation depends on. After correction, the data once again supported the proportional-growth conclusion.
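A rough sketch of why a small difference in a fitted exponent becomes a large disagreement after extrapolation. It assumes the commonly quoted scaling exponents for optimal model size versus compute, about 0.73 from Kaplan's paper and about 0.5 from Chinchilla; these numbers do not appear in the article, and the anchor point where both rules are made to agree is hypothetical.

```python
def optimal_params(compute, exponent, ref_compute, ref_params):
    """Extrapolate optimal model size as N_opt = ref_params * (C / ref_compute) ** exponent."""
    return ref_params * (compute / ref_compute) ** exponent

KAPLAN_EXPONENT = 0.73      # assumed, as commonly quoted
CHINCHILLA_EXPONENT = 0.50  # assumed, as commonly quoted

# Hypothetical anchor: both rules agree on a 1e9-parameter model at 1e21 FLOPs.
REF_COMPUTE = 1e21
REF_PARAMS = 1e9
TARGET_COMPUTE = 1e27  # six orders of magnitude beyond the anchor

kaplan_n = optimal_params(TARGET_COMPUTE, KAPLAN_EXPONENT, REF_COMPUTE, REF_PARAMS)
chinchilla_n = optimal_params(TARGET_COMPUTE, CHINCHILLA_EXPONENT, REF_COMPUTE, REF_PARAMS)
gap = kaplan_n / chinchilla_n

print(f"Kaplan-style rule:     {kaplan_n:.2e} parameters")
print(f"Chinchilla-style rule: {chinchilla_n:.2e} parameters")
print(f"disagreement factor:   {gap:.1f}x")
```

The two rules start from the same point, yet six orders of magnitude of compute later they recommend model sizes that differ by more than a factor of twenty.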
The Data Wall Is Coming: The Marginal Value of Repeated Training Decays Exponentially
All of the above assumes that training data is unlimited and never repeated, but high-quality text data is expected to run out between 2026 and 2028. Research shows that the effective value of repeated data decays exponentially, so each additional epoch over the same data yields rapidly diminishing returns. The interactive simulator Weng embedded in the article demonstrates how sensitive the results are to engineering details: even slight changes in fitting precision or noise level can throw the predictions completely off.
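One way to picture exponentially decaying returns from repeated data is the sketch below. It uses a saturating-exponential form of the kind found in the data-constrained scaling literature, in which each additional pass over the same unique tokens contributes less than the one before. The function name and the decay constant `r_star = 15.0` are illustrative assumptions, not values taken from the article.

```python
import math

def effective_tokens(unique_tokens, epochs, r_star=15.0):
    """Effective unique-data equivalent after training for `epochs` passes.

    The first pass counts in full; the marginal value of each further
    repetition decays as exp(-repeats / r_star). r_star is an assumed constant.
    """
    repeats = max(epochs - 1, 0)
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

UNIQUE = 1e12  # hypothetical pool of unique tokens
values = [effective_tokens(UNIQUE, e) for e in (1, 2, 4, 16, 64)]
gains = [later - earlier for earlier, later in zip(values, values[1:])]

for epochs, value in zip((1, 2, 4, 16, 64), values):
    print(f"{epochs:>3} epochs -> {value / UNIQUE:.2f}x unique-data equivalent")
```

Under this form the total is capped at `(1 + r_star)` times the unique pool no matter how many epochs are run, which is the "wall" in quantitative terms.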
Weng closes with a judgment refined over three years: scaling laws are not physical laws; they are empirical observations that are highly sensitive to engineering details.
