In the field of AI, the limited supply of training tokens has long been a pressing constraint. Recently, a study by a team of Chinese researchers has attracted widespread attention. It shows that, under a fixed token budget, diffusion language models demonstrate three times the data-learning potential of autoregressive models. This finding may open up new possibilities for training future language models.
At the core of the study is a 1-billion-parameter diffusion model trained for 480 epochs on 1 billion tokens. Without any special techniques or data filtering, the model reached 56% accuracy on HellaSwag and 33% on MMLU. More surprisingly, even when trained on such heavily repeated data, its performance showed no sign of saturating, indicating that it can keep extracting useful information from the same data.
The researchers attribute the strong data-learning ability of diffusion language models to two main factors. First, diffusion models use bidirectional modeling and diffusion objectives, which let them exploit the information in the data more fully, whereas traditional autoregressive models are restricted to a causal, left-to-right view of the text. Second, diffusion models have higher computational density: they invest more compute during training and inference, refining their predictions over multiple passes through the data and thereby improving overall performance.
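To make the contrast concrete, here is a minimal PyTorch-style sketch of the two training objectives: next-token prediction with a causal mask versus a masked-diffusion objective that corrupts a random fraction of positions and predicts them from bidirectional context. The tiny transformer, vocabulary size, masking scheme, and the `MASK_ID` token are illustrative assumptions, not the setup used in the study.

```python
# Sketch: autoregressive vs. masked-diffusion training objectives (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D_MODEL, SEQ_LEN = 1000, 64, 32
MASK_ID = VOCAB - 1  # reserve the last id as the [MASK] token

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens, causal=False):
        attn_mask = None
        if causal:  # autoregressive: each position sees only its left context
            attn_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.encoder(self.embed(tokens), mask=attn_mask)
        return self.head(h)

def autoregressive_loss(model, tokens):
    # Next-token prediction under a causal mask: one fixed target per position.
    logits = model(tokens[:, :-1], causal=True)
    return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

def masked_diffusion_loss(model, tokens):
    # Masked-diffusion objective: corrupt a random fraction of positions and
    # predict them from bidirectional context. The corruption rate is resampled
    # every step, so repeated data yields a different prediction problem each epoch.
    rate = torch.rand(())
    corrupt = torch.rand(tokens.shape) < rate
    noised = torch.where(corrupt, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(noised, causal=False)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), tokens.reshape(-1), reduction="none")
    return (loss * corrupt.reshape(-1).float()).sum() / corrupt.sum().clamp(min=1)

model = TinyLM()
batch = torch.randint(0, VOCAB - 1, (4, SEQ_LEN))
print("AR loss:", autoregressive_loss(model, batch).item())
print("Diffusion loss:", masked_diffusion_loss(model, batch).item())
```

Because each epoch re-corrupts the same text differently and the loss is computed from both directions, every pass over repeated data poses a somewhat new problem, which is one intuition for why repetition hurts less than in the autoregressive case.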
Although diffusion models are fairly robust to data reuse, the research team found that the model tends to overfit as the number of training epochs grows. Surprisingly, even after overfitting sets in, performance on downstream tasks does not immediately decline and sometimes keeps improving. The reason is that changes in validation loss are not always positively correlated with downstream accuracy: with limited training data, the model may become overconfident about certain text segments, which drives up the validation loss without necessarily changing which answers it ranks highest.
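A small numeric illustration of this decoupling, using made-up logits rather than numbers from the study: when one prediction becomes very confidently wrong, the cross-entropy loss rises sharply, while the argmax accuracy on the same examples stays unchanged.

```python
# Illustrative only: loss and task accuracy can move independently.
import torch
import torch.nn.functional as F

targets = torch.tensor([2, 0, 1, 3])
early = torch.tensor([[0., 0., 1.0, 0.],
                      [1.0, 0., 0., 0.],
                      [0., 1.0, 0., 0.],
                      [0., 1.2, 0., 1.0]])   # mildly wrong on the last example
late = torch.tensor([[0., 0., 4.0, 0.],
                     [4.0, 0., 0., 0.],
                     [0., 4.0, 0., 0.],
                     [0., 9.0, 0., 0.]])     # very confidently wrong on the last example

for name, logits in [("early", early), ("late", late)]:
    loss = F.cross_entropy(logits, targets).item()
    acc = (logits.argmax(dim=-1) == targets).float().mean().item()
    print(f"{name}: loss={loss:.2f}  accuracy={acc:.2f}")
# early: loss≈0.83, accuracy=0.75
# late:  loss≈2.29, accuracy=0.75  -> loss worsens, accuracy does not
```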
The findings offer new insight into how future AI models might be trained, especially under tight token budgets, where diffusion language models could see much broader application. In upcoming work, the research team plans to validate these findings further with larger models and more diverse data.