Recently, MiniMax released a technical report that dug into why its M2-series models fail to output certain names, such as "Ma Jiaqi," accurately. What looks like an incidental error in fact exposes a hidden defect common to current large-model training.
Token Drift: The Squeezed Vector Space
The core issue stems from the tokenizer, the component that defines the basic units of text a large model processes. Take "Ma Jiaqi" as an example: inside the model, the name is split into two tokens, "Ma" and "Jiaqi." Although the model learned these pieces during pre-training on massive data, problems emerged in the subsequent instruction fine-tuning stage.
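To make the mechanics concrete, here is a minimal Python sketch of greedy longest-match subword tokenization; the toy vocabulary and token IDs are invented for illustration and are not MiniMax's actual tokenizer.

```python
# Toy illustration of how a subword tokenizer splits a name into pieces.
# The vocabulary entries and IDs below are hypothetical, chosen only to
# mirror the "Ma" + "Jiaqi" split described in the report.
TOY_VOCAB = {
    "Ma": 501,      # common surname piece, heavily reinforced in training
    "Jiaqi": 9742,  # rare given-name piece, barely seen during fine-tuning
    "Qiqi": 9743,   # similar-sounding neighbor the model may fall back on
}

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match tokenization over a toy vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate first
            piece = text[i:j].strip()
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            i += 1                          # skip characters not in the vocabulary
    return ids

print(tokenize("Ma Jiaqi", TOY_VOCAB))      # -> [501, 9742]
```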

Because the token "Jiaqi" appears only rarely in the dialogue data selected for fine-tuning, it was left in a near zero-training state. At the same time, high-frequency tokens such as code symbols and tool-call markers were continuously reinforced, steadily crowding the low-frequency tokens out of their "living space." Eventually these low-frequency tokens drifted out of the probability range the model actually samples from, so when asked to name the celebrity, the model falls back on near-homophones, writing "Jiaqi" with a different character or substituting "Qiqi."
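The squeezing dynamic is easy to reproduce in miniature. The PyTorch toy below trains a bare logit vector with cross-entropy on targets that never include one "rare" token; softmax normalization alone drives that token's probability toward zero, mirroring what happens to under-covered tokens during fine-tuning. This is an illustrative simulation, not MiniMax's training setup.

```python
# Simulate fine-tuning that never rewards one rare token: every gradient
# step pushes some frequent token's logit up and, via softmax coupling,
# pushes the rare token's logit down until it is effectively unreachable.
import torch

vocab_size, steps = 10, 500
logits = torch.zeros(vocab_size, requires_grad=True)  # stand-in for token scores
opt = torch.optim.SGD([logits], lr=0.1)

RARE = 9  # a token (like "Jiaqi") that never appears in the fine-tuning data
for _ in range(steps):
    target = torch.randint(0, vocab_size - 1, (1,))   # RARE is never the target
    loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

probs = torch.softmax(logits, dim=0)
print(f"rare-token probability: {probs[RARE].item():.4f} "
      f"(uniform start: {1 / vocab_size:.4f})")
```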
Beyond Chinese: A Chain Reaction of Russian Creeping into Japanese
MiniMax's investigation showed that this "token degradation" is not an isolated case. A scan of the roughly 200,000 tokens in the full vocabulary found that about 4.9% of them exhibited significant performance decline. Japanese tokens were hit hardest, with a degradation rate as high as 29.7%, which is also the root cause of the model occasionally mixing Russian or Korean characters into Japanese conversations.
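As a rough sketch of what such a vocabulary scan might look like, the code below assumes a Hugging Face-style causal language model and checks, token by token, whether the model can still echo each piece with reasonable probability. The echo prompt and the 0.5 threshold are arbitrary placeholders, not MiniMax's published criteria.

```python
# Flag tokens the model can no longer reliably emit. `model` and
# `tokenizer` are assumed to follow the standard Hugging Face causal-LM
# interface; the detection rule here is a simplification for illustration.
import torch

@torch.no_grad()
def scan_vocab(model, tokenizer, threshold: float = 0.5) -> list[int]:
    degraded = []
    for token_id in range(tokenizer.vocab_size):
        piece = tokenizer.decode([token_id])
        prompt = tokenizer.encode(f"Repeat exactly: {piece}\n", return_tensors="pt")
        next_logits = model(prompt).logits[0, -1]        # next-token distribution
        prob = torch.softmax(next_logits, dim=-1)[token_id]
        if prob < threshold:                             # token is barely reachable
            degraded.append(token_id)
    return degraded
```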

Beyond names and foreign languages, the affected tokens also include LaTeX formula markers, Wikipedia source-code symbols, and even some SEO spam keywords. This finding shows that the consequences of data sparsity are global: when post-training data fails to cover different languages and specialized vocabulary evenly, the model's generation behavior develops systematic biases.
Systematic Repair: Establishing a "Minimum Guarantee" for 200,000 Tokens
To address this structural problem, the R&D team adopted a targeted repair scheme. They created synthetic data covering the entire vocabulary and had the model perform "repetition" tasks, thereby establishing a "minimum guarantee" on the generation frequency of every token.
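A minimal sketch of generating that kind of coverage data follows; the prompt template, repetition count, and example format are assumptions for illustration, not MiniMax's actual recipe.

```python
# Build synthetic "repetition" examples so that every vocabulary token is
# guaranteed to appear as a training target at least a few times.
def build_repetition_dataset(tokenizer, min_count: int = 4) -> list[dict]:
    examples = []
    for token_id in range(tokenizer.vocab_size):
        piece = tokenizer.decode([token_id])
        examples.append({
            "prompt": f"Repeat the following text exactly: {piece}",
            "completion": piece * min_count,  # token appears min_count times
        })
    return examples
```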
Post-repair evaluations show a significant improvement in the stability of the model's output across the full vocabulary, and the proportion of foreign characters mixed into Japanese responses dropped from 47% to just 1%. The team is still exploring deeper optimizations, such as mixing pre-training material into the fine-tuning stage or pruning redundant, no-longer-used tokens from the vocabulary.
The incident has prompted deeper reflection across the industry: large-model tokenizers are typically built from broad web corpora, yet downstream application scenarios vary widely. How to guarantee token-level data coverage, in a basic statistical sense, while still pursuing semantic diversity will be a key challenge for improving the reliability of large models.
