Recently, the developer community found that models in the MiniMax M2 series exhibited anomalies when outputting the specific name "Ma Jiaqi". MiniMax immediately conducted an end-to-end investigation and released a technical report revealing the underlying mechanism behind the phenomenon: low-frequency Token degradation caused by post-training.

Root Cause: The "Squeezed" Token

The analysis showed that the tokenizer splits the name "Ma Jiaqi" into two pieces: ['Ma', 'Jiaqi']. Because "Jiaqi" appeared frequently in the pre-training corpus, it had been merged into a standalone Token (ID 190467). In the post-training stage, however, which shapes the model's conversational ability, fewer than five training samples contained this Token.
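For readers who want to check the split themselves, here is a minimal sketch using the Hugging Face transformers tokenizer API. The checkpoint path is a placeholder and the Chinese rendering of the name is our assumption; the exact split and the ID 190467 are figures from the report, not outputs of this snippet.

```python
# Minimal sketch: inspect how the tokenizer splits the name.
# "path/to/minimax-m2" is a hypothetical placeholder for the real checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/minimax-m2")

name = "马嘉祺"  # assumed Chinese rendering of "Ma Jiaqi"
tokens = tokenizer.tokenize(name)
ids = tokenizer.encode(name, add_special_tokens=False)

# Per the report, the name splits into ['Ma', 'Jiaqi'], with 'Jiaqi'
# mapped to the standalone Token ID 190467.
print(list(zip(tokens, ids)))
```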

Because of this extremely low frequency, the Token's embedding was barely optimized during post-training and was gradually pushed aside by frequently updated high-frequency Tokens (such as code symbols and tool-call markers). In the end, the model still retained its knowledge about Ma Jiaqi but lost the ability to emit the corresponding Token, falling back on similar-sounding alternatives such as "Jiaqi" or "Qiqi."
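This "pushed away" behavior follows from how softmax cross-entropy updates the output embedding: a row whose Token never appears as a training target only ever receives gradient that lowers its logits. The toy simulation below is our own illustration of that mechanism (not MiniMax's code); it shows the probability of a never-trained Token at a fixed probe context collapsing while the other Tokens keep getting updated.

```python
# Toy illustration of low-frequency Token degradation under softmax
# cross-entropy: the row for a Token that never appears as a target only
# receives "negative" gradient (p_i * h), so it is steadily pushed away.
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 5, 16
W = rng.normal(scale=0.1, size=(vocab, hidden))  # output-embedding rows
lr = 0.05
RARE = 4                                         # this Token is never a training target

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

probe = rng.normal(size=hidden)                  # fixed "typical" context for monitoring
print("p(rare Token) before:", softmax(W @ probe)[RARE])

for _ in range(2000):
    h = rng.normal(size=hidden)                  # random training context
    y = rng.integers(0, vocab - 1)               # target is always a high-frequency Token
    p = softmax(W @ h)
    grad = np.outer(p, h)                        # d(cross-entropy)/dW = (p - onehot(y)) h^T
    grad[y] -= h
    W -= lr * grad

print("p(rare Token) after: ", softmax(W @ probe)[RARE])
```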

Chain Reaction: Forgetting Japanese and Junk Words

After scanning the roughly 200,000-Token vocabulary, MiniMax found that approximately 4.9% of the Tokens showed significant degradation. Japanese Tokens were hit hardest (a degradation rate of 29.7%), which explains why the model occasionally mixes Russian or Korean characters into Japanese conversations: the drifted Japanese Tokens end up confused with other languages in the embedding space.
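One way to approximate such a scan (our illustration, not necessarily MiniMax's methodology) is a behavioral test: for each Token in the vocabulary, ask the post-trained model to repeat it verbatim and check whether the Token ID actually shows up in the output. The checkpoint path, prompt template, and generation settings below are assumptions; a real run would batch the calls, use the model's chat template, and account for the fact that the same text can be emitted with a different tokenization.

```python
# Minimal sketch of a behavioral vocabulary scan: flag Tokens the model can
# no longer emit. "path/to/minimax-m2-post" is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/minimax-m2-post"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def can_generate(token_id: int) -> bool:
    piece = tok.decode([token_id])
    prompt = f"Repeat the following text exactly, with no extra words: {piece}\n"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    new_ids = out[0][inputs["input_ids"].shape[1]:].tolist()
    # May over-flag: the model can emit the same text with a different split.
    return token_id in new_ids

degraded = [i for i in range(len(tok)) if not can_generate(i)]
print(f"degraded fraction: {len(degraded) / len(tok):.1%}")
```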

In addition, the degradation list included a large number of Internet SEO spam terms (such as "private server" and "painless abortion"). Because these words almost never appear in conversational data, the model gradually "forgot" them during post-training.

Solution: Establish a "Minimum Generation Frequency"

To address this issue, MiniMax proposed three core repair strategies:

  1. Full-Vocabulary Synthetic Data: Construct repetition tasks so that every Token reaches a minimum practice frequency during the post-training phase (a sketch of such a data generator follows this list). With this fix in place, the Japanese confusion rate has dropped from 47% to 1%, and parameter stability across the entire vocabulary has improved significantly.

  2. Injecting Pre-training Corpus: Mix a proportion of the pre-training corpus into the SFT data, relying on its breadth to mitigate forgetting.

  3. Vocabulary Trimming and Monitoring: Remove redundant Tokens that are never used and include Token coverage in the post-training quality monitoring metrics.
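Strategy 1 can be pictured as a generator that walks the entire vocabulary and emits simple repetition samples until every Token has appeared as a generation target at least N times. The sketch below is our own illustration; the prompt template, JSONL format, and the minimum frequency of 5 are assumptions, not MiniMax's actual recipe.

```python
# Minimal sketch (assumed format, not MiniMax's pipeline) of building
# full-vocabulary repetition samples so every Token reaches a minimum
# generation frequency during post-training.
import json
import random
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/minimax-m2")  # hypothetical path
MIN_FREQ = 5  # illustrative "minimum generation frequency" per Token

def repetition_samples(min_freq=MIN_FREQ):
    samples = []
    for token_id in range(len(tok)):
        piece = tok.decode([token_id])
        if not piece.strip():          # skip whitespace-only pieces
            continue
        for _ in range(min_freq):
            # Wrap the Token in a short random context so the samples are not
            # identical copies of one fixed template.
            filler = "".join(random.choices("abcdefg ", k=8))
            samples.append({
                "prompt": f"Repeat the following text exactly: {filler}{piece}{filler}",
                # The response side is what guarantees the Token appears as a
                # generation target during post-training.
                "response": f"{filler}{piece}{filler}",
            })
    return samples

with open("full_vocab_repetition.jsonl", "w", encoding="utf-8") as f:
    for s in repetition_samples():
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
```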

Summary: