A paradigm-shifting result has emerged in the field of AI vision generation. MiniMax and Huazhong University of Science and Technology recently open-sourced their core technology, VTP (Visual Tokenizer Pretraining), which achieves a 65.8% improvement in end-to-end image generation performance by optimizing only the visual tokenizer, without modifying the standard DiT (Diffusion Transformer) architecture. This result challenges the industry's conventional belief that only larger models can improve performance, and for the first time elevates the visual tokenizer to a central role in scaling generation quality.

Leave the main model alone, change the "translator": a 65.8% performance leap

Traditional generation models (such as DALL·E 3 and Stable Diffusion 3) improve performance by scaling up the main network, such as the DiT backbone. VTP takes a different approach: it focuses on optimizing the visual tokenizer, the "visual translator" responsible for compressing images into compact latent token sequences.

The key point is that VTP does not modify the DiT's architecture or training procedure at all. It only optimizes the tokenizer during its own pretraining phase, making the latent representations it outputs easier to learn and more general, so the downstream DiT gets far more out of the same training budget. Experiments show that, under an identical DiT configuration, systems using VTP significantly outperform the baseline in generation quality on metrics such as FID and CLIP Score.
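To make the division of labor concrete, here is a minimal PyTorch sketch of the two-stage setup the article describes. Everything in it (ToyVisualTokenizer, ToyDiT, the simplified noising step) is an illustrative placeholder, not MiniMax's code: a separately pretrained tokenizer is frozen, and an unchanged DiT is trained in its latent space.

```python
# Minimal sketch (assumed structure, not MiniMax's code) of training an
# unmodified DiT on latents from a separately pretrained visual tokenizer.
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    """Stand-in for a pretrained tokenizer: image -> compact latent grid."""
    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.encoder = nn.Conv2d(3, latent_channels, kernel_size=8, stride=8)
        self.decoder = nn.ConvTranspose2d(latent_channels, 3, kernel_size=8, stride=8)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

class ToyDiT(nn.Module):
    """Stand-in for an unmodified DiT operating on tokenizer latents."""
    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.net = nn.Conv2d(latent_channels, latent_channels, 3, padding=1)

    def forward(self, z_noisy: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # A real DiT conditions on the timestep t; elided in this toy.
        return self.net(z_noisy)

tokenizer = ToyVisualTokenizer().eval()   # pretrained (e.g. with VTP), then frozen
for p in tokenizer.parameters():
    p.requires_grad_(False)

dit = ToyDiT()
opt = torch.optim.AdamW(dit.parameters(), lr=1e-4)

images = torch.randn(2, 3, 256, 256)      # dummy batch
with torch.no_grad():
    z = tokenizer.encode(images)          # images -> latents, tokenizer untouched

t = torch.rand(z.shape[0])                # random diffusion timesteps
noise = torch.randn_like(z)
z_noisy = z + t.view(-1, 1, 1, 1) * noise # heavily simplified noising schedule
loss = ((dit(z_noisy, t) - noise) ** 2).mean()  # standard noise-prediction loss
loss.backward()
opt.step()
```

The point the article makes is that all of the VTP-specific work happens before this loop ever runs; the loop itself stays the standard latent-diffusion recipe.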


Establishing a theoretical framework for "tokenizer scalability" for the first time

VTP's breakthrough is not merely an engineering optimization; it also introduces a new theoretical perspective:

- For the first time, it explicitly links the learnability of latent representations to the generality of the visual representations a tokenizer produces;

- For the first time, it demonstrates that the tokenizer itself is scalable (tokenizer scaling): as tokenizer capacity, training data, and pretraining strategies scale up, generation performance follows a clear scaling curve (a generic scaling-law form is sketched after this list);

- It opens a new path for performance growth outside the main model: instead of endlessly expanding DiT parameters, future systems may achieve more cost-effective performance gains by optimizing the tokenizer.
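One way to picture such a curve is the power-law form familiar from language-model scaling studies. The expression below is an illustrative sketch, not the paper's fitted formula; the choice of FID as the metric, the variables $N_{\mathrm{tok}}$ (tokenizer capacity) and $D_{\mathrm{tok}}$ (tokenizer pretraining data), and the constants $E_{\infty}, A, B, \alpha, \beta$ are all assumptions made here for exposition:

```latex
% Illustrative scaling-law form (assumed, not the paper's exact fit):
% generation error decreases predictably as tokenizer capacity N_tok
% and tokenizer pretraining data D_tok grow.
\mathrm{FID}(N_{\mathrm{tok}}, D_{\mathrm{tok}})
  \;\approx\; E_{\infty}
  \;+\; A \, N_{\mathrm{tok}}^{-\alpha}
  \;+\; B \, D_{\mathrm{tok}}^{-\beta}
```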


Fully open-sourced, promoting the democratization of visual generation

Currently, the VTP code, pretrained tokenizer, and training recipe are fully open-sourced and compatible with mainstream DiT implementations. This means any researcher or company using the DiT architecture can plug VTP in and, at low cost, capture the reported 65.8% improvement in generation quality, which especially benefits small and medium-sized teams with limited compute. A hedged sketch of what that swap looks like follows below.
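The snippet below illustrates the "plug and play" claim in a generic latent-diffusion pipeline. The class, checkpoint name, and loading pattern are placeholders invented here; the repo's actual entry points are defined at the GitHub link at the end of this article.

```python
# Hypothetical "plug and play" swap into an existing latent-diffusion script.
# TokenizerStub and "vtp_tokenizer.pt" are placeholders, not the repo's API.
import torch
import torch.nn as nn

class TokenizerStub(nn.Module):
    """Placeholder exposing the encode() surface a VTP tokenizer would offer."""
    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.encoder = nn.Conv2d(3, latent_channels, kernel_size=8, stride=8)

    @torch.no_grad()
    def encode(self, images: torch.Tensor) -> torch.Tensor:
        return self.encoder(images)

# Before: z = old_vae.encode(images)
# After:  only the tokenizer changes; the DiT and its training loop do not.
tokenizer = TokenizerStub().eval()
# In practice you would load the released weights here, e.g.:
# tokenizer.load_state_dict(torch.load("vtp_tokenizer.pt", map_location="cpu"))
z = tokenizer.encode(torch.randn(1, 3, 256, 256))
print(z.shape)  # latent grid the downstream DiT consumes unchanged
```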

AIbase believes the release of VTP marks a new stage of system-level optimization in AI generation technology. As the industry shifts from "bigger models are all you need" toward end-to-end collaborative efficiency, this collaboration between MiniMax and Huazhong University of Science and Technology is not only a technical victory but also a concrete practice of the "efficient AI" philosophy: true innovation sometimes lies not in building a bigger engine, but in making every part work smarter together.

Code: https://github.com/MiniMax-AI/VTP

Paper: https://arxiv.org/abs/2512.13687v1