iREPA, the latest work from Xie Saining's team, grew out of a four-month-long Twitter debate. Xie Saining ultimately conceded the argument, but the debate unexpectedly produced an important paper that presents a novel research approach.
The story dates back to August, when a user posted on Twitter that self-supervised learning (SSL) models should focus on dense tasks, which rely on spatial and local information in images, rather than only on global classification performance. Xie Saining disputed this view, arguing that global performance and dense tasks are not directly related.
An enthusiastic discussion followed, during which one participant shared a method that could be compared with REPA. The exchange caught Xie Saining's interest and prompted him to explore the issue in depth. Several months later, he said that he had revised his earlier view, and that the research in this paper offers a new perspective for understanding the generative capabilities of visual encoders.
In the paper, the researchers examined which properties of pre-trained visual encoders determine the performance of generative models. The results showed that spatial structure information, rather than global semantics, is the key factor driving generation quality. The traditional view holds that better global semantic information improves generation, but the study shows that visual encoders with lower classification accuracy often achieve better generation performance.
Building on this finding, the researchers proposed iREPA, a new framework that can be integrated into any representation alignment method with just three lines of code. By modifying REPA, for example by replacing the traditional MLP projection layer with a convolutional layer, the researchers strengthened the spatial structure information and significantly improved generation performance.
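To make the projection-layer change concrete, here is a minimal PyTorch sketch of the difference between a per-token MLP projection and a convolutional one. It is an illustration under stated assumptions, not the paper's code: the class names `MLPProjector` and `ConvProjector`, the 3x3 kernel, the hidden width, and the toy dimensions are all hypothetical, and the patch tokens are assumed to form a square grid.

```python
import math

import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Per-token MLP projection: every patch token is mapped independently."""

    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, in_dim)
        return self.net(tokens)


class ConvProjector(nn.Module):
    """Convolutional projection: each output token also sees its spatial neighbors."""

    def __init__(self, in_dim: int, out_dim: int, kernel_size: int = 3):
        super().__init__()
        # Padding keeps the patch grid the same size after the convolution.
        self.conv = nn.Conv2d(in_dim, out_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, c = tokens.shape
        side = math.isqrt(n)
        assert side * side == n, "this sketch assumes a square patch grid"
        # (batch, num_patches, dim) -> (batch, dim, height, width)
        grid = tokens.transpose(1, 2).reshape(b, c, side, side)
        out = self.conv(grid)
        # Back to (batch, num_patches, out_dim)
        return out.flatten(2).transpose(1, 2)


torch.manual_seed(0)
tokens = torch.randn(2, 16, 8)  # toy input: 4x4 patch grid, 8-dim features
mlp, conv = MLPProjector(8, 12), ConvProjector(8, 12)

with torch.no_grad():
    mlp_out, conv_out = mlp(tokens), conv(tokens)

    # Perturb a single patch (row 1, col 1) and see which outputs move.
    perturbed = tokens.clone()
    perturbed[:, 5] += 1.0
    mlp_changed = (mlp(perturbed) - mlp_out).abs().amax(dim=(0, 2)) > 1e-6
    conv_changed = (conv(perturbed) - conv_out).abs().amax(dim=(0, 2)) > 1e-6

print(int(mlp_changed.sum()), int(conv_changed.sum()))
```

Both projectors return tensors of the same shape, so in this sketch one can stand in for the other wherever the alignment loss is computed. The perturbation check shows the practical difference: with the MLP only the perturbed token's output changes, while with the 3x3 convolution the whole surrounding neighborhood responds, which is one way a projection layer can carry spatial structure.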
This academic discussion not only demonstrated an open and collaborative research atmosphere but also emphasized the importance of acquiring knowledge through communication and experimentation.
