Recently, the Seed team at ByteDance, in collaboration with the University of Hong Kong and Fudan University, introduced an innovative reinforcement learning training method called POLARIS. Through a carefully designed Scaling RL strategy, the method raises the mathematical reasoning ability of small models to a level comparable to that of much larger models, offering a new path for optimizing small models in the field of artificial intelligence.

Experimental results show that Qwen3-4B, a 4-billion-parameter open-source model trained with POLARIS, reached accuracy rates of 79.4% on the AIME25 math test and 81.2% on AIME24, outperforming some larger closed-source models. Notably, the lightweight design of the POLARIS-4B model allows it to be deployed easily on consumer-grade graphics cards, significantly lowering the barrier to adoption.


The core innovation of POLARIS lies in its training strategy. The research team found that tailoring the training data and hyperparameters to the specific model being trained can significantly improve a small model's mathematical reasoning ability. In practice, the team dynamically adjusted the difficulty distribution of the training data, constructing a dataset slightly biased toward hard problems so that sample difficulty does not cluster at either extreme. They also introduced a dynamic data-update strategy that removes overly easy samples in real time, based on the model's performance during training, so that every training step remains informative.
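The sketch below illustrates one way such difficulty-aware data curation could be implemented. The function names, data layout, and the 0.9 pass-rate threshold are illustrative assumptions, not POLARIS's actual code:

```python
import random

def filter_training_pool(problems, rollout_results, easy_threshold=0.9):
    """Drop problems the current policy already solves almost every time.

    `rollout_results` maps a problem id to a list of 0/1 correctness scores
    from the most recent rollouts (a hypothetical layout, for illustration).
    """
    kept = []
    for problem in problems:
        scores = rollout_results.get(problem["id"], [])
        if not scores:
            kept.append(problem)           # never sampled yet: keep it
            continue
        pass_rate = sum(scores) / len(scores)
        if pass_rate < easy_threshold:
            kept.append(problem)           # still challenging for the model
    return kept

def sample_batch(problems, rollout_results, batch_size=128):
    """Draw a batch whose distribution is slightly biased toward harder problems."""
    def hardness(problem):
        scores = rollout_results.get(problem["id"], [1.0])
        return 1.0 - sum(scores) / len(scores)
    # A small floor keeps easier problems from disappearing entirely.
    weights = [0.1 + hardness(problem) for problem in problems]
    return random.choices(problems, weights=weights, k=batch_size)
```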

In terms of sampling control, POLARIS fine-tunes the sampling temperature to balance model performance against the diversity of generated reasoning paths. The research shows that temperature has a significant impact on both, and that a temperature set too high or too low hurts training. The team therefore proposed a temperature initialization method that keeps exploration within a controlled region, and dynamically adjusts the sampling temperature during training to maintain the diversity of generated content.
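As a rough sketch of how temperature initialization and dynamic adjustment could work: pick the lowest starting temperature that already yields sufficiently diverse rollouts, then nudge it during training whenever measured diversity drifts from a target. The `model_sample_fn` callable, candidate temperatures, and diversity targets below are assumptions for illustration only:

```python
def pick_initial_temperature(model_sample_fn, prompts,
                             candidates=(0.6, 0.8, 1.0, 1.2, 1.4),
                             target_diversity=0.7, n_samples=8):
    """Return the lowest candidate temperature whose rollouts reach the target diversity.

    Diversity is measured crudely as the fraction of distinct completions per prompt;
    `model_sample_fn(prompt, temperature=..., n=...)` stands in for the real sampling call.
    """
    for temperature in candidates:
        ratios = []
        for prompt in prompts:
            outputs = model_sample_fn(prompt, temperature=temperature, n=n_samples)
            ratios.append(len(set(outputs)) / len(outputs))
        if sum(ratios) / len(ratios) >= target_diversity:
            return temperature
    return candidates[-1]

def adjust_temperature(current_temperature, measured_diversity,
                       target_diversity=0.7, step=0.05, floor=0.6):
    """Raise the temperature when generations collapse, lower it when they scatter."""
    if measured_diversity < target_diversity:
        return current_temperature + step
    return max(floor, current_temperature - step)
```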

To address the challenges of long-context training, POLARIS introduces length extrapolation techniques. By adjusting the RoPE position encoding, the model can handle sequences longer than those seen during training. This strategy compensates for the shortage of long-text training and improves the model's performance on long-form generation tasks.
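One common family of RoPE adjustments rescales the rotary base so that positions beyond the training length still map to well-separated angles. The simplified sketch below shows that idea; the exact scaling scheme and context lengths used by POLARIS may differ:

```python
import torch

def rope_inverse_frequencies(head_dim, base=10000.0, scale=1.0):
    """Inverse frequencies for rotary position embeddings.

    Enlarging the base (an NTK-style adjustment, shown here only to illustrate
    length extrapolation) slows the rotation of low-frequency dimensions so that
    positions beyond the training length remain distinguishable.
    """
    adjusted_base = base * scale
    return 1.0 / (adjusted_base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rotary_angles(seq_len, head_dim, train_len=4096, target_len=16384):
    """Rotation angles for sequences longer than the training context (illustrative lengths)."""
    scale = max(1.0, target_len / train_len)   # stretch in proportion to the extrapolation
    inv_freq = rope_inverse_frequencies(head_dim, scale=scale)
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)    # shape: (seq_len, head_dim // 2)
```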

Additionally, POLARIS adopts a multi-stage RL training method. In the early stages, training is conducted using a shorter context window, and once the model's performance stabilizes, the context window length is gradually increased. This strategy helps the model adapt step by step to more complex reasoning tasks, enhancing the stability and effectiveness of the training.
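A multi-stage schedule of this kind could be expressed as a stage table plus a promotion rule, as in the hypothetical sketch below. The window sizes, step counts, and plateau test are made-up values, not POLARIS's published recipe:

```python
# Hypothetical stage table: train with a short context first, then lengthen it.
STAGES = [
    {"max_context_tokens": 8192,  "min_steps": 200},
    {"max_context_tokens": 16384, "min_steps": 200},
    {"max_context_tokens": 32768, "min_steps": 200},
]

def next_stage(stage_idx, steps_in_stage, reward_history, patience=20, tol=1e-3):
    """Advance to a longer context window once the current stage has run long enough
    and the reward curve has stabilized (a simple plateau test on recent rewards)."""
    if stage_idx >= len(STAGES) - 1:
        return stage_idx
    if steps_in_stage < STAGES[stage_idx]["min_steps"]:
        return stage_idx
    recent = reward_history[-patience:]
    plateaued = len(recent) == patience and (max(recent) - min(recent)) < tol
    return stage_idx + 1 if plateaued else stage_idx
```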

Currently, the detailed training methods, training data, training code, and trained models of POLARIS have all been open-sourced. The research team has verified the effectiveness of POLARIS on multiple mainstream reasoning benchmarks; models of different sizes and from different model families show significant performance gains after applying the POLARIS training recipe.

GitHub Homepage:

https://github.com/ChenxinAn-fdu/POLARIS

Hugging Face Homepage:

https://huggingface.co/POLARIS-Project