On May 28, the NVIDIA research team open-sourced Polar, a reinforcement learning training framework. Its core innovation is that it can plug existing mainstream coding agents such as Codex, Claude Code, and Qwen Code into GRPO (Group Relative Policy Optimization) reinforcement learning training without modifying any of their original code.
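
For readers unfamiliar with GRPO, its central idea is to score each rollout relative to the other rollouts sampled for the same task rather than against a learned value function. The sketch below illustrates only that normalization step. The function name, the epsilon term, and the use of the population standard deviation are assumptions made for illustration, not details taken from Polar.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (illustrative only)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task; reward 1.0 if the patch passed the tests, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every rollout gets the same reward, no rollout is preferred.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages, so the policy is pushed toward whatever the better attempts in each group did.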

I. Industry Pain Points: The "Wall" of Agent Reinforcement Learning
As coding agents move from simple single-step tasks to complex long-horizon tasks (such as repository-level modifications and OS interactions), developers increasingly rely on mature execution frameworks, known as harnesses. However, integrating these harnesses into traditional reinforcement learning infrastructure faces significant barriers:
High Integration Cost: Traditional methods require rewriting the harness logic against standard environment interfaces such as env.init() and env.step(), which is tedious and error-prone.
Information Loss: During this refactoring, key details such as tool calls, multi-turn dialogue context, and sub-agent collaboration logic are often lost, so the model does not receive high-quality training signals.
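
To make the integration cost concrete, here is a minimal sketch of what an environment in the env.init() / env.step() style looks like. ToyCodeEnv, its fields, and its reward rule are hypothetical and exist only for illustration. A real harness would need its whole agent loop, tool calls, and context handling re-expressed in this shape.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCodeEnv:
    """Hypothetical environment in the env.init() / env.step() style."""
    task: str
    max_turns: int = 3
    history: list = field(default_factory=list)

    def init(self):
        # Reset the episode and return the first observation.
        self.history = [("user", self.task)]
        return self.task

    def step(self, action):
        # Every agent action has to be funneled through this one method.
        self.history.append(("assistant", action))
        done = action == "submit" or len(self.history) > self.max_turns
        reward = 1.0 if action == "submit" else 0.0  # toy reward rule
        observation = "" if done else "tool output placeholder"
        return observation, reward, done


env = ToyCodeEnv(task="fix the failing test")
env.init()
obs, reward, done = env.step("run tests")
final = env.step("submit")
```

Anything the harness does that does not fit the single step() call, such as sub-agent collaboration, has to be flattened or dropped, which is where the information loss described above comes from.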

II. Core Solution: Treat the "Boundary" as the Training Entry Point
Polar does not require rewriting the execution framework. Instead, it treats the "model API boundary" as the entry point for training.
Black-box Processing: Polar places a transparent proxy (Gateway) between the code execution framework and the model inference server. Whether the agent speaks the Anthropic, OpenAI, or Google API, Polar intercepts and forwards its requests without changes to the agent.
Trace Reconstruction: During forwarding, Polar records key information such as prompts, sampled tokens, and log probabilities in real time, and rebuilds it into the "trace" data required by the reinforcement learning trainer.
Efficient Asynchronous Architecture: The system uses a Rollout Server for scheduling and persistence, while Gateway Nodes handle lifecycle management and resource recycling. A pre-warmed buffer (READY buffer) and parallel task processing keep long-tail tasks from blocking GPU training.
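
The interception idea can be sketched without any networking. In the toy example below, a gateway object forwards each prompt to a model backend, returns the response unchanged, and records the prompt, sampled tokens, and log probabilities per rollout. TraceGateway, Turn, fake_backend, and the response dictionary layout are all hypothetical names chosen for this sketch. They do not reflect Polar's actual code or any vendor's API schema.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Turn:
    prompt: str
    tokens: list
    logprobs: list


@dataclass
class TraceGateway:
    """Hypothetical proxy: forwards model calls and records what the trainer needs."""
    backend: Callable[[str], dict]
    sessions: dict = field(default_factory=dict)

    def forward(self, session_id, prompt):
        response = self.backend(prompt)
        self.sessions.setdefault(session_id, []).append(
            Turn(prompt, response["tokens"], response["logprobs"])
        )
        return response  # the agent receives the unmodified response

    def trace(self, session_id):
        # All recorded turns for one rollout, in request order.
        return self.sessions.get(session_id, [])


def fake_backend(prompt):
    # Stand-in for an inference server (assumption for this sketch).
    tokens = ["ls", "\n"]
    return {"text": "".join(tokens), "tokens": tokens, "logprobs": [-0.1, -0.5]}


gateway = TraceGateway(fake_backend)
reply = gateway.forward("rollout-1", "List the files.")
gateway.forward("rollout-1", "List the files.\nls\nREADME.md\nNow open it.")
trace = gateway.trace("rollout-1")
```

Because the recording happens at the API boundary, the agent code never needs to know it is being trained. It simply points its model endpoint at the gateway.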

III. Performance Leap: Transforming Code Agents
Experimental data shows that Polar combined with GRPO training brings significant performance improvements:
SWE-Bench Verified Benchmark: With the same Qwen3.5-4B base model, the gains differ by code framework:
Codex Framework: pass@1 rises from 3.8% to 26.4%, a gain of 22.6 percentage points (a relative increase of 594.74%).
Claude Code Framework: from 29.8% to 34.6%.
Pi Framework: from 34.2% to 40.4%.
Extreme Efficiency: With the prefix_merging strategy, training wall-clock time is reduced by a factor of about 5.39 compared with the traditional per_request mode, and GPU utilization rises from 20.4% to 87.7%.
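
The figures above mix absolute and relative measures, so a quick calculation helps compare them. The snippet below uses only the numbers reported in this article. The helper name relative_gain is introduced here for illustration.

```python
def relative_gain(before, after):
    """Relative improvement in percent, e.g. 10 -> 15 is a 50% gain."""
    return (after - before) / before * 100.0


# pass@1 before and after training, as reported above.
results = {
    "Codex": (3.8, 26.4),
    "Claude Code": (29.8, 34.6),
    "Pi": (34.2, 40.4),
}

for name, (before, after) in results.items():
    points = after - before
    print(f"{name}: +{points:.1f} points, {relative_gain(before, after):.2f}% relative")

# A 5.39x speedup means the run takes roughly this fraction of the original time.
time_fraction = 1.0 / 5.39

# Ratio between the reported GPU utilization figures.
utilization_ratio = 87.7 / 20.4
```

The Codex line reproduces the 594.74% relative figure. The large relative jump there reflects a very low starting point, while the Claude Code and Pi frameworks start much higher and gain 4.8 and 6.2 percentage points respectively.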

Industry Commentary
By open-sourcing Polar, NVIDIA has in effect built a "highway" that connects AI agents to reinforcement learning training. It lets researchers train efficiently with the many existing open-source code frameworks, and its system-level optimizations lower the GPU compute barrier.
As Polar gains adoption, developers will have less reason to worry about "how to adapt models to training frameworks," and the evolution path of AI coding agents should become more standardized and efficient. This marks a shift in AI agent training from manual tuning in laboratories to large-scale, systematic engineering production.
Paper URL: https://arxiv.org/pdf/2605.24220
