The Baoling large model series under Ant Group received a major update today: Ling-2.6-flash is now officially open to developers worldwide. To suit different hardware environments and lower the deployment barrier, the model ships in multiple precision variants, including BF16, FP8, and INT4, giving developers more flexible inference options.
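The practical difference between the precision variants is weight storage. A back-of-the-envelope sketch, using the article's 104B total-parameter figure and the standard bytes-per-parameter for each format (it ignores KV cache, activations, and runtime overhead, which add to the real footprint):

```python
# Rough GPU-memory estimate for the model weights at each released
# precision. The 104B total-parameter count comes from the article;
# bytes-per-parameter values are standard for each format. This is a
# sketch, not a measured deployment footprint.

TOTAL_PARAMS = 104e9  # 104B parameters (from the article)

BYTES_PER_PARAM = {
    "BF16": 2.0,   # 16-bit brain float
    "FP8": 1.0,    # 8-bit float
    "INT4": 0.5,   # 4-bit integer, packed two per byte
}

def weight_footprint_gb(precision: str) -> float:
    """Approximate weight storage in gigabytes for a given precision."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / 1e9

for p in BYTES_PER_PARAM:
    print(f"{p}: ~{weight_footprint_gb(p):.0f} GB of weights")
```

Roughly, INT4 quarters the weight memory relative to BF16, which is what makes deployment on smaller GPU configurations feasible.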

An Instruct model with 104B total parameters and 7.4B activated parameters, Ling-2.6-flash was previously tested on the OpenRouter platform under the anonymous codename "Elephant Alpha". During a two-week trial period, the development team collected a large amount of real-world feedback and made targeted optimizations, significantly improving the smoothness of natural switching between Chinese and English and strengthening compatibility with mainstream programming frameworks.


Technical Highlights: Hybrid Architecture and Extreme Efficiency

Ling-2.6-flash's core competitiveness lies in its unique architecture design and high operational efficiency:

  • Hybrid Linear Architecture: Thanks to low-level computational optimization, the model delivers excellent inference speed. On 4 H20 cards, decoding can reach up to 340 tokens/s, and Prefill (context pre-fill) throughput reaches 2.2 times that of Nemotron-3-Super, significantly reducing response latency.

  • Outstanding "Smart Efficiency Ratio": The team calibrated token efficiency in depth during training. According to the evaluation data, completing tasks at the same quality level costs Ling-2.6-flash only about 15M tokens, roughly one-tenth of comparable competitors, greatly reducing commercial costs.
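The two claims above compound: a smaller token budget and a faster decode rate both shrink wall-clock time. A quick arithmetic sketch, taking the 340 tokens/s and ~15M-token figures from the article and treating the "10x" competitor budget as illustrative rather than a measured benchmark:

```python
# Back-of-the-envelope check of the efficiency claims. The 340 tokens/s
# decode speed (4x H20) and the ~15M-token task budget come from the
# article; the 10x competitor budget reflects the "one-tenth" claim.
# Pure arithmetic, not a benchmark.

DECODE_TOKENS_PER_S = 340                   # reported decode speed
TASK_TOKENS_LING = 15_000_000               # ~15M tokens (reported)
TASK_TOKENS_OTHER = TASK_TOKENS_LING * 10   # "one-tenth of competitors"

def wall_clock_hours(tokens: int, tok_per_s: float) -> float:
    """Hours of pure decode time to emit `tokens` at `tok_per_s`."""
    return tokens / tok_per_s / 3600

ling_h = wall_clock_hours(TASK_TOKENS_LING, DECODE_TOKENS_PER_S)
other_h = wall_clock_hours(TASK_TOKENS_OTHER, DECODE_TOKENS_PER_S)
print(f"Ling-2.6-flash budget: ~{ling_h:.1f} h of decode time")
print(f"10x token budget:      ~{other_h:.1f} h at the same speed")
```

At the reported speed, the ~15M-token budget works out to roughly half a day of pure decode time, versus several days for a ten-times-larger budget.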

Scenario Deepening: Targeted Enhancement of Agent Capabilities

For agent scenarios, among the most widely used applications of large models, Ling-2.6-flash has received specialized enhancement. It performs stably across complex tool calls, multi-step logical planning, and final task execution. In several industry-standard evaluations such as BFCL-V4 and SWE-bench, it maintains comparable or even state-of-the-art (SOTA) results, even against models with larger activated parameter counts.
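The tool-calling loop referenced above can be sketched minimally: the model emits a structured call, the agent runtime dispatches it to a registered function, and the result is fed back for the next planning step. The tool names, JSON shape, and `run_tool` helper below are illustrative assumptions, not Ling-2.6-flash's actual API:

```python
# Minimal sketch of an agent tool-dispatch step. Everything here
# (registry, message shape) is a hypothetical illustration of the
# pattern, not the model's real tool-calling protocol.
import json

# Hypothetical tool registry: tool name -> callable taking an args dict.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
    "upper": lambda args: args["text"].upper(),
}

def run_tool(call_json: str) -> dict:
    """Dispatch one model-emitted tool call, given as a JSON string."""
    call = json.loads(call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return {"error": f"unknown tool {call['name']!r}"}
    return {"result": fn(call["arguments"])}

# A call the model might emit during multi-step planning:
print(run_tool('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
```

Benchmarks like BFCL-V4 essentially score how reliably a model produces well-formed calls like the one above across many tools and multi-turn plans.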

Currently, developers can access the model's open-source resources through Hugging Face and ModelScope (the MoDa community) and explore its potential across industry applications.