China's large-model race saw a new development today: Step Zenith officially released and open-sourced its new Flash model, Step 3.7 Flash. The model is designed for the industrialization of agents and has been optimized at the system level for code writing, online search, and multimodal workflows.

Architecturally, the model uses a sparse mixture-of-experts (MoE) design with 196B total parameters. It can generate up to 400 tokens per second, which significantly reduces waiting time in high-frequency, multi-turn dialogues.
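
To put the 400 tokens-per-second figure in perspective, here is a rough back-of-the-envelope sketch. It assumes a steady decode rate at the quoted peak and ignores prompt processing and network latency, so real waits will be longer. The function names are illustrative only and are not part of any Step API.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float = 400.0) -> float:
    """Time to emit num_tokens at a steady decode rate.

    Assumption: the rate is constant at the quoted peak (400 tokens/s);
    prefill time and network latency are not modeled.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def session_seconds(turn_lengths, tokens_per_second: float = 400.0) -> float:
    """Total generation time across the replies of a multi-turn dialogue."""
    return sum(generation_seconds(n, tokens_per_second) for n in turn_lengths)


if __name__ == "__main__":
    # A 1,200-token reply at the quoted peak rate.
    print(generation_seconds(1200))
    # Three replies of 400, 800, and 1,200 tokens in one session.
    print(session_seconds([400, 800, 1200]))
```

Under these assumptions a 1,200-token reply streams in about three seconds, which is why throughput matters most in agent loops that make many short calls in a row.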


Native Multimodal and Search Enhancement

Step 3.7 Flash has native multimodal understanding: it can directly recognize and parse complex visual information such as user interfaces, charts, and documents. This lets the model quickly convert visual content into structured data and even generate the corresponding executable code directly.

The new model also significantly improves online retrieval and image search. In an open network environment it can actively gather evidence from multiple sources across text and images, then cross-check those sources against one another to verify the accuracy of what it finds.
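
The article does not say how the cross-comparison is done. As a purely illustrative sketch of the general idea, the snippet below accepts a retrieved value only when enough independent sources agree on it. The `cross_check` function, the normalization step, and the agreement threshold are all assumptions made for illustration, not the model's actual mechanism.

```python
from collections import Counter


def cross_check(claims, min_agreement: int = 2):
    """Return the value reported by at least min_agreement distinct sources.

    claims: iterable of (source, value) pairs. This is a hypothetical
    illustration of multi-source verification, not Step 3.7 Flash's method.
    Returns None when no value reaches the agreement threshold.
    """
    seen = set()
    counts = Counter()
    for source, value in claims:
        normalized = value.strip().lower()  # crude normalization (assumption)
        if (source, normalized) in seen:
            continue  # count each source at most once per value
        seen.add((source, normalized))
        counts[normalized] += 1
    if not counts:
        return None
    best, votes = counts.most_common(1)[0]
    return best if votes >= min_agreement else None


if __name__ == "__main__":
    evidence = [
        ("site-a", "196B"),
        ("site-b", "196b "),
        ("image-caption", "200B"),
    ]
    print(cross_check(evidence))
```

A real system would need far more careful normalization and source weighting; the point here is only that agreement across sources, rather than any single hit, is what gets trusted.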

Highly Reliable Orchestration and Ecosystem Compatibility

For complex tasks, the model is highly stable in long-chain, multi-turn agent workflows. It can reliably drive APIs, browsers, terminals, and Office tools, reducing the likelihood that a task fails or drifts off course during execution.
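
To make "long-chain" concrete: in a typical agent harness, each step names a tool and its arguments, and the harness dispatches the call and retries on transient failure. The following is a minimal, generic sketch of such a dispatcher. The `run_step` function, the tool registry, and the retry policy are hypothetical and do not come from Step 3.7 Flash or any specific framework.

```python
from typing import Any, Callable, Dict


def run_step(
    tools: Dict[str, Callable[..., Any]],
    name: str,
    args: Dict[str, Any],
    max_retries: int = 2,
) -> Dict[str, Any]:
    """Dispatch one tool call with bounded retries (hypothetical harness code)."""
    if name not in tools:
        # An unknown tool name is reported back instead of crashing the chain.
        return {"ok": False, "error": f"unknown tool: {name}"}
    last_error = ""
    for attempt in range(max_retries + 1):
        try:
            result = tools[name](**args)
            return {"ok": True, "result": result, "attempts": attempt + 1}
        except Exception as exc:  # broad catch is acceptable for a sketch
            last_error = str(exc)
    return {"ok": False, "error": last_error, "attempts": max_retries + 1}


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_double(x: int) -> int:
        """Fails on the first call, succeeds afterwards."""
        calls["n"] += 1
        if calls["n"] < 2:
            raise RuntimeError("transient failure")
        return x * 2

    print(run_step({"double": flaky_double}, "double", {"x": 21}))
```

The fewer malformed or misdirected calls a model emits, the less often a loop like this has to fall back on retries, which is the practical meaning of stability in a long tool chain.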

The model has also been tuned for compatibility with today's mainstream agent development frameworks and tool-calling protocols. This lowers the barrier to entry for developers and makes workflow orchestration and deployment within existing ecosystems more efficient.