AMD has officially released vLLM-ATOM, a plugin designed for deploying large language models. The plugin aims to significantly improve the inference performance of mainstream Chinese large models such as DeepSeek-R1 and Kimi-K2 on AMD hardware, without requiring changes to existing workflows.

vLLM is an open-source inference framework built for high-concurrency serving, best known for its efficient GPU memory utilization (via PagedAttention) and high throughput. AMD's plugin adds an optimization path tailored to its Instinct series GPUs, allowing developers to migrate with minimal learning overhead.


Smooth Performance Upgrades

The core advantage of vLLM-ATOM is its "zero-cost" deployment: users do not need to modify their existing APIs or end-to-end workflows. The plugin transparently takes over request scheduling and kernel tuning in the background, so existing services can migrate smoothly to the AMD hardware backend.
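
As a concrete illustration of the unchanged workflow, here is a minimal client sketch. It assumes a vLLM server is already listening on the default port; the endpoint, API key, and model name are illustrative and not taken from AMD's announcement. The point is that nothing in the client is plugin-specific.

```python
# Illustrative call against vLLM's OpenAI-compatible endpoint.
# The same code runs unchanged whether or not vLLM-ATOM is installed,
# which is what the "zero-cost" migration claim refers to.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM ignores the key unless auth is configured
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",      # model name as served; illustrative
    messages=[{"role": "user", "content": "Summarize paged attention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```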

From an architectural perspective, the plugin has three layers: the top layer provides compatibility with the OpenAI-style interface, the middle layer handles model implementations and routing, and the bottom layer supplies the core GPU kernels. This structure cleanly integrates mixture-of-experts (MoE) models and quantization techniques, supporting large-scale deployment.
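
The announcement does not describe how the plugin hooks into vLLM, but vLLM does expose a documented out-of-tree plugin mechanism based on Python entry points, which a backend plugin like this could plausibly use. The sketch below assumes that mechanism; the package and class names are hypothetical.

```python
# Hedged sketch of vLLM's platform-plugin mechanism (names illustrative).
# vLLM discovers plugins through Python entry points; a platform plugin
# exposes a register() function returning the fully qualified name of a
# Platform class, which supplies backend-specific kernels and scheduling.

# In the plugin package, e.g. vllm_atom/__init__.py (hypothetical):
def register() -> str:
    # "vllm_atom.platform.AtomPlatform" is a hypothetical module path.
    return "vllm_atom.platform.AtomPlatform"

# And in the plugin's pyproject.toml (shown here as a comment):
#   [project.entry-points."vllm.platform_plugins"]
#   atom = "vllm_atom:register"
```

With this kind of registration, simply installing the plugin package is enough for vLLM to pick up the new backend, which is consistent with the article's "automatic takeover" description.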

Broad Hardware and Model Compatibility

The plugin targets AMD's high-performance Instinct MI350 and MI400 series GPUs. It supports mainstream Chinese large language models such as Qwen3 and GLM, and covers a range of model architectures, including dense models, mixture-of-experts (MoE) models, and vision-language models (VLMs).
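
For reference, serving any of these models through vLLM's standard Python API looks like the following minimal sketch. The model choice is one of those named in the article, picked for illustration only; running it on an Instinct GPU with the plugin enabled is the scenario the article describes, not something this snippet configures.

```python
# Minimal offline-inference sketch with vLLM's standard Python API.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # illustrative model; hardware backend is resolved by vLLM
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Give a one-line summary of MoE models."], params)
print(outputs[0].outputs[0].text)
```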