The xLLM community, founded only three months ago, has announced its first offline Meetup on December 6th under the theme "Building an Open Source AI Infra Ecosystem." The event will showcase the self-developed inference engine xLLM-Core and publish comparison data: across three task types, MoE, text-to-image, and text-to-video, P99 latency on the same GPU stays below 20 ms, on average 42% lower than vLLM, with throughput up 2.1×.
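For context on how figures like these are derived: P99 latency is the 99th percentile of per-request latencies, and the quoted improvements are simple ratios against the baseline. The sketch below shows the arithmetic; the sample data and helper names are hypothetical and not taken from xLLM's benchmark scripts.

```python
import numpy as np

def p99_latency_ms(latencies_ms):
    """99th-percentile latency over per-request latency samples (ms)."""
    return float(np.percentile(latencies_ms, 99))

def reduction(baseline, candidate):
    """Fraction by which `candidate` is lower than `baseline` (0.42 == 42%)."""
    return (baseline - candidate) / baseline

def speedup(candidate_rps, baseline_rps):
    """Throughput ratio (2.1 means 2.1x the baseline)."""
    return candidate_rps / baseline_rps

# Hypothetical samples for illustration only -- not the published measurements.
rng = np.random.default_rng(0)
xllm_ms = rng.gamma(shape=4.0, scale=3.0, size=10_000)
vllm_ms = rng.gamma(shape=4.0, scale=5.5, size=10_000)

print(f"xLLM P99: {p99_latency_ms(xllm_ms):.1f} ms")
print(f"vLLM P99: {p99_latency_ms(vllm_ms):.1f} ms")
print(f"P99 reduction vs. vLLM: "
      f"{reduction(p99_latency_ms(vllm_ms), p99_latency_ms(xllm_ms)):.0%}")
print(f"Throughput speedup: {speedup(42_000, 20_000):.1f}x")  # hypothetical req/s figures
```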

Technical Highlights  

Unified Computation Graph: Abstracts language, vision, and video generation into a single "Token-in Token-out" graph, so one engine can run multimodal workloads in parallel (see the interface sketch after this list)

Mooncake KV Cache Integration: A 99.2% hit rate across three storage tiers (GPU memory → DDR → NVMe), with cache-penetration latency under 5 ms (see the tiered-lookup sketch below)

Dynamic Shape Batch Processing: Batches images from 512×512 to 2048×2048 and videos from 8 to 128 frames on the fly, cutting memory fragmentation by 38% (see the bucketing sketch below)

Plugin-based Backend: Compatible with CUDA, ROCm, and MTIA; Apple Silicon and Intel Arc are on the roadmap for Q1 2026
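
To make the "Token-in Token-out" abstraction from the first highlight concrete: once language, image, and video requests are all encoded as token sequences, a single engine can schedule them through the same graph. The sketch below is a minimal illustration under that assumption; the names (TokenRequest, UnifiedGraph, decode_step) are hypothetical and not part of the xLLM-Core API.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical types -- not the xLLM-Core API.
@dataclass
class TokenRequest:
    modality: str          # "text", "image", or "video"
    tokens: List[int]      # every modality is encoded to discrete tokens up front

class UnifiedGraph:
    """One compute graph: tokens in, tokens out, regardless of modality."""

    def __init__(self, decode_step):
        self.decode_step = decode_step   # shared decoder callable

    def run_batch(self, requests: List[TokenRequest]) -> List[List[int]]:
        # Because every request is already "just tokens", text, image, and
        # video work can be scheduled into the same batch.
        return [self.decode_step(r.tokens) for r in requests]

# Usage: the same engine instance serves all three modalities.
engine = UnifiedGraph(decode_step=lambda toks: toks[-8:])   # stand-in decoder
batch = [
    TokenRequest("text",  [101, 2009, 2003]),
    TokenRequest("image", list(range(256))),      # e.g. VQ image tokens
    TokenRequest("video", list(range(1024))),     # e.g. tokenized frames
]
outputs = engine.run_batch(batch)
print([len(o) for o in outputs])
```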
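The Mooncake integration describes a three-tier lookup (GPU memory → DDR → NVMe); a miss on all three tiers is the "cache penetration" case. The following is a schematic tiered LRU cache in that spirit, not Mooncake's actual implementation.

```python
from collections import OrderedDict

class TieredKVCache:
    """Schematic three-tier KV cache: GPU memory -> host DDR -> NVMe."""

    def __init__(self, gpu_slots, ddr_slots):
        self.tiers = [OrderedDict(), OrderedDict(), OrderedDict()]  # gpu, ddr, nvme
        self.capacity = [gpu_slots, ddr_slots, float("inf")]

    def get(self, block_id):
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                value = tier.pop(block_id)
                self._put(0, block_id, value)   # promote hot block to the GPU tier
                return value                    # hit: served from tier `level`
        return None                             # miss on all tiers: cache penetration

    def put(self, block_id, value):
        self._put(0, block_id, value)

    def _put(self, level, block_id, value):
        tier = self.tiers[level]
        tier[block_id] = value
        tier.move_to_end(block_id)
        if len(tier) > self.capacity[level]:            # evict the LRU block
            old_id, old_val = tier.popitem(last=False)  # ...one level down
            if level + 1 < len(self.tiers):
                self._put(level + 1, old_id, old_val)

# Usage sketch: KV blocks spill from GPU to DDR to NVMe as capacity runs out.
cache = TieredKVCache(gpu_slots=2, ddr_slots=4)
for i in range(10):
    cache.put(f"prefix-{i}", b"kv-block-bytes")
assert cache.get("prefix-9") is not None   # recent block: GPU hit
assert cache.get("prefix-0") is not None   # older block: served from a lower tier
```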
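Dynamic-shape batching of mixed resolutions and frame counts is commonly done by padding each request up to a supported shape bucket, so every batch is one dense tensor and the allocator sees fewer odd-sized blocks. The bucketing policy below (resolution and frame steps included) is an illustrative assumption, not xLLM-Core's scheduler.

```python
from collections import defaultdict

# Hypothetical bucketing policy: round each request up to the nearest
# supported resolution / frame count, then batch within a bucket.
RESOLUTIONS = [512, 768, 1024, 1536, 2048]   # square sides, 512x512 .. 2048x2048
FRAME_STEPS = [1, 8, 16, 32, 64, 128]        # 1 frame for still images

def bucket_key(width, height, frames):
    side = max(width, height)
    res = next(r for r in RESOLUTIONS if r >= side)
    frm = next(f for f in FRAME_STEPS if f >= frames)
    return (res, frm)

def build_batches(requests, max_batch=8):
    """Group requests by padded shape so each batch is a single dense tensor."""
    buckets = defaultdict(list)
    for req in requests:
        buckets[bucket_key(req["w"], req["h"], req["frames"])].append(req)
    batches = []
    for key, reqs in buckets.items():
        for i in range(0, len(reqs), max_batch):
            batches.append((key, reqs[i:i + max_batch]))
    return batches

# Usage: images are single-frame requests; videos carry 8-128 frames.
reqs = [
    {"w": 512,  "h": 512,  "frames": 1},
    {"w": 1920, "h": 1080, "frames": 1},
    {"w": 640,  "h": 480,  "frames": 24},
    {"w": 1280, "h": 720,  "frames": 96},
]
for (res, frm), group in build_batches(reqs):
    print(f"bucket {res}x{res}, {frm} frame(s) -> {len(group)} request(s)")
```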
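A plugin-based backend layer usually amounts to a registry mapping a device name to an implementation, so adding Apple Silicon or Intel Arc later means registering one more plugin. The registry below is a generic sketch; the backend names mirror the announcement, but none of this is xLLM-Core's actual extension API.

```python
from typing import Callable, Dict

_BACKENDS: Dict[str, Callable[[], "Backend"]] = {}   # name -> backend factory

class Backend:
    name = "base"
    def matmul(self, a, b):
        raise NotImplementedError

def register_backend(name):
    def wrap(factory):
        _BACKENDS[name] = factory
        return factory
    return wrap

def create_backend(name) -> Backend:
    if name not in _BACKENDS:
        raise ValueError(f"no backend plugin registered for {name!r}")
    return _BACKENDS[name]()

@register_backend("cuda")
class CudaBackend(Backend):
    name = "cuda"
    def matmul(self, a, b):
        ...  # would dispatch to CUDA kernels

@register_backend("rocm")
class RocmBackend(Backend):
    name = "rocm"
    def matmul(self, a, b):
        ...  # would dispatch to ROCm kernels

# "mtia", "apple-silicon", "intel-arc" plugins would register the same way.
backend = create_backend("cuda")
print(type(backend).__name__)   # CudaBackend
```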

Key Cases  

At the Meetup, Professor Yang Hailong of Beihang University will share the JD.com 11.11 (Singles' Day) deployment: xLLM-Core sustained a peak of 40k requests per second, cut machine costs by 90%, and improved business efficiency fivefold.

Open Source Plan  

xLLM-Core 0.9 (Apache 2.0) will be released on-site, including Docker images, Python/C++ APIs, and benchmark scripts; the community plans to ship 1.0 LTS in June 2026 with long-term maintenance and commercial support.

Registration is now open on the xLLM official website, with 300 on-site seats and a live stream for online participants.