The xLLM community, founded only three months ago, has announced its first offline Meetup on December 6th under the theme "Building an Open Source AI Infra Ecosystem." The event will showcase the self-developed inference engine xLLM-Core and publish comparison data: across three task types, MoE, text-to-image, and text-to-video, P99 latency on the same GPU stays below 20 ms, on average 42% lower than vLLM, with throughput up 2.1×.
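For context on how figures like these are derived: P99 latency is the 99th percentile of per-request latencies, and the quoted improvements are simple ratios against the baseline. The sketch below shows the arithmetic; the sample data and helper names are hypothetical and not taken from xLLM's benchmark scripts.

```python
import numpy as np

def p99_latency_ms(latencies_ms):
    """99th-percentile latency over per-request latency samples (ms)."""
    return float(np.percentile(latencies_ms, 99))

def reduction(baseline, candidate):
    """Fraction by which `candidate` is lower than `baseline` (0.42 == 42%)."""
    return (baseline - candidate) / baseline

def speedup(candidate_rps, baseline_rps):
    """Throughput ratio (2.1 means 2.1x the baseline)."""
    return candidate_rps / baseline_rps

# Hypothetical samples for illustration only -- not the published measurements.
rng = np.random.default_rng(0)
xllm_ms = rng.gamma(shape=4.0, scale=3.0, size=10_000)
vllm_ms = rng.gamma(shape=4.0, scale=5.5, size=10_000)

print(f"xLLM P99: {p99_latency_ms(xllm_ms):.1f} ms")
print(f"vLLM P99: {p99_latency_ms(vllm_ms):.1f} ms")
print(f"P99 reduction vs. vLLM: "
      f"{reduction(p99_latency_ms(vllm_ms), p99_latency_ms(xllm_ms)):.0%}")
print(f"Throughput speedup: {speedup(42_000, 20_000):.1f}x")  # hypothetical req/s figures
```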

Technical Highlights  

Unified Computation Graph: Abstracts language, vision, and video generation into a single "Token-in Token-out" graph, so one engine can run multimodal workloads in parallel (see the interface sketch after this list)

Mooncake KV Cache Integration: A 99.2% hit rate across three storage tiers (GPU memory → DDR → NVMe), with cache-penetration latency under 5 ms (see the tiered-lookup sketch below)

Dynamic Shape Batch Processing: Batches images from 512×512 to 2048×2048 and videos from 8 to 128 frames on the fly, cutting memory fragmentation by 38% (see the bucketing sketch below)

Plugin-based Backend: Compatible with CUDA, ROCm, and MTIA; Apple Silicon and Intel Arc are on the roadmap for Q1 2026
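
To make the "Token-in Token-out" abstraction from the first highlight concrete: once language, image, and video requests are all encoded as token sequences, a single engine can schedule them through the same graph. The sketch below is a minimal illustration under that assumption; the names (TokenRequest, UnifiedGraph, decode_step) are hypothetical and not part of the xLLM-Core API.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical types -- not the xLLM-Core API.
@dataclass
class TokenRequest:
    modality: str          # "text", "image", or "video"
    tokens: List[int]      # every modality is encoded to discrete tokens up front

class UnifiedGraph:
    """One compute graph: tokens in, tokens out, regardless of modality."""

    def __init__(self, decode_step):
        self.decode_step = decode_step   # shared decoder callable

    def run_batch(self, requests: List[TokenRequest]) -> List[List[int]]:
        # Because every request is already "just tokens", text, image, and
        # video work can be scheduled into the same batch.
        return [self.decode_step(r.tokens) for r in requests]

# Usage: the same engine instance serves all three modalities.
engine = UnifiedGraph(decode_step=lambda toks: toks[-8:])   # stand-in decoder
batch = [
    TokenRequest("text",  [101, 2009, 2003]),
    TokenRequest("image", list(range(256))),      # e.g. VQ image tokens
    TokenRequest("video", list(range(1024))),     # e.g. tokenized frames
]
outputs = engine.run_batch(batch)
print([len(o) for o in outputs])
```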
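The Mooncake integration describes a three-tier lookup (GPU memory → DDR → NVMe); a miss on all three tiers is the "cache penetration" case. The following is a schematic tiered LRU cache in that spirit, not Mooncake's actual implementation.

```python
from collections import OrderedDict

class TieredKVCache:
    """Schematic three-tier KV cache: GPU memory -> host DDR -> NVMe."""

    def __init__(self, gpu_slots, ddr_slots):
        self.tiers = [OrderedDict(), OrderedDict(), OrderedDict()]  # gpu, ddr, nvme
        self.capacity = [gpu_slots, ddr_slots, float("inf")]

    def get(self, block_id):
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                value = tier.pop(block_id)
                self._put(0, block_id, value)   # promote hot block to the GPU tier
                return value                    # hit: served from tier `level`
        return None                             # miss on all tiers: cache penetration

    def put(self, block_id, value):
        self._put(0, block_id, value)

    def _put(self, level, block_id, value):
        tier = self.tiers[level]
        tier[block_id] = value
        tier.move_to_end(block_id)
        if len(tier) > self.capacity[level]:            # evict the LRU block
            old_id, old_val = tier.popitem(last=False)  # ...one level down
            if level + 1 < len(self.tiers):
                self._put(level + 1, old_id, old_val)

# Usage sketch: KV blocks spill from GPU to DDR to NVMe as capacity runs out.
cache = TieredKVCache(gpu_slots=2, ddr_slots=4)
for i in range(10):
    cache.put(f"prefix-{i}", b"kv-block-bytes")
assert cache.get("prefix-9") is not None   # recent block: GPU hit
assert cache.get("prefix-0") is not None   # older block: served from a lower tier
```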
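Dynamic-shape batching of mixed resolutions and frame counts is commonly done by padding each request up to a supported shape bucket, so every batch is one dense tensor and the allocator sees fewer odd-sized blocks. The bucketing policy below (resolution and frame steps included) is an illustrative assumption, not xLLM-Core's scheduler.

```python
from collections import defaultdict

# Hypothetical bucketing policy: round each request up to the nearest
# supported resolution / frame count, then batch within a bucket.
RESOLUTIONS = [512, 768, 1024, 1536, 2048]   # square sides, 512x512 .. 2048x2048
FRAME_STEPS = [1, 8, 16, 32, 64, 128]        # 1 frame for still images

def bucket_key(width, height, frames):
    side = max(width, height)
    res = next(r for r in RESOLUTIONS if r >= side)
    frm = next(f for f in FRAME_STEPS if f >= frames)
    return (res, frm)

def build_batches(requests, max_batch=8):
    """Group requests by padded shape so each batch is a single dense tensor."""
    buckets = defaultdict(list)
    for req in requests:
        buckets[bucket_key(req["w"], req["h"], req["frames"])].append(req)
    batches = []
    for key, reqs in buckets.items():
        for i in range(0, len(reqs), max_batch):
            batches.append((key, reqs[i:i + max_batch]))
    return batches

# Usage: images are single-frame requests; videos carry 8-128 frames.
reqs = [
    {"w": 512,  "h": 512,  "frames": 1},
    {"w": 1920, "h": 1080, "frames": 1},
    {"w": 640,  "h": 480,  "frames": 24},
    {"w": 1280, "h": 720,  "frames": 96},
]
for (res, frm), group in build_batches(reqs):
    print(f"bucket {res}x{res}, {frm} frame(s) -> {len(group)} request(s)")
```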
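A plugin-based backend layer usually amounts to a registry mapping a device name to an implementation, so adding Apple Silicon or Intel Arc later means registering one more plugin. The registry below is a generic sketch; the backend names mirror the announcement, but none of this is xLLM-Core's actual extension API.

```python
from typing import Callable, Dict

_BACKENDS: Dict[str, Callable[[], "Backend"]] = {}   # name -> backend factory

class Backend:
    name = "base"
    def matmul(self, a, b):
        raise NotImplementedError

def register_backend(name):
    def wrap(factory):
        _BACKENDS[name] = factory
        return factory
    return wrap

def create_backend(name) -> Backend:
    if name not in _BACKENDS:
        raise ValueError(f"no backend plugin registered for {name!r}")
    return _BACKENDS[name]()

@register_backend("cuda")
class CudaBackend(Backend):
    name = "cuda"
    def matmul(self, a, b):
        ...  # would dispatch to CUDA kernels

@register_backend("rocm")
class RocmBackend(Backend):
    name = "rocm"
    def matmul(self, a, b):
        ...  # would dispatch to ROCm kernels

# "mtia", "apple-silicon", "intel-arc" plugins would register the same way.
backend = create_backend("cuda")
print(type(backend).__name__)   # CudaBackend
```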

Key Cases  

At the Meetup, Professor Yang Hailong of Beihang University will share the JD.com 11.11 (Singles' Day) deployment: xLLM-Core sustained a peak of 40k requests per second, cut machine costs by 90%, and improved business efficiency fivefold.

Open Source Plan  

xLLM-Core 0.9 (Apache 2.0) will be released on-site, including Docker images, Python/C++ APIs, and benchmark scripts; the community plans to ship 1.0 LTS in June 2026 with long-term maintenance and commercial support.

Registration is now open on the xLLM official website, with 300 on-site seats and a live stream for online participants.