Traditional MoE (Mixture of Experts) architectures enhance model capability by increasing the number of experts, but they often run into diminishing marginal returns and high communication costs. Today, Meituan's LongCat team released a new model, LongCat-Flash-Lite, which breaks through this performance bottleneck via a novel paradigm called "Embedding Expansion."


Key Breakthrough: Embedding Expansion Outperforms Expert Expansion

Research by the LongCat team shows that, under certain conditions, expanding the embedding layer achieves a better Pareto front than simply increasing the number of experts. Building on this, LongCat-Flash-Lite has 68.5 billion total parameters, yet thanks to its N-gram embedding layers only 2.9 to 4.5 billion parameters are activated per inference. Over 30 billion of those parameters are allocated to the embedding layer, which uses N-grams to capture local semantics and accurately recognize specific scenarios such as "programming commands," significantly improving understanding accuracy.
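
To make the mechanism concrete, below is a minimal sketch of what an N-gram-expanded embedding layer could look like in PyTorch: a large, hashed bigram table is added on top of the ordinary token embedding, so most parameters sit in lookup tables that cost O(1) per token. The class name, table size, and hash are illustrative assumptions, not LongCat's published design.

```python
import torch
import torch.nn as nn

class BigramExpandedEmbedding(nn.Module):
    """Sketch of embedding expansion: an oversized, hashed bigram table is
    added to the usual token embedding. Sizes and the hash are illustrative
    assumptions, not LongCat's actual configuration."""

    def __init__(self, vocab_size: int, dim: int, bigram_buckets: int = 2_000_000):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        # Most parameters live in this table, but each position reads a
        # single row, so compute stays O(1) per token.
        self.bigram_emb = nn.Embedding(bigram_buckets, dim)
        self.bigram_buckets = bigram_buckets

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        prev_ids = torch.roll(token_ids, shifts=1, dims=1)
        prev_ids[:, 0] = 0  # no left context at the start of the sequence
        # Hash the (previous token, current token) pair into a bucket.
        buckets = (prev_ids * 1_000_003 + token_ids) % self.bigram_buckets
        return self.token_emb(token_ids) + self.bigram_emb(buckets)

# Quick check with small, illustrative sizes.
emb = BigramExpandedEmbedding(vocab_size=32_000, dim=64, bigram_buckets=100_000)
x = torch.randint(0, 32_000, (1, 8))
print(emb(x).shape)  # torch.Size([1, 8, 64])
```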


Vertical Optimization: Full-Chain Evolution from Architecture to System

To transform theoretical sparsity advantages into actual performance, Meituan implemented three system-level optimizations:

  1. Intelligent Parameter Allocation: The embedding layer accounts for 46% of the parameters, and its O(1) lookup cost keeps computation from growing linearly as the parameter count scales.

  2. Specialized Cache and Kernel Fusion: An N-gram Cache mechanism similar to KV Cache was designed, along with customized CUDA kernels (such as AllReduce+RMSNorm fusion), significantly reducing I/O latency (see the cache sketch after this list).

  3. Speculative Decoding Collaboration: Three-step speculative decoding enlarges the effective batch size, and pairing it with a draft model that uses a conventional embedding layer further reduces latency (see the decoding sketch after this list).
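
The report does not detail the N-gram Cache, so the following is only a rough sketch of the analogy to KV Cache, assuming the cache holds recently used rows of a very large n-gram embedding table (which may live off-GPU) so that repeated n-grams during decoding skip the expensive fetch. The class name and interface are hypothetical.

```python
from collections import OrderedDict
import torch

class NGramCache:
    """LRU cache over rows of a large n-gram embedding table (hypothetical interface)."""

    def __init__(self, table: torch.Tensor, capacity: int = 4096):
        self.table = table          # (buckets, dim); may live off-GPU or on slower memory
        self.capacity = capacity
        self._rows = OrderedDict()  # bucket id -> cached embedding row

    def get(self, bucket: int) -> torch.Tensor:
        if bucket in self._rows:            # hit: reuse the cached row
            self._rows.move_to_end(bucket)
            return self._rows[bucket]
        row = self.table[bucket]            # miss: pay the slow fetch once
        self._rows[bucket] = row
        if len(self._rows) > self.capacity:
            self._rows.popitem(last=False)  # evict the least recently used row
        return row
```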
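
Likewise, here is a minimal greedy sketch of 3-step draft-then-verify speculative decoding, the general technique item 3 refers to. `draft_model` and `target_model` are hypothetical callables that map a token sequence to per-position logits; the acceptance rule shown is the simplest greedy variant, not necessarily the team's exact scheme.

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, tokens, k=3):
    """One k-step draft-then-verify step (greedy variant, batch size 1)."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    draft = tokens
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=1)

    # 2) The large target model scores all drafted positions in one forward pass.
    logits = target_model(draft[:, :-1])       # (1, len + k - 1, vocab)
    preferred = logits[:, -k:].argmax(-1)      # target's pick at each drafted slot
    proposed = draft[:, -k:]                   # the draft's k proposals

    # 3) Accept the longest agreeing prefix, then append the target's own
    #    token at the first disagreement (if any), so progress is >= 1.
    agree = (preferred == proposed)[0].long()
    n_accept = int(agree.cumprod(0).sum().item())
    accepted = proposed[:, :n_accept]
    bonus = preferred[:, n_accept:n_accept + 1]
    return torch.cat([tokens, accepted, bonus], dim=1)
```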

Under typical workloads (4K input, 1K output), the model's API delivers generation speeds of 500-700 tokens/s and supports a maximum context length of 256K.

Performance: Intelligent Agents and Code Lead the Way

Across multiple authoritative benchmarks, LongCat-Flash-Lite competes above its size class:

  • Intelligent Agent Tasks: Achieved the highest scores in telecom, retail, and aviation scenarios on $\tau^2$-Bench.

  • Code Capabilities: SWE-Bench accuracy reached 54.4%, and its 33.75 on TerminalBench (terminal command execution) put it well ahead of other models.

  • General Competence: An MMLU score of 85.52 is comparable to Gemini 2.5 Flash-Lite, and performance on AIME24-level math competition problems is stable.

Currently, Meituan has fully open-sourced the model weights, technical report, and the accompanying inference engine SGLang-FluentLLM. Developers can apply for trial access via the LongCat API Open Platform, with a free quota of 50 million tokens per day.