Traditional MoE (Mixture of Experts) architectures enhance model capability by increasing the number of experts, but they often run into diminishing marginal returns and high communication costs. Today, Meituan's LongCat team released a new model, LongCat-Flash-Lite, which breaks through this performance bottleneck via a novel paradigm called "Embedding Expansion."


Key Breakthrough: Embedding Expansion Outperforms Expert Expansion

Research by the LongCat team shows that, under certain conditions, expanding the embedding layer achieves a better Pareto front than simply increasing the number of experts. Building on this, LongCat-Flash-Lite has 68.5 billion total parameters, yet thanks to its N-gram embedding layers only 2.9 to 4.5 billion parameters are activated per inference. Over 30 billion of those parameters are allocated to the embedding layer, which uses N-grams to capture local semantics and accurately recognize specific scenarios such as "programming commands," significantly improving understanding accuracy.
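
To make the mechanism concrete, below is a minimal sketch of what an N-gram-expanded embedding layer could look like in PyTorch: a large, hashed bigram table is added on top of the ordinary token embedding, so most parameters sit in lookup tables that cost O(1) per token. The class name, table size, and hash are illustrative assumptions, not LongCat's published design.

```python
import torch
import torch.nn as nn

class BigramExpandedEmbedding(nn.Module):
    """Sketch of embedding expansion: an oversized, hashed bigram table is
    added to the usual token embedding. Sizes and the hash are illustrative
    assumptions, not LongCat's actual configuration."""

    def __init__(self, vocab_size: int, dim: int, bigram_buckets: int = 2_000_000):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        # Most parameters live in this table, but each position reads a
        # single row, so compute stays O(1) per token.
        self.bigram_emb = nn.Embedding(bigram_buckets, dim)
        self.bigram_buckets = bigram_buckets

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        prev_ids = torch.roll(token_ids, shifts=1, dims=1)
        prev_ids[:, 0] = 0  # no left context at the start of the sequence
        # Hash the (previous token, current token) pair into a bucket.
        buckets = (prev_ids * 1_000_003 + token_ids) % self.bigram_buckets
        return self.token_emb(token_ids) + self.bigram_emb(buckets)

# Quick check with small, illustrative sizes.
emb = BigramExpandedEmbedding(vocab_size=32_000, dim=64, bigram_buckets=100_000)
x = torch.randint(0, 32_000, (1, 8))
print(emb(x).shape)  # torch.Size([1, 8, 64])
```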


Vertical Optimization: Full-Chain Evolution from Architecture to System

To transform theoretical sparsity advantages into actual performance, Meituan implemented three system-level optimizations:

  1. Intelligent Parameter Allocation: The embedding layer accounts for 46% of the parameters, and its O(1) lookup cost keeps computation from growing linearly as the parameter count scales.

  2. Specialized Cache and Kernel Fusion: An N-gram Cache mechanism similar to KV Cache was designed, along with customized CUDA kernels (such as AllReduce+RMSNorm fusion), significantly reducing I/O latency (see the cache sketch after this list).

  3. Speculative Decoding Collaboration: Three-step speculative decoding enlarges the effective batch size, and pairing it with a draft model that uses a conventional embedding layer further reduces latency (see the decoding sketch after this list).
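
The report does not detail the N-gram Cache, so the following is only a rough sketch of the analogy to KV Cache, assuming the cache holds recently used rows of a very large n-gram embedding table (which may live off-GPU) so that repeated n-grams during decoding skip the expensive fetch. The class name and interface are hypothetical.

```python
from collections import OrderedDict
import torch

class NGramCache:
    """LRU cache over rows of a large n-gram embedding table (hypothetical interface)."""

    def __init__(self, table: torch.Tensor, capacity: int = 4096):
        self.table = table          # (buckets, dim); may live off-GPU or on slower memory
        self.capacity = capacity
        self._rows = OrderedDict()  # bucket id -> cached embedding row

    def get(self, bucket: int) -> torch.Tensor:
        if bucket in self._rows:            # hit: reuse the cached row
            self._rows.move_to_end(bucket)
            return self._rows[bucket]
        row = self.table[bucket]            # miss: pay the slow fetch once
        self._rows[bucket] = row
        if len(self._rows) > self.capacity:
            self._rows.popitem(last=False)  # evict the least recently used row
        return row
```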
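
Likewise, here is a minimal greedy sketch of 3-step draft-then-verify speculative decoding, the general technique item 3 refers to. `draft_model` and `target_model` are hypothetical callables that map a token sequence to per-position logits; the acceptance rule shown is the simplest greedy variant, not necessarily the team's exact scheme.

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, tokens, k=3):
    """One k-step draft-then-verify step (greedy variant, batch size 1)."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    draft = tokens
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=1)

    # 2) The large target model scores all drafted positions in one forward pass.
    logits = target_model(draft[:, :-1])       # (1, len + k - 1, vocab)
    preferred = logits[:, -k:].argmax(-1)      # target's pick at each drafted slot
    proposed = draft[:, -k:]                   # the draft's k proposals

    # 3) Accept the longest agreeing prefix, then append the target's own
    #    token at the first disagreement (if any), so progress is >= 1.
    agree = (preferred == proposed)[0].long()
    n_accept = int(agree.cumprod(0).sum().item())
    accepted = proposed[:, :n_accept]
    bonus = preferred[:, n_accept:n_accept + 1]
    return torch.cat([tokens, accepted, bonus], dim=1)
```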

Under typical workloads (4K input, 1K output), the model's API delivers generation speeds of 500-700 tokens/s and supports a maximum context length of 256K.

Performance: Intelligent Agents and Code Lead the Way

Across multiple authoritative benchmarks, LongCat-Flash-Lite competes above its size class:

  • Intelligent Agent Tasks: Achieved the highest scores in telecom, retail, and aviation scenarios on $\tau^2$-Bench.

  • Code Capabilities: SWE-Bench accuracy reached 54.4%, and its 33.75 on TerminalBench (terminal command execution) put it well ahead of other models.

  • General Competence: An MMLU score of 85.52 is comparable to Gemini 2.5 Flash-Lite, and performance on AIME24-level math competition problems is stable.

Currently, Meituan has fully open-sourced the model weights, technical report, and the accompanying inference engine SGLang-FluentLLM. Developers can apply for trial access via the LongCat API Open Platform, with a free quota of 50 million tokens per day.