According to recent reports, Tri Dao, co-author of FlashAttention, together with two Ph.D. students from Princeton University, has launched a new kernel library called QuACK. Notably, they developed it entirely in Python using CuTe-DSL, without any CUDA C++ code. Breaking with the traditional CUDA C++ workflow, it nonetheless runs 33%-50% faster than libraries such as torch.compile and Liger on H100 GPUs.
Tri Dao remarked that getting memory-bound kernels to run fast is not some closely guarded secret; it comes down to handling a few key details precisely. He emphasized that understanding the thread and memory hierarchy of modern accelerators is crucial. As GPU performance optimization matures, developers can now obtain significant speedups in a much friendlier environment by using CuTe-DSL, a Python-based domain-specific language.
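For a sense of what CuTe-DSL code looks like, below is a minimal "hello world" sketch modeled on the pattern in NVIDIA's CuTe-DSL documentation. It assumes the nvidia-cutlass-dsl package and a CUDA-capable GPU; the exact decorator and launch signatures may vary between releases, and this is not code from QuACK itself.

```python
import cutlass
import cutlass.cute as cute

@cute.kernel
def hello_kernel():
    # Device-side code: cute.arch exposes thread/block indices
    tidx, _, _ = cute.arch.thread_idx()
    if tidx == 0:
        cute.printf("Hello from the GPU\n")

@cute.jit
def main():
    # Host-side entry point, JIT-compiled by CuTe-DSL;
    # launch a single block of 32 threads
    hello_kernel().launch(grid=(1, 1, 1), block=(32, 1, 1))

# Context setup call as in the documented example; may differ by version
cutlass.cuda.initialize_cuda_context()
main()
```

The point of the example is what is absent: no separate .cu file, no nvcc invocation, and no host/device boilerplate beyond the two decorators.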
The work quickly drew attention from industry experts. Vijay, a senior architect on NVIDIA's CUTLASS team, praised it, emphasizing that CuTe-DSL's design lets experts like Tri Dao get GPUs running at full efficiency with ease. He also hinted that more exciting releases are coming this year. Horace He of the PyTorch team likewise expressed strong interest, highlighting its advantages for long-sequence processing in particular.
To make the work accessible to more developers, the QuACK authors have written a detailed tutorial covering the concrete steps and code, so that others can use it directly. The tutorial stresses that efficient GPU model training and inference requires optimizing both compute-bound and memory-bound kernels. Matrix multiplication and attention are already heavily optimized, so this research focuses on memory-bound kernels.
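The dividing line between the two kinds of kernels is arithmetic intensity: the ratio of floating-point operations performed to bytes moved. The back-of-the-envelope comparison below is illustrative rather than taken from the QuACK tutorial; the peak figures (~989 BF16 TFLOPS and ~3.35 TB/s of HBM3 bandwidth) are NVIDIA's published specs for the H100 SXM.

```python
# Rough arithmetic-intensity comparison (illustrative, not from the QuACK
# tutorial). H100 SXM peaks: ~989e12 BF16 FLOP/s and ~3.35e12 B/s HBM3.
PEAK_FLOPS = 989e12   # dense BF16 tensor-core throughput
PEAK_BW = 3.35e12     # HBM3 bandwidth

# Ridge point: kernels below this intensity are memory-bound
RIDGE = PEAK_FLOPS / PEAK_BW  # ~295 FLOPs per byte

def matmul_intensity(m, n, k, bytes_per_elem=2):
    # (m,k) @ (k,n): 2*m*n*k FLOPs; read A and B once, write C once
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def elementwise_intensity(flops_per_elem=1, bytes_per_elem=2):
    # One read and one write per element
    return flops_per_elem / (2 * bytes_per_elem)

print(f"ridge point:   {RIDGE:7.1f} FLOPs/byte")
print(f"4096^3 matmul: {matmul_intensity(4096, 4096, 4096):7.1f} FLOPs/byte (compute-bound)")
print(f"elementwise:   {elementwise_intensity():7.2f} FLOPs/byte (memory-bound)")
```

A large matmul lands around 1,365 FLOPs/byte, far above the ridge point, while an elementwise op sits at 0.25; no amount of compute tuning helps the latter, only moving bytes faster does.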
The authors explain that memory-bound kernels have low arithmetic intensity, so their throughput is governed by how many bytes they can move per second rather than by compute. By carefully exploiting the GPU's memory hierarchy and hardware features, they pushed the performance of memory-bound kernels to nearly "speed-of-light" levels, the maximum the memory system allows.
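One way to check how close a memory-bound kernel gets to that limit is to time it and convert the result into effective bandwidth. Below is a minimal measurement sketch in plain PyTorch (not QuACK code); the softmax shape is arbitrary and the ~3,350 GB/s figure again assumes an H100 SXM.

```python
import torch

def effective_bandwidth_gbs(fn, *tensors, iters=100):
    """Time fn and report effective bandwidth in GB/s, counting one read
    and one write of each input tensor per call (a common approximation
    for rowwise/elementwise memory-bound kernels)."""
    for _ in range(10):          # warm-up: exclude compilation/caching
        fn(*tensors)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*tensors)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time is in ms
    bytes_moved = 2 * sum(t.numel() * t.element_size() for t in tensors)
    return bytes_moved / seconds / 1e9

x = torch.randn(16384, 16384, device="cuda", dtype=torch.bfloat16)
bw = effective_bandwidth_gbs(lambda t: torch.softmax(t, dim=-1), x)
print(f"softmax: {bw:.0f} GB/s (H100 SXM peak is ~3350 GB/s)")
```

The closer the measured number gets to the hardware peak, the less headroom remains; this is the "speed-of-light" yardstick against which kernels like QuACK's are judged.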