According to recent reports, Tri Dao, co-author of FlashAttention, together with two Ph.D. students from Princeton University, has launched a new kernel library called QuACK. Notably, they developed it entirely in Python using CuTe-DSL, without any CUDA C++ code. Breaking with the traditional CUDA C++ workflow, it nonetheless runs 33%-50% faster than libraries such as torch.compile and Liger on H100 GPUs.
Tri Dao remarked that getting memory-bound kernels to run fast is not some closely guarded secret; it comes down to handling a few key details precisely. He emphasized that understanding the thread and memory hierarchy of modern accelerators is crucial. As GPU performance optimization matures, developers can now obtain significant speedups in a much friendlier environment by using CuTe-DSL, a Python-based domain-specific language.
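For a sense of what CuTe-DSL code looks like, below is a minimal "hello world" sketch modeled on the pattern in NVIDIA's CuTe-DSL documentation. It assumes the nvidia-cutlass-dsl package and a CUDA-capable GPU; the exact decorator and launch signatures may vary between releases, and this is not code from QuACK itself.

```python
import cutlass
import cutlass.cute as cute

@cute.kernel
def hello_kernel():
    # Device-side code: cute.arch exposes thread/block indices
    tidx, _, _ = cute.arch.thread_idx()
    if tidx == 0:
        cute.printf("Hello from the GPU\n")

@cute.jit
def main():
    # Host-side entry point, JIT-compiled by CuTe-DSL;
    # launch a single block of 32 threads
    hello_kernel().launch(grid=(1, 1, 1), block=(32, 1, 1))

# Context setup call as in the documented example; may differ by version
cutlass.cuda.initialize_cuda_context()
main()
```

The point of the example is what is absent: no separate .cu file, no nvcc invocation, and no host/device boilerplate beyond the two decorators.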
The work quickly drew attention from industry experts. Vijay, a senior architect on NVIDIA's CUTLASS team, praised it, emphasizing that CuTe-DSL's design lets experts like Tri Dao get GPUs running at full efficiency with ease. He also hinted that more exciting releases are coming this year. Horace He of the PyTorch team likewise expressed strong interest, highlighting its advantages for long-sequence processing in particular.
To make the work accessible to more developers, the QuACK authors have written a detailed tutorial covering the concrete steps and code, so that others can use it directly. The tutorial stresses that efficient GPU model training and inference requires optimizing both compute-bound and memory-bound kernels. Matrix multiplication and attention are already heavily optimized, so this research focuses on memory-bound kernels.
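The dividing line between the two kinds of kernels is arithmetic intensity: the ratio of floating-point operations performed to bytes moved. The back-of-the-envelope comparison below is illustrative rather than taken from the QuACK tutorial; the peak figures (~989 BF16 TFLOPS and ~3.35 TB/s of HBM3 bandwidth) are NVIDIA's published specs for the H100 SXM.

```python
# Rough arithmetic-intensity comparison (illustrative, not from the QuACK
# tutorial). H100 SXM peaks: ~989e12 BF16 FLOP/s and ~3.35e12 B/s HBM3.
PEAK_FLOPS = 989e12   # dense BF16 tensor-core throughput
PEAK_BW = 3.35e12     # HBM3 bandwidth

# Ridge point: kernels below this intensity are memory-bound
RIDGE = PEAK_FLOPS / PEAK_BW  # ~295 FLOPs per byte

def matmul_intensity(m, n, k, bytes_per_elem=2):
    # (m,k) @ (k,n): 2*m*n*k FLOPs; read A and B once, write C once
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def elementwise_intensity(flops_per_elem=1, bytes_per_elem=2):
    # One read and one write per element
    return flops_per_elem / (2 * bytes_per_elem)

print(f"ridge point:   {RIDGE:7.1f} FLOPs/byte")
print(f"4096^3 matmul: {matmul_intensity(4096, 4096, 4096):7.1f} FLOPs/byte (compute-bound)")
print(f"elementwise:   {elementwise_intensity():7.2f} FLOPs/byte (memory-bound)")
```

A large matmul lands around 1,365 FLOPs/byte, far above the ridge point, while an elementwise op sits at 0.25; no amount of compute tuning helps the latter, only moving bytes faster does.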
The authors explain that memory-bound kernels have low arithmetic intensity, so their throughput is governed by how many bytes they can move per second rather than by compute. By carefully exploiting the GPU's memory hierarchy and hardware features, they pushed the performance of memory-bound kernels to nearly "speed-of-light" levels, the maximum the memory system allows.
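One way to check how close a memory-bound kernel gets to that limit is to time it and convert the result into effective bandwidth. Below is a minimal measurement sketch in plain PyTorch (not QuACK code); the softmax shape is arbitrary and the ~3,350 GB/s figure again assumes an H100 SXM.

```python
import torch

def effective_bandwidth_gbs(fn, *tensors, iters=100):
    """Time fn and report effective bandwidth in GB/s, counting one read
    and one write of each input tensor per call (a common approximation
    for rowwise/elementwise memory-bound kernels)."""
    for _ in range(10):          # warm-up: exclude compilation/caching
        fn(*tensors)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*tensors)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time is in ms
    bytes_moved = 2 * sum(t.numel() * t.element_size() for t in tensors)
    return bytes_moved / seconds / 1e9

x = torch.randn(16384, 16384, device="cuda", dtype=torch.bfloat16)
bw = effective_bandwidth_gbs(lambda t: torch.softmax(t, dim=-1), x)
print(f"softmax: {bw:.0f} GB/s (H100 SXM peak is ~3350 GB/s)")
```

The closer the measured number gets to the hardware peak, the less headroom remains; this is the "speed-of-light" yardstick against which kernels like QuACK's are judged.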