GPU kernels are central to modern compute stacks and directly determine training and inference efficiency. Kernel development is difficult because it requires hardware expertise and iterative refinement driven by multi-step tool feedback. Since Stanford released KernelBench in February 2025, the LLM4Kernel field has grown rapidly, with increasing interest in using large language models to support or automate kernel generation, optimization, and verification.
This project provides a continuously updated and comprehensive survey of the field, covering both benchmarks and methods. On the methodological side, we categorize existing work into three major directions:
- Agent-based pipelines
- Domain-specific models
- Agentic RL
We include all relevant top-conference papers, arXiv preprints, open-source projects, technical reports, and blogs, aiming to build the most complete resource hub for LLM4Kernel research.
- KernelBench: Can LLMs Write Efficient GPU Kernels?
  - Authors: Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini
  - Institution: Stanford University
  - Task: Torch -> CUDA
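To make the Torch -> CUDA task format concrete, here is a minimal sketch of a KernelBench-style problem and candidate: the benchmark supplies a reference PyTorch module, and the model must return a functionally equivalent replacement backed by a custom CUDA kernel. The `Model`/`ModelNew` naming follows the KernelBench convention; the elementwise-add kernel below is only an illustrative placeholder, not an operator taken from the benchmark.

```python
# Minimal sketch of a KernelBench-style "Torch -> CUDA" task (illustrative only).
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

# Reference implementation supplied by the benchmark.
class Model(nn.Module):
    def forward(self, x, y):
        return x + y

# Candidate produced by the LLM: the same operator as a hand-written CUDA kernel,
# compiled on the fly with torch.utils.cpp_extension.load_inline.
cuda_src = r"""
#include <torch/extension.h>

__global__ void add_kernel(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i] + y[i];
}

torch::Tensor elementwise_add(torch::Tensor x, torch::Tensor y) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(x.data_ptr<float>(), y.data_ptr<float>(),
                                    out.data_ptr<float>(), n);
    return out;
}
"""
cpp_decl = "torch::Tensor elementwise_add(torch::Tensor x, torch::Tensor y);"

ext = load_inline(name="add_ext", cpp_sources=cpp_decl, cuda_sources=cuda_src,
                  functions=["elementwise_add"], verbose=False)

class ModelNew(nn.Module):
    def forward(self, x, y):
        return ext.elementwise_add(x, y)

if __name__ == "__main__":
    x = torch.randn(1 << 20, device="cuda")
    y = torch.randn(1 << 20, device="cuda")
    # Evaluation first checks numerical equivalence against the reference,
    # then times the candidate kernel to measure speedup.
    assert torch.allclose(Model()(x, y), ModelNew()(x, y))
```

Benchmarks in this family typically score a candidate on whether it compiles, whether it matches the reference numerically, and how much faster it runs than the baseline.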
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
  - Authors: Jianling Li, ShangZhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun
  - Institution: Tianjin University, Tsinghua University
  - Task: Torch | NL -> Triton
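For comparison, a Triton operator of the kind TritonBench targets is a Python function decorated with `@triton.jit` plus a launch wrapper. The elementwise-add below is an illustrative sketch of that shape, not an operator drawn from the benchmark itself.

```python
# Minimal sketch of a Triton operator (illustrative, not from TritonBench).
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Launch wrapper: one program per 1024-element block.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    x = torch.randn(1 << 20, device="cuda")
    y = torch.randn(1 << 20, device="cuda")
    assert torch.allclose(add(x, y), x + y)
```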
- ComputeEval: Evaluating Large Language Models for CUDA Code Generation
  - Institution: NVIDIA
  - Task: NL -> CUDA
- BackendBench: An Evaluation Suite for Testing How Well LLMs and Humans Can Write PyTorch Backends
  - Institution: Meta
  - Task: Torch -> CUDA | Triton
- MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation
  - Authors: Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang
  - Institution: Nanjing University
  - Task: Torch -> CUDA | Pallas | AscendC
- robust-kbench: Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization
  - Authors: Robert Tjarko Lange, Qi Sun, Aaditya Prasad, Maxence Faldor, Yujin Tang, David Ha
  - Institution: Sakana AI
  - Task: Torch -> CUDA
- gpuFLOPBench: Counting Without Running: Evaluating LLMs’ Reasoning About Code Complexity
  - Authors: Gregory Bolet, Giorgis Georgakoudis, Konstantinos Parasyris, Harshitha Menon, Niranjan Hasabnis, Kirk W. Cameron, Gal Oren
  - Institution: Stanford University
  - Task: CUDA -> FLOPs
- STARK: Strategic Team of Agents for Refining Kernels
  - Authors: Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, Shuang Yang
  - Institution: Meta Ranking AI Research
  - Task: Torch -> CUDA
- QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach
  - Authors: Shouyang Dong, Yuanbo Wen, Jun Bi, Di Huang, Jiaming Guo, Jianxing Xu, Ruibai Xu, Xinkai Song, Yifan Hao, Xuehai Zhou, Tianshi Chen, Qi Guo, Yunji Chen
  - Institution: University of Science and Technology of China, Cambricon Technologies, Institute of Computing Technology, Institute of Software
  - Task: CUDA <-> BangC <-> HIP <-> VNNI
- QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm
  - Authors: Qirui Zhou, Shaohui Peng, Weiqiang Xiong, Haixin Chen, Yuanbo Wen, Haochen Li, Ling Li, Qi Guo, Yongwei Zhao, Ke Gao, Ruizhi Chen, Yanjun Wu, Chen Zhao, Yunji Chen
  - Institution: Institute of Software, Institute of Computing Technology
  - Task: NL -> CUDA (Attention)
- QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives
  - Authors: Xuzhi Zhang, Shaohui Peng, Qirui Zhou, Yuanbo Wen, Qi Guo, Ruizhi Chen, Xinguo Zhu, Weiqiang Xiong, Haixin Chen, Congying Ma, Ke Gao, Chen Zhao, Yanjun Wu, Yunji Chen, Ling Li
  - Institution: Institute of Computing Technology, Institute of Software
  - Task: NL -> Hardware-specific Tensor Operators (RISC-V, ARM, GPU)
- QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models
  - Authors: Qirui Zhou, Yuanbo Wen, Ruizhi Chen, Ke Gao, Weiqiang Xiong, Ling Li, Qi Guo, Yanjun Wu, Yunji Chen
  - Institution: Institute of Computing Technology, Institute of Software
  - Task: NL -> CUDA (GEMM)
- AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs
  - Authors: Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, Maosong Sun
  - Institution: Tsinghua University
  - Task: Torch -> Triton
- QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
  - Authors: Changxin Ke, Rui Zhang, Shuo Wang, Li Ding, Guangli Li, Yuanbo Wen, Shuoming Zhang, Ruiyuan Xu, Jin Qin, Jiaming Guo, Chenxi Wang, Ling Li, Qi Guo, Yunji Chen
  - Institution: Institute of Computing Technology
  - Task: C -> CUDA
- QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation
  - Authors: Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, Yanjun Wu, Chen Zhao, Ling Li
  - Institution: Institute of Software, Institute of Computing Technology
  - Task: Torch -> Triton
- Kevin: Multi-Turn RL for Generating CUDA Kernels
  - Authors: Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, Silas Alberti
  - Institution: Stanford University
  - Task: Torch -> CUDA
Feel free to open an issue or submit a pull request to correct errors or to add work that is not yet included in this project. You can also email us at [email protected] for discussion or collaboration.
If you find this work useful, please consider citing us:
@article{llm4kernel,
  title={LLM4Kernel: A Survey of Large Language Models for GPU Kernel Development},
  author={Changxin Ke},
  year={2025},
  url={https://github.com/kcxain/Awesome-LLM4Kernel}
}