GPU kernels are central to modern compute stacks and directly determine training and inference efficiency. Kernel development is difficult because it requires hardware expertise and iterative refinement driven by multi-step tool feedback. Since Stanford released KernelBench in February 2025, the LLM4Kernel field has grown rapidly, with increasing interest in using large language models to support or automate kernel generation, optimization, and verification.
This project provides a continuously updated and comprehensive survey of the field, covering both benchmarks and methods. On the methodological side, we categorize existing work into three major directions:
- Agent-based pipelines
- Domain-specific models
- Agentic RL
We include all relevant top-conference papers, arXiv preprints, open-source projects, technical reports, and blogs, aiming to build the most complete resource hub for LLM4Kernel research.
- KernelBench: Can LLMs Write Efficient GPU Kernels?
  - Authors: Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini
  - Institution: Stanford University
  - Task: Torch -> CUDA
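To make the Torch -> CUDA task format concrete, here is a minimal sketch of a KernelBench-style problem and candidate: the benchmark supplies a reference PyTorch module, and the model must return a functionally equivalent replacement backed by a custom CUDA kernel. The `Model`/`ModelNew` naming follows the KernelBench convention; the elementwise-add kernel below is only an illustrative placeholder, not an operator taken from the benchmark.

```python
# Minimal sketch of a KernelBench-style "Torch -> CUDA" task (illustrative only).
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

# Reference implementation supplied by the benchmark.
class Model(nn.Module):
    def forward(self, x, y):
        return x + y

# Candidate produced by the LLM: the same operator as a hand-written CUDA kernel,
# compiled on the fly with torch.utils.cpp_extension.load_inline.
cuda_src = r"""
#include <torch/extension.h>

__global__ void add_kernel(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i] + y[i];
}

torch::Tensor elementwise_add(torch::Tensor x, torch::Tensor y) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(x.data_ptr<float>(), y.data_ptr<float>(),
                                    out.data_ptr<float>(), n);
    return out;
}
"""
cpp_decl = "torch::Tensor elementwise_add(torch::Tensor x, torch::Tensor y);"

ext = load_inline(name="add_ext", cpp_sources=cpp_decl, cuda_sources=cuda_src,
                  functions=["elementwise_add"], verbose=False)

class ModelNew(nn.Module):
    def forward(self, x, y):
        return ext.elementwise_add(x, y)

if __name__ == "__main__":
    x = torch.randn(1 << 20, device="cuda")
    y = torch.randn(1 << 20, device="cuda")
    # Evaluation first checks numerical equivalence against the reference,
    # then times the candidate kernel to measure speedup.
    assert torch.allclose(Model()(x, y), ModelNew()(x, y))
```

Benchmarks in this family typically score a candidate on whether it compiles, whether it matches the reference numerically, and how much faster it runs than the baseline.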
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
  - Authors: Jianling Li, ShangZhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun
  - Institution: Tianjin University, Tsinghua University
  - Task: Torch | NL -> Triton
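For comparison, a Triton operator of the kind TritonBench targets is a Python function decorated with `@triton.jit` plus a launch wrapper. The elementwise-add below is an illustrative sketch of that shape, not an operator drawn from the benchmark itself.

```python
# Minimal sketch of a Triton operator (illustrative, not from TritonBench).
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Launch wrapper: one program per 1024-element block.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    x = torch.randn(1 << 20, device="cuda")
    y = torch.randn(1 << 20, device="cuda")
    assert torch.allclose(add(x, y), x + y)
```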
- ComputeEval: Evaluating Large Language Models for CUDA Code Generation
  - Institution: NVIDIA
  - Task: NL -> CUDA
- BackendBench: An Evaluation Suite for Testing How Well LLMs and Humans Can Write PyTorch Backends
  - Institution: Meta
  - Task: Torch -> CUDA | Triton
- MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation
  - Authors: Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang
  - Institution: Nanjing University
  - Task: Torch -> CUDA | Pallas | AscendC
- robust-kbench: Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization
  - Authors: Robert Tjarko Lange, Qi Sun, Aaditya Prasad, Maxence Faldor, Yujin Tang, David Ha
  - Institution: Sakana AI
  - Task: Torch -> CUDA
- gpuFLOPBench: Counting Without Running: Evaluating LLMs’ Reasoning About Code Complexity
  - Authors: Gregory Bolet, Giorgis Georgakoudis, Konstantinos Parasyris, Harshitha Menon, Niranjan Hasabnis, Kirk W. Cameron, Gal Oren
  - Institution: Stanford University
  - Task: CUDA -> FLOPs
- STARK: Strategic Team of Agents for Refining Kernels
  - Authors: Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, Shuang Yang
  - Institution: Meta Ranking AI Research
  - Task: Torch -> CUDA
- QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach
  - Authors: Shouyang Dong, Yuanbo Wen, Jun Bi, Di Huang, Jiaming Guo, Jianxing Xu, Ruibai Xu, Xinkai Song, Yifan Hao, Xuehai Zhou, Tianshi Chen, Qi Guo, Yunji Chen
  - Institution: University of Science and Technology of China, Cambricon Technologies, Institute of Computing Technology, Institute of Software
  - Task: CUDA <-> BangC <-> HIP <-> VNNI
- QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm
  - Authors: Qirui Zhou, Shaohui Peng, Weiqiang Xiong, Haixin Chen, Yuanbo Wen, Haochen Li, Ling Li, Qi Guo, Yongwei Zhao, Ke Gao, Ruizhi Chen, Yanjun Wu, Chen Zhao, Yunji Chen
  - Institution: Institute of Software, Institute of Computing Technology
  - Task: NL -> CUDA (Attention)
- QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives
  - Authors: Xuzhi Zhang, Shaohui Peng, Qirui Zhou, Yuanbo Wen, Qi Guo, Ruizhi Chen, Xinguo Zhu, Weiqiang Xiong, Haixin Chen, Congying Ma, Ke Gao, Chen Zhao, Yanjun Wu, Yunji Chen, Ling Li
  - Institution: Institute of Computing Technology, Institute of Software
  - Task: NL -> Hardware-specific Tensor Operators (RISC-V, ARM, GPU)
- QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models
  - Authors: Qirui Zhou, Yuanbo Wen, Ruizhi Chen, Ke Gao, Weiqiang Xiong, Ling Li, Qi Guo, Yanjun Wu, Yunji Chen
  - Institution: Institute of Computing Technology, Institute of Software
  - Task: NL -> CUDA (GEMM)
- AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs
  - Authors: Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, Maosong Sun
  - Institution: Tsinghua University
  - Task: Torch -> Triton
- QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
  - Authors: Changxin Ke, Rui Zhang, Shuo Wang, Li Ding, Guangli Li, Yuanbo Wen, Shuoming Zhang, Ruiyuan Xu, Jin Qin, Jiaming Guo, Chenxi Wang, Ling Li, Qi Guo, Yunji Chen
  - Institution: Institute of Computing Technology
  - Task: C -> CUDA
- QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation
  - Authors: Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, Yanjun Wu, Chen Zhao, Ling Li
  - Institution: Institute of Software, Institute of Computing Technology
  - Task: Torch -> Triton
- Kevin: Multi-Turn RL for Generating CUDA Kernels
  - Authors: Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, Silas Alberti
  - Institution: Stanford University
  - Task: Torch -> CUDA
Feel free to open an issue or submit a pull request to correct errors or to add work that is not yet included in this project. You can also email us at [email protected] for discussion or collaboration.
If you find this work useful, please consider citing us:
@article{llm4kernel,
  title={LLM4Kernel: A Survey of Large Language Models for GPU Kernel Development},
  author={Changxin Ke},
  year={2025},
  url={https://github.com/kcxain/Awesome-LLM4Kernel}
}