Stars
A lightweight design for computation-communication overlap.
DeepEP: an efficient expert-parallel communication library
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
slime is an LLM post-training framework for RL Scaling.
[NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
A list of awesome compiler projects and papers for tensor computation and deep learning.
Enhanced compiler frontend. Supports Auto Compute + Auto Schedule + Auto Tensorize for tensor compilers.
Distributed compiler based on Triton for parallel systems
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA (+ more DSLs)
GitHub Pages template based on HTML and Markdown for personal, portfolio-based websites.
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models across text, vision, audio, and multimodal tasks, for both inference and training (a minimal usage sketch follows this list).
⏰ Collaboratively track worldwide conference deadlines (website, Python CLI, WeChat applet). If you find it useful, please star the project.
FlashAttention tutorial written in Python, Triton, CUDA, and CUTLASS (a reference attention sketch in PyTorch follows this list).
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C's level of abstraction.
Collection of benchmarks to measure basic GPU capabilities.
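
A minimal sketch of the inference-side usage the 🤗 Transformers entry above refers to. The `pipeline` API is part of the library; the task and the model name `distilgpt2` are illustrative choices, not taken from the starred repository.

```python
# Minimal sketch: loading a model and running text generation with 🤗 Transformers.
# The pipeline API is real; the chosen task and model are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
result = generator("GPU kernels are", max_new_tokens=20)
print(result[0]["generated_text"])
```

The same `AutoModel`/`AutoTokenizer` classes back both inference and fine-tuning, which is what "model-definition framework" refers to in the description.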
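For context on the FlashAttention tutorial entry, here is a hedged reference sketch (not code from that repository) of naive scaled dot-product attention in PyTorch; FlashAttention-style kernels compute the same result while avoiding materializing the full seq_len x seq_len score matrix.

```python
# Reference (naive) scaled dot-product attention. Shapes and values are illustrative.
import math
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, heads, seq_len, seq_len)
    return torch.softmax(scores, dim=-1) @ v                  # (batch, heads, seq_len, head_dim)

q = k = v = torch.randn(1, 2, 16, 64)
print(naive_attention(q, k, v).shape)  # torch.Size([1, 2, 16, 64])
```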