Stars
Optimized primitives for collective multi-GPU communication
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
An FP8 flash attention implementation for the Ada architecture, built with the cutlass repository
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
A high-performance inference engine for LLMs, optimized for diverse AI accelerators.
[CVPR 2023] DepGraph: Towards Any Structural Pruning; LLMs, Vision Foundation Models, etc.
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
Examples for Recommenders - easy to train and deploy on accelerated infrastructure.
PyTorch domain library for recommendation systems
Benchmark code for the "Online normalizer calculation for softmax" paper (a minimal sketch of the online recurrence appears after this list)
A list of papers, docs, and code about model quantization. This repo aims to provide reference material for model quantization research and is continuously improved; PRs adding missing works are welcome.
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
A beginner-friendly tutorial on model compression; PDF download: https://github.com/datawhalechina/awesome-compression/releases
No-code multi-agent framework to build LLM Agents, workflows and applications with your data
Cost-efficient and pluggable infrastructure components for GenAI inference
📚A curated list of Awesome LLM/VLM Inference Papers with Code: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (see the smoothing sketch after this list)
This repository contains integer operators on GPUs for PyTorch.
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
SGLang is a fast serving framework for large language models and vision language models.
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
FlashMLA: Efficient Multi-head Latent Attention Kernels
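For the "Online normalizer calculation for softmax" entry above: the paper's trick is to fuse the usual two reduction passes (max, then sum of exponentials) into a single streaming recurrence. A minimal Python sketch of that recurrence (the function name and the final normalization pass are mine, not the benchmark repo's code):

```python
import math

def online_softmax(xs):
    """Compute softmax with a single pass to find the normalizer.
    Maintains the running max m and the running sum d of exp(x - m);
    whenever m grows, the old sum is rescaled by exp(m_old - m_new)."""
    m = float("-inf")  # running maximum
    d = 0.0            # running normalizer: sum of exp(x_i - m)
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

print(online_softmax([1.0, 2.0, 3.0]))  # matches the standard two-pass softmax
```

The same rescaling identity is what lets flash-attention-style kernels (including the FlashMLA entry above) process attention scores block by block without materializing a full row of scores.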
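And for the SmoothQuant entry: the core idea is to migrate quantization difficulty from activation outlier channels into the weights via a per-channel scale s_j = max|X_j|^α / max|W_j|^(1-α), which leaves the layer's output mathematically unchanged. A minimal NumPy sketch of that formula (variable names and the α = 0.5 default are illustrative, not the repo's API):

```python
import numpy as np

def smoothquant_scales(X, W, alpha=0.5):
    """Per-input-channel smoothing factors s_j = max|X_j|^a / max|W_j|^(1-a).
    X: (tokens, in_features) calibration activations; W: (in_features, out_features)."""
    act_max = np.abs(X).max(axis=0)  # per-channel activation ranges
    w_max = np.abs(W).max(axis=1)    # per-channel weight ranges
    return act_max ** alpha / w_max ** (1 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))
X[:, 3] *= 50.0                      # channel 3 is an activation outlier
W = rng.normal(size=(64, 32))

s = smoothquant_scales(X, W)
X_s, W_s = X / s, s[:, None] * W     # X @ W == (X / s) @ (diag(s) @ W)
assert np.allclose(X @ W, X_s @ W_s)
# X_s has a far flatter per-channel range than X, so simple per-tensor
# INT8 quantization of both X_s and W_s loses much less accuracy.
```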