Stars
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmers to perform these operations from within CUDA kernels and on CUDA streams.
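For a sense of the programming model, here is a minimal sketch (assuming a standard NVSHMEM install; not taken from the repo) in which each PE writes directly into its neighbor's symmetric buffer from inside a CUDA kernel, with no host-side send/recv:

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cstdio>

__global__ void put_to_neighbor(int *sym_buf, int my_pe, int n_pes) {
    if (threadIdx.x == 0) {
        int peer = (my_pe + 1) % n_pes;
        // One-sided put into the neighbor's copy of the symmetric buffer.
        nvshmem_int_p(sym_buf, my_pe, peer);
    }
}

int main() {
    nvshmem_init();
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();

    // Symmetric allocation: every PE gets a buffer at the same offset.
    int *sym_buf = (int *) nvshmem_malloc(sizeof(int));

    put_to_neighbor<<<1, 32>>>(sym_buf, my_pe, n_pes);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();   // all puts are now globally visible

    int received;
    cudaMemcpy(&received, sym_buf, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", my_pe, received);

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}
```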
Implements ncclSendrecv, ncclGather, ncclScatter, and ncclAlltoall using ncclSend and ncclRecv
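A hedged sketch of the composition, following the grouped point-to-point pattern from the NCCL documentation: each rank posts one send and one recv per peer inside a single group call, so all transfers progress together without deadlock. Gather, scatter, and sendrecv fall out of the same pattern by varying which ranks post which calls. `comm` and `stream` are assumed to be initialized elsewhere.

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

#define NCCLCHECK(cmd) do {                         \
    ncclResult_t r = (cmd);                         \
    if (r != ncclSuccess) return r;                 \
} while (0)

// Exchanges count_per_rank floats with every peer (including self).
ncclResult_t alltoall(const float *sendbuf, float *recvbuf,
                      size_t count_per_rank, int nranks,
                      ncclComm_t comm, cudaStream_t stream) {
    NCCLCHECK(ncclGroupStart());
    for (int peer = 0; peer < nranks; ++peer) {
        NCCLCHECK(ncclSend(sendbuf + peer * count_per_rank,
                           count_per_rank, ncclFloat, peer, comm, stream));
        NCCLCHECK(ncclRecv(recvbuf + peer * count_per_rank,
                           count_per_rank, ncclFloat, peer, comm, stream));
    }
    NCCLCHECK(ncclGroupEnd());
    return ncclSuccess;
}
```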
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training
Butterfingrz / hp_rms_norm
Forked from HydraQYH/hp_rms_norm. High-performance RMSNorm implemented using SM core storage (registers and shared memory)
Tile primitives for speedy kernels
A lightweight design for computation-communication overlap.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also includes components to create Python and C++ runtimes that orchestrate inference execution.
High-performance RMSNorm implemented using SM core storage (registers and shared memory)
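Since the tagline names the technique without showing it, here is a minimal sketch (not the repo's actual kernel) of an RMSNorm kept in on-SM storage: each thread accumulates its partial sum of squares in registers, warps reduce through register shuffles, and shared memory is used only to combine per-warp partials and broadcast the final scale. One block per row; assumes blockDim.x is a multiple of 32.

```cuda
#include <cuda_runtime.h>

__global__ void rms_norm_kernel(const float *__restrict__ x,
                                const float *__restrict__ weight,
                                float *__restrict__ y,
                                int hidden_size, float eps) {
    const float *row_in  = x + (size_t)blockIdx.x * hidden_size;
    float       *row_out = y + (size_t)blockIdx.x * hidden_size;

    // Per-thread partial sum of squares lives entirely in registers.
    float sumsq = 0.f;
    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x) {
        float v = row_in[i];
        sumsq += v * v;
    }

    // Intra-warp tree reduction through the register file.
    for (int offset = 16; offset > 0; offset >>= 1)
        sumsq += __shfl_down_sync(0xffffffffu, sumsq, offset);

    // Shared memory only combines per-warp partials and broadcasts the scale.
    __shared__ float warp_sums[32];
    int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;
    if (lane == 0) warp_sums[warp] = sumsq;
    __syncthreads();

    __shared__ float inv_rms;
    if (threadIdx.x == 0) {
        float total = 0.f;
        int n_warps = (blockDim.x + 31) / 32;
        for (int w = 0; w < n_warps; ++w) total += warp_sums[w];
        inv_rms = rsqrtf(total / hidden_size + eps);
    }
    __syncthreads();

    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x)
        row_out[i] = row_in[i] * inv_rms * weight[i];
}
```

Launched as `rms_norm_kernel<<<num_rows, 256>>>(x, w, y, hidden_size, 1e-6f)`, one block per row.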
Expert Specialization MoE Solution based on CUTLASS
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference
FlashMLA: Efficient Multi-head Latent Attention Kernels
Ring attention implementation with flash attention
🚀 Efficient implementations of state-of-the-art linear attention models
FlashInfer: Kernel Library for LLM Serving
A Kubernetes-native platform for orchestrating distributed LLM inference at scale
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
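As a rough illustration of the quantization step such kernels rely on, here is a hedged sketch of per-tile symmetric INT8 quantization of Q/K before the integer matmul; this is an assumed scheme for illustration, not SageAttention's actual code. One CUDA block quantizes one BLOCK_SIZE-element tile: it finds the tile's absmax, derives scale = absmax / 127, and writes INT8 values plus the per-tile scale used to rescale QK^T later.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

constexpr int BLOCK_SIZE = 256;  // elements per quantization tile (assumed)

__global__ void quantize_tiles_int8(const float *__restrict__ in,
                                    int8_t *__restrict__ out,
                                    float *__restrict__ scales) {
    const float *tile = in + (size_t)blockIdx.x * BLOCK_SIZE;

    // Block-wide absmax reduction in shared memory (blockDim.x == BLOCK_SIZE).
    __shared__ float red[BLOCK_SIZE];
    red[threadIdx.x] = fabsf(tile[threadIdx.x]);
    __syncthreads();
    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            red[threadIdx.x] = fmaxf(red[threadIdx.x], red[threadIdx.x + s]);
        __syncthreads();
    }

    __shared__ float scale;
    if (threadIdx.x == 0) {
        scale = red[0] > 0.f ? red[0] / 127.f : 1.f;
        scales[blockIdx.x] = scale;  // kept for dequantizing the QK^T result
    }
    __syncthreads();

    // Symmetric round-to-nearest quantization to INT8.
    int q = __float2int_rn(tile[threadIdx.x] / scale);
    out[(size_t)blockIdx.x * BLOCK_SIZE + threadIdx.x] =
        (int8_t)max(-127, min(127, q));
}
```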