
Butterfingrz

Showing results

Cute solutions to high-performance CUDA kernels

C++ 1 1 Updated Jan 16, 2026

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 450 54 Updated Dec 31, 2025

Uses ncclSend and ncclRecv to implement ncclSendrecv, ncclGather, ncclScatter, and ncclAlltoall

Cuda 8 3 Updated Mar 1, 2022
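The entry above composes collectives from NCCL's point-to-point primitives. As a hedged, in-process sketch (pure Python, not NCCL API calls), an all-to-all among R ranks reduces to R×R paired send/recv exchanges; with real NCCL each pair would be an ncclSend/ncclRecv issued inside a group:

```python
# Hedged sketch: an all-to-all composed from pairwise send/recv, mirroring
# how ncclAlltoall can be built from ncclSend/ncclRecv. Ranks and buffers
# are simulated in-process; the function name is illustrative.

def alltoall(send_bufs):
    """send_bufs[r][p] is the chunk rank r sends to rank p.
    Returns recv_bufs with recv_bufs[p][r] == send_bufs[r][p]."""
    nranks = len(send_bufs)
    recv_bufs = [[None] * nranks for _ in range(nranks)]
    # Each (r, p) pair exchanges one chunk; with real NCCL these would be
    # Send/Recv calls between ncclGroupStart()/ncclGroupEnd().
    for r in range(nranks):
        for p in range(nranks):
            recv_bufs[p][r] = send_bufs[r][p]  # "send" from r, "recv" at p
    return recv_bufs

chunks = [[f"{src}->{dst}" for dst in range(3)] for src in range(3)]
out = alltoall(chunks)
# out[dst][src] now holds the chunk sent from src to dst
```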

A tutorial for CUDA & PyTorch

C++ 177 37 Updated Jan 21, 2025

A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training

Python 619 36 Updated Jan 17, 2026

High-performance RMSNorm implementation using SM-core storage (registers and shared memory)

Cuda 1 Updated Jan 12, 2026
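For reference, the operation such kernels accelerate (by keeping each row resident in registers/shared memory rather than round-tripping through global memory) is simple; a minimal pure-Python sketch, with the `gamma` scale and `eps` names chosen for illustration:

```python
import math

# Hedged sketch: a reference RMSNorm, y_i = gamma_i * x_i / rms(x),
# where rms(x) = sqrt(mean(x^2) + eps). This is the math only; the CUDA
# kernel above is about where the row lives on-chip, not the formula.
def rmsnorm(x, gamma, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

y = rmsnorm([1.0, 2.0, 2.0], [1.0, 1.0, 1.0])
# rms(x) = sqrt((1 + 4 + 4) / 3 + eps) ~= 1.732
```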

Tile primitives for speedy kernels

Cuda 3,087 225 Updated Jan 17, 2026

A lightweight design for computation-communication overlap.

Cuda 212 10 Updated Dec 25, 2025

A Quirky Assortment of CuTe Kernels

Python 751 73 Updated Jan 14, 2026

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python 12,664 2,019 Updated Jan 19, 2026

High-performance RMSNorm implementation using SM-core storage (registers and shared memory)

Cuda 25 1 Updated Jan 17, 2026

Expert-specialization MoE solution based on CUTLASS

Cuda 25 1 Updated Dec 24, 2025
Python 78 17 Updated Jan 16, 2026

CUTLASS and CuTe Examples

Cuda 117 14 Updated Nov 30, 2025
C++ 41 6 Updated Nov 1, 2025
Python 1,522 220 Updated Jun 26, 2025

Some funny cute/cuteDSL code snippets

Python 11 Updated Oct 31, 2025

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference

Python 627 77 Updated Jan 15, 2026

common in-memory tensor structure

C++ 1,143 156 Updated Dec 11, 2025

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ 11,981 936 Updated Jan 16, 2026

Ring attention implementation with flash attention

Python 964 93 Updated Sep 10, 2025
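Ring attention, as in the entry above, shards the KV cache across ranks and rotates blocks around a ring so every rank's queries eventually attend to every KV block. A hedged sketch of just the rotation schedule (the attention math itself is handled by flash attention on each step):

```python
# Hedged sketch of the ring schedule in ring attention: each rank starts
# with its own KV block and, over nranks steps, receives the next block
# from its ring neighbor while attending local queries against whatever
# block it currently holds. Function name is illustrative.
def ring_schedule(nranks):
    """Returns visit[r] = the sequence of KV-block owners seen by rank r."""
    return [[(r - step) % nranks for step in range(nranks)]
            for r in range(nranks)]

sched = ring_schedule(4)
# After nranks steps, every rank has seen every KV block exactly once.
```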

🚀 Efficient implementations of state-of-the-art linear attention models

Python 4,250 353 Updated Jan 17, 2026
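The linear attention models in the entry above replace the softmax score with a kernelized one, which turns attention into a constant-size running state: S_t = S_{t-1} + k_t v_tᵀ, queried as o_t = q_t S_t. A hedged pure-Python sketch with tiny dimensions (no normalization term, for clarity):

```python
# Hedged sketch of the linear-attention recurrence that such libraries
# implement efficiently on GPU. State S accumulates outer products k v^T;
# each output is the query contracted against S. Names are illustrative.
def linear_attention(qs, ks, vs):
    dk, dv = len(ks[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(dk)]  # running sum of k v^T
    outs = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(dk):              # S += k v^T
            for j in range(dv):
                S[i][j] += k[i] * v[j]
        outs.append([sum(q[i] * S[i][j] for i in range(dk))
                     for j in range(dv)])  # o = q S
    return outs

outs = linear_attention([[1, 0], [1, 1]], [[1, 0], [0, 1]], [[2, 3], [5, 7]])
```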
Jupyter Notebook 130 14 Updated Nov 11, 2024

FlashInfer: Kernel Library for LLM Serving

Python 4,696 656 Updated Jan 19, 2026

A Kubernetes-native platform for orchestrating distributed LLM inference at scale

Go 8 3 Updated Jan 11, 2026

Async RL Training at Scale

Python 1,010 175 Updated Jan 19, 2026
Python 578 60 Updated Sep 23, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda 3,060 317 Updated Jan 17, 2026
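Quantized-attention kernels like the one above get their speedup by running the score matmuls in low precision. A hedged sketch of the basic ingredient, symmetric INT8 quantization of a tensor before the matmul (the scale choice and granularity here are illustrative, not the repo's exact scheme):

```python
# Hedged sketch: per-tensor symmetric INT8 quantization/dequantization.
# Real quantized-attention kernels typically use finer granularity
# (per-block or per-channel scales) plus outlier handling.
def quantize_int8(xs):
    scale = max(abs(v) for v in xs) / 127.0 or 1.0  # avoid scale == 0
    q = [max(-127, min(127, round(v / scale))) for v in xs]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

q, s = quantize_int8([0.5, -1.0, 0.25])
# The largest-magnitude value maps to +/-127; others scale proportionally.
```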
C++ 340 32 Updated Jan 4, 2026