Stars
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmers to perform these operations from within CUDA kernels and on CUDA streams.
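For a sense of the programming model, here is a minimal sketch (assuming a standard NVSHMEM install; not taken from the repo) in which each PE writes directly into its neighbor's symmetric buffer from inside a CUDA kernel, with no host-side send/recv:

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cstdio>

__global__ void put_to_neighbor(int *sym_buf, int my_pe, int n_pes) {
    if (threadIdx.x == 0) {
        int peer = (my_pe + 1) % n_pes;
        // One-sided put into the neighbor's copy of the symmetric buffer.
        nvshmem_int_p(sym_buf, my_pe, peer);
    }
}

int main() {
    nvshmem_init();
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();

    // Symmetric allocation: every PE gets a buffer at the same offset.
    int *sym_buf = (int *) nvshmem_malloc(sizeof(int));

    put_to_neighbor<<<1, 32>>>(sym_buf, my_pe, n_pes);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();   // all puts are now globally visible

    int received;
    cudaMemcpy(&received, sym_buf, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", my_pe, received);

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}
```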
Implements ncclSendrecv, ncclGather, ncclScatter, and ncclAlltoall using ncclSend and ncclRecv
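A hedged sketch of the composition, following the grouped point-to-point pattern from the NCCL documentation: each rank posts one send and one recv per peer inside a single group call, so all transfers progress together without deadlock. Gather, scatter, and sendrecv fall out of the same pattern by varying which ranks post which calls. `comm` and `stream` are assumed to be initialized elsewhere.

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

#define NCCLCHECK(cmd) do {                         \
    ncclResult_t r = (cmd);                         \
    if (r != ncclSuccess) return r;                 \
} while (0)

// Exchanges count_per_rank floats with every peer (including self).
ncclResult_t alltoall(const float *sendbuf, float *recvbuf,
                      size_t count_per_rank, int nranks,
                      ncclComm_t comm, cudaStream_t stream) {
    NCCLCHECK(ncclGroupStart());
    for (int peer = 0; peer < nranks; ++peer) {
        NCCLCHECK(ncclSend(sendbuf + peer * count_per_rank,
                           count_per_rank, ncclFloat, peer, comm, stream));
        NCCLCHECK(ncclRecv(recvbuf + peer * count_per_rank,
                           count_per_rank, ncclFloat, peer, comm, stream));
    }
    NCCLCHECK(ncclGroupEnd());
    return ncclSuccess;
}
```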
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training
Butterfingrz / hp_rms_norm
Forked from HydraQYH/hp_rms_norm. High-performance RMSNorm implemented using SM core storage (registers and shared memory)
Tile primitives for speedy kernels
A lightweight design for computation-communication overlap.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also includes components to create Python and C++ runtimes that orchestrate inference execution.
High-performance RMSNorm implemented using SM core storage (registers and shared memory)
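Since the tagline names the technique without showing it, here is a minimal sketch (not the repo's actual kernel) of an RMSNorm kept in on-SM storage: each thread accumulates its partial sum of squares in registers, warps reduce through register shuffles, and shared memory is used only to combine per-warp partials and broadcast the final scale. One block per row; assumes blockDim.x is a multiple of 32.

```cuda
#include <cuda_runtime.h>

__global__ void rms_norm_kernel(const float *__restrict__ x,
                                const float *__restrict__ weight,
                                float *__restrict__ y,
                                int hidden_size, float eps) {
    const float *row_in  = x + (size_t)blockIdx.x * hidden_size;
    float       *row_out = y + (size_t)blockIdx.x * hidden_size;

    // Per-thread partial sum of squares lives entirely in registers.
    float sumsq = 0.f;
    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x) {
        float v = row_in[i];
        sumsq += v * v;
    }

    // Intra-warp tree reduction through the register file.
    for (int offset = 16; offset > 0; offset >>= 1)
        sumsq += __shfl_down_sync(0xffffffffu, sumsq, offset);

    // Shared memory only combines per-warp partials and broadcasts the scale.
    __shared__ float warp_sums[32];
    int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;
    if (lane == 0) warp_sums[warp] = sumsq;
    __syncthreads();

    __shared__ float inv_rms;
    if (threadIdx.x == 0) {
        float total = 0.f;
        int n_warps = (blockDim.x + 31) / 32;
        for (int w = 0; w < n_warps; ++w) total += warp_sums[w];
        inv_rms = rsqrtf(total / hidden_size + eps);
    }
    __syncthreads();

    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x)
        row_out[i] = row_in[i] * inv_rms * weight[i];
}
```

Launched as `rms_norm_kernel<<<num_rows, 256>>>(x, w, y, hidden_size, 1e-6f)`, one block per row.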
Expert Specialization MoE Solution based on CUTLASS
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference
FlashMLA: Efficient Multi-head Latent Attention Kernels
Ring attention implementation with flash attention
🚀 Efficient implementations of state-of-the-art linear attention models
FlashInfer: Kernel Library for LLM Serving
A Kubernetes-native platform for orchestrating distributed LLM inference at scale
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
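As a rough illustration of the quantization step such kernels rely on, here is a hedged sketch of per-tile symmetric INT8 quantization of Q/K before the integer matmul; this is an assumed scheme for illustration, not SageAttention's actual code. One CUDA block quantizes one BLOCK_SIZE-element tile: it finds the tile's absmax, derives scale = absmax / 127, and writes INT8 values plus the per-tile scale used to rescale QK^T later.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

constexpr int BLOCK_SIZE = 256;  // elements per quantization tile (assumed)

__global__ void quantize_tiles_int8(const float *__restrict__ in,
                                    int8_t *__restrict__ out,
                                    float *__restrict__ scales) {
    const float *tile = in + (size_t)blockIdx.x * BLOCK_SIZE;

    // Block-wide absmax reduction in shared memory (blockDim.x == BLOCK_SIZE).
    __shared__ float red[BLOCK_SIZE];
    red[threadIdx.x] = fabsf(tile[threadIdx.x]);
    __syncthreads();
    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            red[threadIdx.x] = fmaxf(red[threadIdx.x], red[threadIdx.x + s]);
        __syncthreads();
    }

    __shared__ float scale;
    if (threadIdx.x == 0) {
        scale = red[0] > 0.f ? red[0] / 127.f : 1.f;
        scales[blockIdx.x] = scale;  // kept for dequantizing the QK^T result
    }
    __syncthreads();

    // Symmetric round-to-nearest quantization to INT8.
    int q = __float2int_rn(tile[threadIdx.x] / scale);
    out[(size_t)blockIdx.x * BLOCK_SIZE + threadIdx.x] =
        (int8_t)max(-127, min(127, q));
}
```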