TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python 12,885 2,100 Updated Feb 15, 2026

brendangregg / FlameGraph

Stack trace visualizer

Perl 19,256 2,088 Updated Oct 20, 2024

Infrasys-AI / AIInfra

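AIInfra (AI infrastructure) covers the AI system stack, from low-level hardware such as chips up to the upper-layer software stack that supports training and inference of large AI models.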

Jupyter Notebook 6,082 829 Updated Dec 22, 2025

siboehm / SGEMM_CUDA

Fast CUDA matrix multiplication from scratch

Cuda 1,052 162 Updated Sep 2, 2025

MoonshotAI / Kimi-K2

Kimi K2 is the large language model series developed by Moonshot AI team

10,374 776 Updated Jan 21, 2026

NickvisionApps / Parabolic

Download web video and audio

C# 5,074 233 Updated Feb 15, 2026

facebookresearch / pifuhd

High-Resolution 3D Human Digitization from A Single Image.

Python 9,759 1,480 Updated Aug 19, 2024

facebookresearch / detr

End-to-End Object Detection with Transformers

Python 15,110 2,663 Updated Mar 12, 2024

deepspeedai / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Python 41,624 4,722 Updated Feb 13, 2026

juraam / stable-diffusion-from-scratch

Implementation of Stable Diffusion with PyTorch

Jupyter Notebook 361 22 Updated Feb 22, 2025

pytorch / helion

A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.

Python 749 103 Updated Feb 15, 2026

MekkCyber / CutlassAcademy

A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS

252 12 Updated May 6, 2025

MekkCyber / TritonAcademy

A repository to unravel the language of GPUs, making their kernel conversations easy to understand

Python 201 8 Updated Jun 1, 2025

StarsfieldAI / R1-V

Witness the aha moment of VLM with less than $3.

Python 4,032 285 Updated May 19, 2025

LMCache / LMCache

Supercharge Your LLM with the Fastest KV Cache Layer

Python 6,888 899 Updated Feb 15, 2026

dropbox / gemlite

Fast low-bit matmul kernels in Triton

Python 430 31 Updated Feb 1, 2026

Rudrashis Gorai rudrashisgorai

Starred repositories

agents

dataflow

message-passing

Publish-subscribe pattern