Stars
Optimized primitives for collective multi-GPU communication
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
An FP8 flash attention implementation for the Ada architecture, built with the cutlass repository
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
A high-performance inference engine for LLMs, optimized for diverse AI accelerators.
[CVPR 2023] DepGraph: Towards Any Structural Pruning; LLMs, Vision Foundation Models, etc.
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
Examples for Recommenders - easy to train and deploy on accelerated infrastructure.
PyTorch domain library for recommendation systems
Benchmark code for the "Online normalizer calculation for softmax" paper (a minimal sketch of the online recurrence appears after this list)
A list of papers, docs, and code about model quantization. This repo aims to provide reference material for model quantization research and is continuously improved; PRs adding missing works are welcome.
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
A beginner-friendly tutorial on model compression; PDF download: https://github.com/datawhalechina/awesome-compression/releases
No-code multi-agent framework to build LLM Agents, workflows and applications with your data
Cost-efficient and pluggable infrastructure components for GenAI inference
📚A curated list of Awesome LLM/VLM Inference Papers with Code: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (see the smoothing sketch after this list)
This repository contains integer operators on GPUs for PyTorch.
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
SGLang is a fast serving framework for large language models and vision language models.
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
FlashMLA: Efficient Multi-head Latent Attention Kernels
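For the "Online normalizer calculation for softmax" entry above: the paper's trick is to fuse the usual two reduction passes (max, then sum of exponentials) into a single streaming recurrence. A minimal Python sketch of that recurrence (the function name and the final normalization pass are mine, not the benchmark repo's code):

```python
import math

def online_softmax(xs):
    """Compute softmax with a single pass to find the normalizer.
    Maintains the running max m and the running sum d of exp(x - m);
    whenever m grows, the old sum is rescaled by exp(m_old - m_new)."""
    m = float("-inf")  # running maximum
    d = 0.0            # running normalizer: sum of exp(x_i - m)
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

print(online_softmax([1.0, 2.0, 3.0]))  # matches the standard two-pass softmax
```

The same rescaling identity is what lets flash-attention-style kernels (including the FlashMLA entry above) process attention scores block by block without materializing a full row of scores.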
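And for the SmoothQuant entry: the core idea is to migrate quantization difficulty from activation outlier channels into the weights via a per-channel scale s_j = max|X_j|^α / max|W_j|^(1-α), which leaves the layer's output mathematically unchanged. A minimal NumPy sketch of that formula (variable names and the α = 0.5 default are illustrative, not the repo's API):

```python
import numpy as np

def smoothquant_scales(X, W, alpha=0.5):
    """Per-input-channel smoothing factors s_j = max|X_j|^a / max|W_j|^(1-a).
    X: (tokens, in_features) calibration activations; W: (in_features, out_features)."""
    act_max = np.abs(X).max(axis=0)  # per-channel activation ranges
    w_max = np.abs(W).max(axis=1)    # per-channel weight ranges
    return act_max ** alpha / w_max ** (1 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))
X[:, 3] *= 50.0                      # channel 3 is an activation outlier
W = rng.normal(size=(64, 32))

s = smoothquant_scales(X, W)
X_s, W_s = X / s, s[:, None] * W     # X @ W == (X / s) @ (diag(s) @ W)
assert np.allclose(X @ W, X_s @ W_s)
# X_s has a far flatter per-channel range than X, so simple per-tensor
# INT8 quantization of both X_s and W_s loses much less accuracy.
```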