Optimized primitives for collective multi-GPU communication

C++ 4,190 1,051 Updated Oct 18, 2025

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…

Python 2,864 533 Updated Oct 29, 2025

Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

C++ 3,779 286 Updated Oct 29, 2025

FP8 flash attention implemented on the Ada architecture using the cutlass library

Cuda 77 6 Updated Aug 12, 2024

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 4,176 410 Updated Oct 30, 2025

A high-performance inference engine for LLMs, optimized for diverse AI accelerators.

C++ 611 74 Updated Oct 29, 2025

[CVPR 2023] DepGraph: Towards Any Structural Pruning; LLMs, Vision Foundation Models, etc.

Python 3,166 365 Updated Sep 7, 2025

Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052

C++ 479 37 Updated Mar 15, 2024

Examples for Recommenders - easy to train and deploy on accelerated infrastructure.

Python 160 36 Updated Oct 30, 2025

Pytorch domain library for recommendation systems

Python 2,383 568 Updated Oct 30, 2025

Benchmark code for the "Online normalizer calculation for softmax" paper

Cuda 102 10 Updated Jul 27, 2018
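
The "Online normalizer calculation for softmax" paper above derives a single-pass recurrence that maintains the running maximum and the normalizer together. A minimal NumPy sketch of that recurrence, for illustration only; the function name and the two-pass reference below are my own, not taken from the benchmark repo:

```python
import numpy as np

def online_softmax(x):
    """One-pass softmax normalizer: keep a running max m and a running sum d
    of exp(x_i - m), rescaling d whenever a new maximum is found."""
    m, d = -np.inf, 0.0
    for xi in x:
        m_new = max(m, xi)
        d = d * np.exp(m - m_new) + np.exp(xi - m_new)   # rescale old sum, add new term
        m = m_new
    return np.exp(np.asarray(x) - m) / d

x = np.random.default_rng(0).normal(size=1024)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()    # standard two-pass softmax
assert np.allclose(online_softmax(x), ref)
```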

A list of papers, docs, and code about model quantization. This repo aims to provide information for model quantization research and is continuously improved. PRs adding relevant works are welcome (p…

2,249 227 Updated Mar 4, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.

Cuda 2,601 249 Updated Oct 28, 2025

A beginner's tutorial on model compression; PDF download: https://github.com/datawhalechina/awesome-compression/releases

334 37 Updated Jun 14, 2025

No-code multi-agent framework to build LLM Agents, workflows and applications with your data

Python 2,058 303 Updated Dec 11, 2024

Cost-efficient and pluggable Infrastructure components for GenAI inference

Go 4,319 477 Updated Oct 29, 2025

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

Python 4,643 317 Updated Aug 19, 2025

C++ 7 10 Updated Oct 22, 2025

vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization

Python 1,891 312 Updated Oct 28, 2025

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 3,324 277 Updated Jul 17, 2025
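
AWQ's key observation is that a small fraction of weight channels matters disproportionately, and scaling those channels according to activation statistics before weight-only low-bit quantization preserves accuracy. A toy NumPy sketch of that activation-aware scaling with a small grid search; the group size, bit width, search grid, and function names are illustrative choices, not the repo's actual implementation:

```python
import numpy as np

def quant_int4(W, group=32):
    """Symmetric INT4 quantize/dequantize over groups of `group` input channels."""
    Wq = np.empty_like(W)
    for g in range(0, W.shape[0], group):
        block = W[g:g + group]
        scale = np.abs(block).max() / 7.0 + 1e-12
        Wq[g:g + group] = np.round(block / scale).clip(-7, 7) * scale
    return Wq

def awq_like_scales(X, W, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Search per-input-channel scales s = mean|X|^alpha that minimize the output
    error of weight-only quantization of s*W (the activation-aware idea)."""
    act = np.abs(X).mean(axis=0) + 1e-8
    ref = X @ W
    best = (np.inf, np.ones(W.shape[0]))
    for alpha in grid:
        s = act ** alpha
        err = np.abs((X / s) @ quant_int4(W * s[:, None]) - ref).mean()
        if err < best[0]:
            best = (err, s)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))
X[:, 5] *= 30.0                                    # one salient activation channel
W = rng.normal(size=(64, 32))

err_scaled, s = awq_like_scales(X, W)
err_plain = np.abs(X @ quant_int4(W) - X @ W).mean()
print(err_plain, err_scaled)                       # scaling usually lowers the error
```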

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Python 1,536 183 Updated Jul 12, 2024
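
SmoothQuant migrates activation outliers into the weights through a per-input-channel scale so that both operands quantize well to INT8. A toy NumPy sketch of the scale migration; alpha = 0.5 and the fake-quantization helper are illustrative, not taken from the repo:

```python
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel smoothing: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    x_max = np.abs(X).max(axis=0)              # activation range per input channel
    w_max = np.abs(W).max(axis=1)              # weight range per input channel
    return (x_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)

def fake_int8(A):
    """Symmetric per-tensor INT8 quantize/dequantize (simulation only)."""
    scale = np.abs(A).max() / 127.0 + 1e-12
    return np.round(A / scale).clip(-127, 127) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))
X[:, 3] *= 50.0                                # inject an activation outlier channel
W = rng.normal(size=(64, 32))

s = smooth_scales(X, W)
X_s, W_s = X / s, W * s[:, None]               # X @ W == (X / s) @ (diag(s) @ W)
err_plain  = np.abs(fake_int8(X)   @ fake_int8(W)   - X @ W).mean()
err_smooth = np.abs(fake_int8(X_s) @ fake_int8(W_s) - X @ W).mean()
print(err_plain, err_smooth)                   # smoothing typically reduces the INT8 error
```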

This repository contains integer operators on GPUs for PyTorch.

Python 220 56 Updated Sep 29, 2023

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

Python 1,315 88 Updated Oct 24, 2025

Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"

Python 318 24 Updated Mar 4, 2025

Materials for learning SGLang

623 50 Updated Oct 26, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 19,479 3,207 Updated Oct 30, 2025

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ 908 110 Updated Oct 30, 2025

Examples for the TVM schedule API

Python 101 35 Updated Jun 12, 2023

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ 11,838 897 Updated Sep 30, 2025