
yanghailong-git's repositories

An external log connector example for LMCache (Python, 4 stars, updated Jun 13, 2025)

The driver for LMCache core to run in vLLM (Python, 59 stars, 32 forks, updated Feb 4, 2025)
Python, 164 stars, 24 forks, updated Jul 15, 2025

Perplexity GPU Kernels (C++, 553 stars, 75 forks, updated Nov 7, 2025)

[🔥 updating ...] AI-powered automated quantitative trading bot (fully local deployment); AI-powered Quantitative Investment Research Platform. 📃 Online docs: https://ufund-me.github.io/Qbot ✨ qbot-mini: https://github.com/Charmve/iQuant (Jupyter Notebook, 15,868 stars, 2,260 forks, updated Jul 6, 2025)

The simplest implementation of recent sparse-attention patterns for efficient LLM inference. (Jupyter Notebook, 92 stars, 6 forks, updated Jul 17, 2025)
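
To make the idea concrete, here is a minimal sketch (my own illustration, not code from the repository above) of one of the simplest sparse-attention patterns: a causal mask restricted to a sliding window plus a few always-visible "sink" tokens. The `window` and `num_sinks` values are arbitrary illustrative parameters.

```python
import numpy as np

def sparse_causal_mask(seq_len: int, window: int = 4, num_sinks: int = 2) -> np.ndarray:
    """Boolean mask: True where query position i may attend to key position j."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q               # never attend to future tokens
    local = (q - k) < window      # attend only to the most recent `window` tokens
    sink = k < num_sinks          # always attend to the first few "sink" tokens
    return causal & (local | sink)

print(sparse_causal_mask(8).astype(int))
```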

The evaluation framework for training-free sparse attention in LLMs (Python, 110 stars, 8 forks, updated Oct 13, 2025)

[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (Python, 276 stars, 17 forks, updated Aug 31, 2024)
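
As a rough sketch of the underlying draft-and-verify loop (plain greedy speculative decoding, not TriForce's hierarchical, KV-cache-aware variant; `draft_next` and `target_next` are hypothetical single-token predictors):

```python
def speculative_step(prefix, draft_next, target_next, k: int = 4):
    """One speculative decoding step: draft k tokens cheaply, verify with the target model."""
    # 1) draft k tokens with the cheap model
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) verify with the target model; keep the longest agreeing prefix
    accepted, ctx = [], list(prefix)
    for t in draft:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # 3) always gain at least one token from the target model
    accepted.append(target_next(ctx))
    return accepted
```

Each call yields at least one target-model token, and up to k + 1 tokens when the draft model agrees on the whole block.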

📰 Must-read papers on KV Cache Compression (constantly updated 🤗). (639 stars, 20 forks, updated Sep 30, 2025)

The Official Implementation of Ada-KV [NeurIPS 2025] (Python, 125 stars, 5 forks, updated Nov 26, 2025)

[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) (Python, 187 stars, 9 forks, updated Dec 30, 2025)

KV cache compression for high-throughput LLM inference (Python, 149 stars, 5 forks, updated Feb 5, 2025)
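
A minimal sketch of the general idea behind many of these compression methods, assuming a simple "accumulated attention" importance score per cached position (my own illustration, not any of the repositories' actual algorithms):

```python
import numpy as np

def evict_kv(keys, values, attn_scores, budget: int, keep_recent: int = 8):
    """Shrink a KV cache to `budget` entries.

    keys/values: (seq, dim) arrays; attn_scores: (seq,) accumulated attention
    mass per position. The most recent tokens are always kept.
    """
    assert budget > keep_recent
    seq = keys.shape[0]
    if seq <= budget:
        return keys, values
    recent = np.arange(seq - keep_recent, seq)
    older = np.arange(seq - keep_recent)
    # keep the highest-scoring older positions, filling the remaining budget
    top = older[np.argsort(attn_scores[older])[::-1][: budget - keep_recent]]
    keep = np.sort(np.concatenate([top, recent]))
    return keys[keep], values[keep]
```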

[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (Python, 279 stars, 21 forks, updated May 1, 2025)

[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models (Python, 1,166 stars, 188 forks, updated Oct 16, 2025)
Python, 35 stars, 10 forks, updated Oct 11, 2025

LLM KV cache compression made easy (Python, 819 stars, 94 forks, updated Jan 14, 2026)

Implementations of several LLM KV cache sparsity methods (Python, 41 stars, 2 forks, updated Jun 6, 2024)

📚 A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. 🎉 (Python, 4,917 stars, 336 forks, updated Jan 18, 2026)

Supercharge Your LLM with the Fastest KV Cache Layer (Python, 6,719 stars, 862 forks, updated Jan 18, 2026)

An annotated nano_vllm repository, with MiniCPM4 support added and the ability to register new models (Python, 143 stars, 27 forks, updated Aug 11, 2025)

A minimal cache manager for PagedAttention, on top of llama3 (Python, 130 stars, 11 forks, updated Aug 26, 2024)
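
To illustrate the kind of bookkeeping such a cache manager does, here is a hedged sketch under my own simplifying assumptions (class and method names are illustrative, not the repository's API): each sequence gets a block table that maps its token slots onto fixed-size physical KV blocks drawn from a free list.

```python
class BlockManager:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical KV blocks
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Return (physical_block, slot_offset) for the sequence's next token."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:          # current block is full (or first token)
            table.append(self.free.pop())     # allocate a new physical block
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

mgr = BlockManager(num_blocks=8, block_size=4)
print([mgr.append_token(seq_id=0) for _ in range(6)])  # spills into a second block after 4 tokens
```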

Flash Attention in ~100 lines of CUDA (forward pass only) (CUDA, 1,047 stars, 104 forks, updated Dec 30, 2024)
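
The core trick in that forward pass is the online softmax, which the CUDA code combines with tiling in shared memory. The numpy sketch below shows only the math for a single query vector (my illustration, not the repository's code):

```python
import numpy as np

def flash_attn_row(q, K, V, block: int = 64):
    """Attention output for one query vector q against keys K and values V."""
    m, l, acc = -np.inf, 0.0, np.zeros(V.shape[1])
    for s in range(0, K.shape[0], block):
        scores = K[s:s + block] @ q / np.sqrt(q.shape[0])
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)             # rescale the running accumulator
        p = np.exp(scores - m_new)
        l = l * scale + p.sum()               # running softmax denominator
        acc = acc * scale + p @ V[s:s + block]
        m = m_new                             # running max for numerical stability
    return acc / l
```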

A single-file educational implementation for understanding vLLM's core concepts and running LLM inference (Python, 33 stars, 4 forks, updated Jun 22, 2025)

Serving multiple LoRA-finetuned LLMs as one (Python, 1,135 stars, 57 forks, updated May 8, 2024)
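
The idea that makes this feasible is that each request only adds a low-rank delta on top of one shared base weight. A rough, hedged sketch with illustrative names (a dense Python loop rather than the fused kernels such systems actually use):

```python
import numpy as np

def batched_lora_linear(x, W, adapters, adapter_ids):
    """x: (batch, d_in); W: (d_in, d_out); adapters: id -> (A, B) low-rank pair."""
    y = x @ W                                  # shared base projection, applied once
    for i, aid in enumerate(adapter_ids):
        A, B = adapters[aid]                   # A: (d_in, r), B: (r, d_out)
        y[i] += x[i] @ A @ B                   # per-request low-rank delta
    return y
```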

Collection of kernels written in the Triton language (174 stars, 9 forks, updated Apr 5, 2025)

Survey on LLM Agents (published at COLING 2025) (461 stars, 18 forks, updated Oct 3, 2025)

Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions (CUDA, 515 stars, 88 forks, updated Sep 8, 2024)

Sample code for my CUDA programming book (CUDA, 1,978 stars, 381 forks, updated Dec 14, 2025)

Step-by-step optimization of CUDA SGEMM (CUDA, 424 stars, 55 forks, updated Mar 30, 2022)
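
As a language-agnostic illustration of the blocking those tutorials revolve around (in CUDA the reused tiles would live in shared memory; the block sizes here are arbitrary and this is not the repository's code):

```python
import numpy as np

def blocked_sgemm(A, B, BM: int = 32, BN: int = 32, BK: int = 32):
    """C = A @ B computed one BM x BN output tile at a time, reusing BK-wide tiles of A and B."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, BM):
        for j in range(0, N, BN):
            for k in range(0, K, BK):
                C[i:i + BM, j:j + BN] += A[i:i + BM, k:k + BK] @ B[k:k + BK, j:j + BN]
    return C
```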

How to optimize various algorithms in CUDA (CUDA, 2,771 stars, 250 forks, updated Jan 16, 2026)