Stars
An external log connector example for LMCache
[🔥 updating ...] AI-powered automated quantitative trading bot (fully local deployment); an AI-powered Quantitative Investment Research Platform. 📃 online docs: https://ufund-me.github.io/Qbot ✨ :news: qbot-mini: https://github.com/Charmve/iQuant
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
The evaluation framework for training-free sparse attention in LLMs
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
KV cache compression for high-throughput LLM inference
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
Implementations of several LLM KV cache sparsity methods
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Supercharge Your LLM with the Fastest KV Cache Layer
An annotated nano_vllm repository, with MiniCPM4 support and the ability to register new models
A minimal cache manager for PagedAttention, on top of llama3.
Flash Attention in ~100 lines of CUDA (forward pass only)
A single-file educational implementation for understanding vLLM's core concepts and running LLM inference.
Collection of kernels written in Triton language
Survey on LLM Agents (published at COLING 2025)
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
Sample codes for my CUDA programming book
Step-by-step optimization of CUDA SGEMM
How to optimize various algorithms in CUDA.