Stars
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates inference for any model.
Code and data for the Chain-of-Draft (CoD) paper
📰 Must-read papers and blogs on Speculative Decoding ⚡️
[TMLR 2025] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval (ICCV 2025 Highlight)
Official Implementation (PyTorch) of the paper "Representation Shift: Unifying Token Compression with FlashAttention", ICCV 2025
[ICCV 2025] SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
[CVPR 2025] CoDe: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
Official implementation for our paper "Scaling Diffusion Transformers Efficiently via μP".
Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x.
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
Official Implementation of "Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding" (ICML'25)
[ICML 2025] Reward-guided Speculative Decoding (RSD) for efficiency and effectiveness.
[ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
A library for efficient similarity search and clustering of dense vectors (a minimal usage sketch follows this list).
The official code for the paper: LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
VidKV: Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
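
Several entries above center on retrieval over dense vectors, and the Faiss entry describes a library built for exactly that. As a minimal, hedged sketch of that use case (not the API of any other repo in this list), the snippet below builds an exact L2 index and queries it; the dimensionality, random data, and the `faiss-cpu` install hint are illustrative assumptions.

```python
# Minimal Faiss sketch (illustrative only): exact L2 nearest-neighbor search
# over random dense vectors. Assumes `pip install faiss-cpu` and NumPy.
import numpy as np
import faiss

d = 64                                         # vector dimensionality (assumed)
rng = np.random.default_rng(0)
xb = rng.random((10_000, d), dtype="float32")  # database vectors to index
xq = rng.random((5, d), dtype="float32")       # query vectors

index = faiss.IndexFlatL2(d)                   # brute-force (exact) L2 index
index.add(xb)                                  # add database vectors
distances, ids = index.search(xq, 4)           # 4 nearest neighbors per query
print(ids)                                     # row i: neighbor ids for query xq[i]
```

For larger collections, Faiss also provides approximate indexes (e.g., IVF or HNSW variants) that trade a little recall for much lower search latency.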