
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation

Python 238 16 Updated Dec 16, 2024
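The idea behind LSH-based attention sampling can be sketched with sign-random-projection hashing: keys that fall into the same hash bucket as the query tend to have high cosine similarity and are good candidates for sparse attention. The snippet below is a minimal, hypothetical illustration of that principle, not MagicPIG's actual API.

```python
# Sketch of sign-random-projection LSH for sampling attention keys.
# All names and shapes here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, n_bits = 64, 1000, 8

keys = rng.standard_normal((n_keys, d))
query = rng.standard_normal(d)

# Random hyperplanes define the hash: each bit is the sign of a projection.
planes = rng.standard_normal((n_bits, d))

def lsh_code(x):
    """Pack the sign pattern of the projections into an integer bucket id."""
    bits = planes @ x > 0
    return int(np.packbits(bits, bitorder="little")[0])

# Keys landing in the query's bucket are likely near-neighbors in cosine
# similarity, so attention can be computed over this sampled subset only.
q_code = lsh_code(query)
candidates = [i for i in range(n_keys) if lsh_code(keys[i]) == q_code]
```

With 8 bits there are 256 buckets, so the candidate set is roughly two orders of magnitude smaller than the full key set on average.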

[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.

Cuda 758 65 Updated Oct 31, 2025

Code and data for the Chain-of-Draft (CoD) paper

Python 333 40 Updated Mar 11, 2025

📰 Must-read papers and blogs on Speculative Decoding ⚡️

1,004 52 Updated Oct 25, 2025

[TMLR 2025] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

675 33 Updated Oct 20, 2025

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Cuda 344 36 Updated Jul 10, 2025

Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval (ICCV 2025 Highlight)

Python 19 Updated Aug 1, 2025

Official implementation (PyTorch) of "Representation Shift: Unifying Token Compression with FlashAttention", ICCV 2025

23 2 Updated Jul 30, 2025

[ICCV 2025] SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

Python 76 2 Updated Oct 19, 2025

[CVPR 2025] CoDe: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

Python 107 4 Updated Sep 27, 2025

Official implementation for our paper "Scaling Diffusion Transformers Efficiently via μP".

Python 91 1 Updated Nov 2, 2025

[ICML 2025 Oral] Mixture of Lookup Experts

Python 53 2 Updated May 14, 2025

Code associated with the paper "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding"

Jupyter Notebook 206 15 Updated Feb 13, 2025

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 3,331 277 Updated Jul 17, 2025
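For context, the baseline that activation-aware schemes like AWQ improve on is plain round-to-nearest weight quantization. The sketch below shows that baseline only; it is not AWQ's algorithm, which additionally scales salient channels based on activation statistics before quantizing.

```python
# Naive symmetric per-channel round-to-nearest quantization (the baseline
# AWQ improves on). Illustration only; parameter names are assumptions.
import numpy as np

def quantize_rtn(w, bits=4):
    """Quantize each output channel (row) of w to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w = np.random.default_rng(0).standard_normal((4, 8)).astype("float32")
q, s = quantize_rtn(w)
err = np.abs(q * s - w).max()   # dequantize; worst-case error <= scale / 2
```

Because rounding error is bounded by half a quantization step per channel, the few channels with large activations dominate output error, which is exactly the observation AWQ exploits.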

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference by approximating attention with dynamic sparse computation, reducing pre-filling latency by up to 10x.

Python 1,146 63 Updated Sep 30, 2025

[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding

Python 130 9 Updated Dec 4, 2024

Official Implementation of "Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding" (ICML'25)

Python 6 2 Updated Jul 11, 2025

[ICML 2025] Reward-guided Speculative Decoding (RSD) for efficiency and effectiveness.

Python 50 6 Updated May 2, 2025

Benchmark TTFT (time to first token), TPOT (time per output token), tokens/s, and speedup

Python 7 Updated Jun 2, 2025

[ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Python 56 4 Updated Feb 21, 2025

A library for efficient similarity search and clustering of dense vectors.

C++ 37,789 4,096 Updated Nov 5, 2025
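The task FAISS accelerates is k-nearest-neighbor search over dense vectors. A minimal brute-force NumPy version of exact L2 search is sketched below; FAISS provides the same semantics (e.g. via `IndexFlatL2`) plus approximate indexes that scale to billions of vectors.

```python
# Exact k-NN over dense vectors with brute-force squared-L2 distances.
# This is a sketch of the task, not FAISS's implementation.
import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((500, 32)).astype("float32")   # database vectors
xq = xb[:3] + 0.01                                      # near-duplicate queries

def knn_l2(queries, database, k):
    # Pairwise squared L2 distances: (n_queries, n_database).
    d2 = ((queries[:, None, :] - database[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]                # k nearest indices

I = knn_l2(xq, xb, k=4)
print(I[:, 0])  # each query's nearest neighbor is its source vector: [0 1 2]
```

The brute-force version is O(n·d) per query; FAISS's contribution is index structures (IVF, HNSW, PQ) that cut this cost by orders of magnitude with controllable recall.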

The official code for the paper: LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Python 111 1 Updated Jul 1, 2025

This is the project for 'USG'.

CSS 29 Updated Apr 7, 2025

[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Python 1,143 184 Updated Oct 16, 2025

VidKV: Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models

Python 22 Updated Mar 26, 2025