Stars
MiroThinker is an open-source deep research agent optimized for research and prediction tasks. It achieves a 60.2% Avg@8 score on the challenging GAIA benchmark.
Boosting GPU utilization for LLM serving via dynamic spatial-temporal prefill & decode orchestration
Prototype of MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
A throughput-oriented high-performance serving framework for LLMs
Offline optimization of your disaggregated Dynamo graph
FlashInfer: Kernel Library for LLM Serving
Repository for MLCommons Chakra schema and tools
ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantization, MXFP4, NVFP4, GGUF, and adaptive schemes.
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
[WWW 2026] 🛠️ DeepAgent: A General Reasoning Agent with Scalable Toolsets
Repo for SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting (ISCA 2025)
SGLang is a high-performance serving framework for large language models and multimodal models.
A machine learning accelerator core designed for energy-efficient AI at the edge.
Parametric floating-point unit with support for standard RISC-V formats and operations as well as transprecision formats.
H.265 decoder written in Verilog, verified on a Xilinx ZYNQ7035
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
Quantize transformers to any learned arbitrary 4-bit numeric format
Quantization library for PyTorch. Supports low-precision and mixed-precision quantization, with hardware implementation through TVM.
MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
This repository contains the training code for ParetoQ, introduced in our work "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization".
Official repo of "Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs"
[ICLR 2025] Systematic Outliers in Large Language Models.
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.