MiroThinker is an open source deep research agent optimized for research and prediction. It achieves a 60.2% Avg@8 score on the challenging GAIA benchmark.

Python 5,354 381 Updated Jan 16, 2026

Boosting GPU utilization for LLM serving via dynamic spatial-temporal prefill & decode orchestration

Python 29 3 Updated Jan 8, 2026

Prototype of MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

Jupyter Notebook 26 2 Updated Apr 4, 2025

A throughput-oriented high-performance serving framework for LLMs

Jupyter Notebook 938 46 Updated Oct 29, 2025

Offline optimization of your disaggregated Dynamo graph

Python 153 50 Updated Jan 19, 2026

FlashInfer: Kernel Library for LLM Serving

Python 4,698 656 Updated Jan 19, 2026

Repository for MLCommons Chakra schema and tools

Python 152 62 Updated Oct 23, 2025

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

C++ 512 179 Updated Jan 3, 2026

LLM & RL

Jupyter Notebook 267 27 Updated Oct 24, 2025

🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantization, MXFP4, NVFP4, GGUF, and adaptive schemes.

Python 816 72 Updated Jan 19, 2026

Tile-Based Runtime for Ultra-Low-Latency LLM Inference

Python 533 26 Updated Dec 23, 2025

[WWW 2026] 🛠️ DeepAgent: A General Reasoning Agent with Scalable Toolsets

Python 941 122 Updated Jan 14, 2026

Repo for SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting (ISCA25)

C++ 70 9 Updated Apr 25, 2025

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 22,545 4,095 Updated Jan 19, 2026

A machine learning accelerator core designed for energy-efficient AI at the edge.

Emacs Lisp 1,996 223 Updated Jan 16, 2026

Parametric floating-point unit with support for standard RISC-V formats and operations as well as transprecision formats.

SystemVerilog 558 146 Updated Oct 21, 2025

H.265 decoder written in Verilog, verified on a Xilinx ZYNQ7035

SystemVerilog 81 41 Updated Sep 2, 2021

[ICLR 2024 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

Python 885 72 Updated Nov 26, 2025

Python 93 13 Updated Nov 16, 2025

QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning

C++ 161 17 Updated Nov 11, 2025

Quantize transformers to any learned arbitrary 4-bit numeric format

Python 51 5 Updated Jul 10, 2025

Quantization library for PyTorch. Supports low-precision and mixed-precision quantization, with hardware implementation through TVM.

Python 453 84 Updated May 15, 2023

MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models

Cuda 25 3 Updated Dec 20, 2025

LLM Inference with Microscaling Format

Python 34 4 Updated Nov 12, 2024

[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule

Python 428 23 Updated Sep 15, 2025

This repository contains the training code for ParetoQ, introduced in our work "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization"

Python 117 8 Updated Oct 15, 2025

Official repo of "Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs"

Python 109 9 Updated Jan 12, 2026

[ICLR 2025] Systematic Outliers in Large Language Models.

Python 9 2 Updated Feb 11, 2025

X-SET's artifacts for MICRO'58

Python 4 Updated Sep 1, 2025

[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda 3,065 318 Updated Jan 17, 2026