Stars
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
A scheduling framework for multitasking over diverse XPUs, including GPUs, NPUs, ASICs, and FPGAs
FalconFS is a high-performance distributed file system (DFS) designed for AI workloads.
A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP)
High-Throughput, Cost-Effective Billion-Scale Vector Search with a Single GPU [to appear in SIGMOD'26]
A low-latency, billion-scale, and updatable graph-based vector store on SSD.
PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System]
A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
FlashInfer: Kernel Library for LLM Serving
This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
Hackable and optimized Transformers building blocks, supporting a composable construction.
A Next.js web application that integrates AI capabilities with draw.io diagrams. This app allows you to create, modify, and enhance diagrams through natural language commands and AI-assisted visual…
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
PegaFlow is a high-performance KV cache offloading solution for vLLM v1 on single-node multi-GPU setups.
Supercharge Your LLM with the Fastest KV Cache Layer
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Tool for safe ergonomic Rust/C++ interop driven from existing C++ headers
Doing simple retrieval from LLMs at various context lengths to measure accuracy
Running large language models on a single GPU for throughput-oriented scenarios.
NEO is an LLM inference engine built to relieve the GPU memory crisis through CPU offloading
A Datacenter Scale Distributed Inference Serving Framework
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.