Stars
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.
The fastest, lightest, and easiest-to-integrate AI gateway on the market. Fully open-sourced.
Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.
Tongyi Deep Research, the Leading Open-source Deep Research Agent
A large-scale simulation framework for LLM inference
TokenSim is a tool for simulating the behavior of large language models (LLMs) in a distributed environment.
Unified KV Cache Compression Methods for Auto-Regressive Models
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Userspace eBPF runtime for Observability, Network, GPU & General Extensions Framework
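A curated collection of open-source Chinese large language models, focusing on smaller models that can be privately deployed and are inexpensive to train, covering base models, domain-specific fine-tuning and applications, datasets, and tutorials.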
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.
This repository is dedicated to storing the different optimizations for MPI collective IO operations that I have performed.
Exploring Dynamic Load Balancing Algorithms for Block-Structured Mesh-and-Particle Simulations in AMReX
A powerful coding agent toolkit providing semantic retrieval and editing capabilities (MCP server & other integrations)
An MLLM inference engine for academic research
magic-trace collects and displays high-resolution traces of what a process is doing
Supercharge Your LLM with the Fastest KV Cache Layer
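MCP server: using eBPF to trace your kernel
This project shares the technical principles behind large models along with hands-on experience (LLM engineering and deploying LLM applications in practice).
A simple technical explainer project focused on interesting, cutting-edge technical concepts and principles. Each article aims to be readable within 5 minutes.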
A next-gen FOSS self-hosted unified zero trust secure access platform that can operate as a remote access VPN, a ZTNA platform, API/AI/MCP gateway, a PaaS, an ngrok-alternative and a homelab infras…
A C++20 library for fast serialization, deserialization and validation using reflection. Supports JSON, Avro, BSON, Cap'n Proto, CBOR, CSV, flexbuffers, msgpack, parquet, TOML, UBJSON, XML, YAML / …