Starred repositories
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, the attention is computed approximately with dynamic sparsity, which reduces inference latency by up to 10x for pre-filli…
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications. (A minimal usage sketch follows at the end of this list.)
Large Language Model (LLM) Systems Paper List
NEO is an LLM inference engine built to relieve the GPU memory crisis through CPU offloading
InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
PyTorch library for cost-effective, fast and easy serving of MoE models.
Instruction-aware cooperative TLB and cache replacement policies - Repository for the iTP and xPTP replacement policies for the second-level TLB and the L2 cache.
llama3 implementation, one matrix multiplication at a time
Mainly collects knowledge and interview questions relevant to large language model (LLM) algorithm (application) engineers
My learning notes and code for ML SYS.
Allow torch tensor memory to be released and resumed later
The official implementation of Self-Play Preference Optimization (SPPO)
Ariadne is a new compressed swap scheme for mobile devices that reduces application relaunch latency and CPU usage while increasing the number of live applications for enhanced user experience. Des…
Systematic and comprehensive benchmarks for LLM systems.
Supercharge Your LLM with the Fastest KV Cache Layer
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
Distribute and run LLMs with a single file.
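
To illustrate the NVTX entry above, here is a minimal sketch (my own example, not taken from the NVTX repository) of annotating a code range and an event marker with the NVTX C API so they appear in Nsight Systems timelines. The function name process_batch is hypothetical; the sketch assumes the header-only NVTX3 headers are on the include path.

// Minimal NVTX annotation sketch (assumptions noted above).
#include <nvtx3/nvToolsExt.h>

static void process_batch(void) {
    nvtxRangePushA("process_batch");   // open a named range on this thread
    /* ... work to be profiled ... */
    nvtxRangePop();                    // close the range
}

int main(void) {
    nvtxMarkA("startup complete");     // instantaneous event marker
    process_batch();
    return 0;
}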