Starred repositories


[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, attention is computed with approximate, dynamic sparsity, which reduces inference latency by up to 10x for pre-filli…

Python 1,144 62 Updated Sep 30, 2025

Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond

Python 556 48 Updated Oct 27, 2025

The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications.

C++ 466 62 Updated Oct 29, 2025
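As context for the NVTX entry above, here is a minimal sketch of how the C API's range annotations are typically used (assuming the NVTX v3 headers are on the include path and a profiler such as Nsight Systems is attached to display the ranges; the hypothetical `train_step` function is only for illustration).

```cpp
// Minimal NVTX sketch (assumption: NVTX v3 headers available; a profiler
// such as Nsight Systems consumes and displays the annotations).
#include <nvtx3/nvToolsExt.h>

static void train_step() {
    nvtxRangePushA("train_step");   // open a named range on the profiler timeline
    // ... host work / kernel launches to be annotated ...
    nvtxRangePop();                 // close the most recently opened range
}

int main() {
    nvtxMarkA("startup complete");  // a single point event on the timeline
    train_step();
    return 0;
}
```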

Large Language Model (LLM) Systems Paper List

1,574 83 Updated Oct 18, 2025

NEO is an LLM inference engine built to ease the GPU memory crisis through CPU offloading

Python 67 16 Updated Jun 16, 2025

AI and Memory Wall

220 27 Updated Mar 23, 2024

InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

C 10 1 Updated Mar 30, 2025

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

Python 61,501 7,437 Updated Oct 30, 2025

PyTorch library for cost-effective, fast and easy serving of MoE models.

Python 253 18 Updated Oct 15, 2025

The Official Implementation of Ada-KV [NeurIPS 2025]

Python 108 3 Updated Sep 25, 2025

Instruction-aware cooperative TLB and cache replacement policies - repository for the iTP and xPTP replacement policies for the second-level TLB and L2 cache.

C++ 1 1 Updated Sep 23, 2025

llama3 implementation one matrix multiplication at a time

Jupyter Notebook 15,182 1,291 Updated May 23, 2024

DL & ML & RS

Python 571 46 Updated Nov 23, 2024

Mainly records knowledge and interview questions relevant to large language model (LLM) algorithm (application) engineers

HTML 10,639 1,085 Updated Apr 30, 2025

Implement llama3 from scratch (Chinese-language version)

Jupyter Notebook 982 97 Updated Jun 12, 2024
Cuda 43 7 Updated Jan 13, 2022

System for AI Education Resource.

Python 4,156 520 Updated Oct 25, 2024

My learning notes/codes for ML SYS.

Python 4,030 243 Updated Oct 6, 2025

Allow torch tensor memory to be released and resumed later

Python 159 26 Updated Oct 31, 2025

The official implementation of Self-Play Preference Optimization (SPPO)

Python 581 47 Updated Jan 23, 2025

Ariadne is a new compressed swap scheme for mobile devices that reduces application relaunch latency and CPU usage while increasing the number of live applications for enhanced user experience. Des…

C 7 1 Updated Feb 19, 2025

Systematic and comprehensive benchmarks for LLM systems.

Python 38 17 Updated Sep 30, 2025

Supercharge Your LLM with the Fastest KV Cache Layer

Python 5,738 680 Updated Nov 1, 2025

Fast Multimodal LLM on Mobile Devices

C++ 1,150 139 Updated Nov 1, 2025

[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration

Python 238 24 Updated Nov 18, 2024

Nano vLLM

Python 7,263 935 Updated Aug 31, 2025
C++ 40 8 Updated Sep 19, 2023

Distribute and run LLMs with a single file.

C++ 23,286 1,233 Updated Nov 1, 2025