Starred repositories
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, the attention is computed approximately with dynamic sparsity, which reduces inference latency by up to 10x for pre-filli…
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications. (A minimal usage sketch follows at the end of this list.)
Large Language Model (LLM) Systems Paper List
NEO is an LLM inference engine built to relieve the GPU memory crisis through CPU offloading
InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
PyTorch library for cost-effective, fast and easy serving of MoE models.
Instruction-aware cooperative TLB and cache replacement policies - Repository for the iTP and xPTP replacement policies for the second-level TLB and the L2 cache.
llama3 implementation, one matrix multiplication at a time
Mainly collects knowledge and interview questions relevant to large language model (LLM) algorithm (application) engineers
My learning notes and code for ML SYS.
Allow torch tensor memory to be released and resumed later
The official implementation of Self-Play Preference Optimization (SPPO)
Ariadne is a new compressed swap scheme for mobile devices that reduces application relaunch latency and CPU usage while increasing the number of live applications for enhanced user experience. Des…
Systematic and comprehensive benchmarks for LLM systems.
Supercharge Your LLM with the Fastest KV Cache Layer
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
Distribute and run LLMs with a single file.
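
To illustrate the NVTX entry above, here is a minimal sketch (my own example, not taken from the NVTX repository) of annotating a code range and an event marker with the NVTX C API so they appear in Nsight Systems timelines. The function name process_batch is hypothetical; the sketch assumes the header-only NVTX3 headers are on the include path.

// Minimal NVTX annotation sketch (assumptions noted above).
#include <nvtx3/nvToolsExt.h>

static void process_batch(void) {
    nvtxRangePushA("process_batch");   // open a named range on this thread
    /* ... work to be profiled ... */
    nvtxRangePop();                    // close the range
}

int main(void) {
    nvtxMarkA("startup complete");     // instantaneous event marker
    process_batch();
    return 0;
}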