Stars
ccint - a C/C++ interpreter, built on top of the Clang and LLVM compiler infrastructure.
A novel, highly optimized CUDA implementation of the k-means algorithm.
Cache library and distributed caching server. Memcached compatible.
This project aims to collect the latest "call for reviewers" links from various top CS/ML/AI conferences/journals.
The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek-R1, Qwen3, Gemma 3, TTS 2x faster with 70% less VRAM.
Train transformer language models with reinforcement learning.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Fully open data curation for reasoning models
Provides a practical interactive interface for LLMs such as GPT/GLM, with special optimizations for the paper reading/polishing/writing experience. Modular design with support for custom shortcut buttons & function plugins, code analysis & self-translation for Python and C++ projects, PDF/LaTeX paper translation & summarization, parallel queries across multiple LLM models, and local models such as chatglm3. Integrates 通义千问, deepseekcoder, 讯飞星火, 文心一言, llama2, rwkv, claude2, m…
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
DeepEP: an efficient expert-parallel communication library
The Torch-MLIR project aims to provide first-class support from the PyTorch ecosystem to the MLIR ecosystem.
FlashMLA: Efficient Multi-head Latent Attention Kernels
Efficient Deep Learning Systems course materials (HSE, YSDA)
This is a list of useful libraries and resources for CUDA development.
A self-paced tutorial for CUDA high-performance programming.
[ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl
Building blocks for foundation models.
CUDA Templates and Python DSLs for High-Performance Linear Algebra
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
LLVM (Low Level Virtual Machine) Guide. Learn all about the compiler infrastructure, which is designed for compile-time, link-time, run-time, and "idle-time" optimization of programs. Originally im…