Stars
Simple, scalable AI model deployment on GPU clusters
🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSy…
A curated list of awesome Recommender System resources (Books, Conferences, Researchers, Papers, GitHub Repositories, Useful Sites, YouTube Videos)
Sources for the book "Machine Learning in Production"
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
⚡ Serverless Framework – Effortlessly build apps that auto-scale, incur zero costs when idle, and require minimal maintenance using AWS Lambda and other managed cloud services.
TransMLA: Multi-Head Latent Attention Is All You Need (NeurIPS 2025 Spotlight)
The simplest, fastest repository for training/finetuning medium-sized GPTs.
A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
Machine Learning Engineering Open Book
Achieve state-of-the-art inference performance with modern accelerators on Kubernetes
An AI-powered task-management system you can drop into Cursor, Lovable, Windsurf, Roo, and others.
📄 Configuration files that enhance Cursor AI editor experience with custom rules and behaviors
Postman & Chatbot Arena for inference benchmarking.
Analyze computation-communication overlap in V3/R1.
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Run the latest LLMs and VLMs across GPU, NPU, and CPU, with PC (Python/C++) and mobile (Android & iOS) support; runs OpenAI gpt-oss, Granite4, Qwen3VL, Gemma 3n, and more.
Dria SDK is for building and executing synthetic data generation pipelines on the Dria Knowledge Network.
Maid is a cross-platform Flutter app for interfacing with GGUF / llama.cpp models locally, and with Ollama and OpenAI models remotely.
Jan is an open source alternative to ChatGPT that runs 100% offline on your computer.
[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
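The AWQ entry above names a concrete technique: activation-aware weight quantization, which protects the weight channels that see the largest activations by scaling them up before quantization and folding the scale back afterward. A minimal NumPy sketch of that core idea (not AWQ's actual implementation — the `alpha` exponent and the per-channel scaling rule here are illustrative assumptions; real AWQ searches for the scales and uses grouped low-bit quantization):

```python
import numpy as np

def quantize_int4(w):
    # Symmetric per-tensor 4-bit quantization: round to integers in [-8, 7],
    # then dequantize so the result is directly comparable to the original.
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def awq_style_quantize(W, act_mag, alpha=0.5):
    # W: weight matrix of shape (d_in, d_out), used as y = x @ W.
    # act_mag: per-input-channel activation magnitudes, shape (d_in,).
    # Scale salient input channels up before quantizing so they get finer
    # resolution, then fold the inverse scale back into the weights,
    # leaving the dequantized layer mathematically equivalent up to
    # quantization error.
    s = act_mag ** alpha                  # per-channel scale (illustrative rule)
    Wq = quantize_int4(W * s[:, None])    # quantize the scaled weights
    return Wq / s[:, None]                # fold the scale back
```

With skewed activation statistics (a few channels dominating), quantizing the scaled weights typically preserves `x @ W` better than quantizing `W` directly, which is the intuition behind the "activation-aware" part of the name.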
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…