Stars
Ready-to-use ML training recipes to help you build and deploy models on Baseten.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
FlashInfer: Kernel Library for LLM Serving
[ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
A throughput-oriented high-performance serving framework for LLMs
PyTorch native quantization and sparsity for training and inference
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
Entropy Based Sampling and Parallel CoT Decoding
Any model. Any hardware. Zero compromise. Built with @ziglang / @openxla / MLIR / @bazelbuild
📰 Must-read papers and blogs on Speculative Decoding ⚡️
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
A guidance language for controlling large language models.
Tips and tricks for working with Large Language Models like OpenAI's GPT-4.
Port of OpenAI's Whisper model in C/C++
A collection of libraries to optimise AI model performance
AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
RAFT contains fundamental, widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
An R package implementing the UMAP dimensionality reduction method.
A library for efficient similarity search and clustering of dense vectors.
CUDA-accelerated GIS and spatiotemporal algorithms
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
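A minimal sketch of the composable-transformation idea the JAX entry describes, assuming `jax` is installed: `grad` turns a plain numerical function into its gradient function, and `jit` compiles the result.

```python
import jax
import jax.numpy as jnp

# A plain NumPy-style function: f(x) = sum(x^2).
def f(x):
    return jnp.sum(x ** 2)

# Compose two transformations: differentiate, then JIT-compile.
grad_f = jax.jit(jax.grad(f))

x = jnp.array([1.0, 2.0, 3.0])
print(grad_f(x))  # gradient of sum(x^2) is 2*x -> [2. 4. 6.]
```

Because transformations compose, the same pattern extends to `jax.vmap` for vectorizing over a batch dimension without rewriting `f`.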