Stars
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
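A minimal sketch of running one of these models through Hugging Face transformers; the repo id "openai/gpt-oss-20b" and the pipeline settings are assumptions, not taken from the entry above.

```python
# Hedged sketch: load gpt-oss-20b for text generation via transformers,
# assuming the weights are published under the repo id "openai/gpt-oss-20b".
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # pick a supported half precision automatically
    device_map="auto",    # shard across available GPUs
)
out = generator("Explain expert parallelism in one sentence.", max_new_tokens=64)
print(out[0]["generated_text"])
```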
Simple high-throughput inference library
DeepEP: an efficient expert-parallel communication library
React Native module for AppZung CodePush
Efficient 2:4 sparse training algorithms and implementations
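Not this repo's API: a plain-PyTorch sketch of the 2:4 pattern itself, where every contiguous group of four weights keeps only its two largest-magnitude entries (the pattern NVIDIA sparse tensor cores accelerate).

```python
# Conceptual 2:4 magnitude pruning: zero the 2 smallest-magnitude weights in
# every group of 4. prune_2_4 is a hypothetical helper, not the repo's API.
import torch

def prune_2_4(w: torch.Tensor) -> torch.Tensor:
    groups = w.reshape(-1, 4)
    _, drop = groups.abs().topk(2, dim=1, largest=False)  # 2 smallest per group
    return groups.scatter(1, drop, 0.0).reshape(w.shape)

w = torch.randn(128, 128)
w_24 = prune_2_4(w)
assert ((w_24.reshape(-1, 4) != 0).sum(dim=1) <= 2).all()  # at most 2 of every 4 survive
```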
A library for unit scaling in PyTorch
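Not the library's API: a conceptual sketch of the unit-scaling idea, scaling each op so activations keep roughly unit variance (here, a linear op scaled by 1/sqrt(fan_in)).

```python
# Conceptual unit scaling: divide a matmul by sqrt(fan_in) so unit-variance
# inputs and weights yield roughly unit-variance outputs, no init tricks needed.
import math
import torch

def unit_scaled_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    fan_in = weight.shape[1]
    return (x @ weight.t()) / math.sqrt(fan_in)

x = torch.randn(4096, 1024)  # unit-variance activations
w = torch.randn(1024, 1024)  # unit-variance weights
print(unit_scaled_linear(x, w).std())  # ~1.0
```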
Tile primitives for speedy kernels
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…
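A minimal sketch of Transformer Engine's PyTorch API, assuming an FP8-capable GPU (Hopper/Ada/Blackwell); the layer sizes and recipe settings here are illustrative, not prescriptive.

```python
# Run a te.Linear forward pass under FP8 autocast with a delayed-scaling recipe.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)
layer = te.Linear(768, 3072).cuda()
x = torch.randn(32, 768, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # GEMM executes in FP8, accumulating at higher precision
```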
Hackable and optimized Transformers building blocks, supporting a composable construction.
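A small usage sketch of the library's memory-efficient attention op; tensors follow the (batch, seq_len, heads, head_dim) layout, and the sizes are arbitrary.

```python
# Fused attention via xformers.ops.memory_efficient_attention: never
# materializes the full (seq_len x seq_len) attention matrix.
import torch
from xformers.ops import memory_efficient_attention

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

out = memory_efficient_attention(q, k, v)  # same shape as q
```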
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at medium batch sizes of up to 16-32 tokens.
High-speed GEMV kernels achieving up to a 2.7x speedup over the PyTorch baseline.
Zero Bubble Pipeline Parallelism
PyTorch code and models for the DINOv2 self-supervised learning method.
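Loading a pretrained DINOv2 backbone via torch.hub, as the repo's README describes; dinov2_vits14 is one of the published variants, and the random tensor stands in for a normalized image.

```python
# Extract a global image embedding with the ViT-S/14 DINOv2 backbone.
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

img = torch.randn(1, 3, 224, 224)  # stand-in for a normalized RGB image
with torch.no_grad():
    emb = model(img)  # (1, 384) CLS-token embedding
print(emb.shape)
```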
neonsecret / xformers
Forked from yocabon/xformers. Hackable and optimized Transformers building blocks, supporting a composable construction.
AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
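A minimal Diffusers sketch: load a pretrained text-to-image pipeline and sample one image. The SDXL checkpoint id is just an example; any Diffusers-format checkpoint works with DiffusionPipeline.from_pretrained.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```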
Submit stacked diffs to GitHub on the command line
Development repository for the Triton language and compiler
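The canonical introductory Triton kernel, an elementwise vector add: each program instance handles one BLOCK_SIZE-wide slice of the tensors, with out-of-range lanes masked off.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```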
Transformer-related optimizations, including BERT and GPT
Sparsity-aware deep learning inference runtime for CPUs
Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
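Not SparseML's recipe format: a plain-PyTorch illustration of the kind of transform such recipes apply, one-shot L1 magnitude pruning of a linear layer to 90% sparsity.

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(512, 512)
prune.l1_unstructured(layer, name="weight", amount=0.9)  # mask the smallest 90%
prune.remove(layer, "weight")                            # bake the mask in

print(f"weight sparsity: {(layer.weight == 0).float().mean().item():.1%}")
```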