Stars
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
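A minimal sketch of typical tiktoken usage (the encoding name "cl100k_base" is from tiktoken's docs; the sample string is arbitrary):

```python
import tiktoken

# Load a named BPE encoding; "cl100k_base" is the encoding used by
# several OpenAI chat models.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("hello world")   # text -> list of token ids
text = enc.decode(tokens)            # token ids -> text
assert text == "hello world"
```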
A high-throughput and memory-efficient inference and serving engine for LLMs
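A short sketch of vLLM's offline inference API (the model id and prompt are arbitrary examples):

```python
from vllm import LLM, SamplingParams

# Load a model and generate with nucleus sampling; any Hugging Face
# causal-LM id should work here.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```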
cudnn_frontend provides a C++ wrapper for the cuDNN backend API and samples showing how to use it
Fast and memory-efficient exact attention
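A sketch of the fused kernel's Python entry point, per the flash-attention docs (shapes and sizes are arbitrary examples):

```python
import torch
from flash_attn import flash_attn_func

# q, k, v are (batch, seqlen, num_heads, head_dim) and must be fp16 or
# bf16 tensors on a CUDA device for the fused kernel.
q = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)  # same shape as q
```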
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models for text, vision, audio, and multimodal tasks, for both inference and training.
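The pipeline API is the usual one-liner entry point (the task name and input are arbitrary examples):

```python
from transformers import pipeline

# pipeline() hides tokenization, model loading, and decoding behind
# a single callable.
classifier = pipeline("sentiment-analysis")
print(classifier("This library makes inference one line of code."))
```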
Lean Algorithmic Trading Engine by QuantConnect (Python, C#)
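A sketch of a minimal Lean algorithm in Python, following the QCAlgorithm pattern from QuantConnect's docs (ticker, dates, and cash are arbitrary examples):

```python
from AlgorithmImports import *

class BuyAndHold(QCAlgorithm):
    def Initialize(self):
        self.SetStartDate(2022, 1, 1)
        self.SetCash(100000)
        self.AddEquity("SPY", Resolution.Daily)

    def OnData(self, data):
        if not self.Portfolio.Invested:
            self.SetHoldings("SPY", 1.0)  # allocate 100% of the portfolio to SPY
```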
The Triton backend for TensorRT.
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
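A sketch of the classic ONNX-to-engine build flow from TensorRT's Python API (details shift between major versions; the file path is a placeholder):

```python
import tensorrt as trt

# Parse an ONNX model and build a serialized engine from it.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
engine = builder.build_serialized_network(network, config)  # bytes to save or deserialize
```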
Triton backend that enables pre-processing, post-processing, and other logic to be implemented in Python.
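A skeleton of a Python-backend model following the TritonPythonModel interface from the backend's docs; the tensor names "INPUT0"/"OUTPUT0" and the echo logic are placeholder assumptions, and the file only runs inside a Triton model repository:

```python
# model.py
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy())  # echo the input back
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses
```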
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
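A minimal sketch of the tokenizers API (the model id and sample sentence are arbitrary examples):

```python
from tokenizers import Tokenizer

# Load a pretrained tokenizer from the Hugging Face Hub.
tok = Tokenizer.from_pretrained("bert-base-uncased")

enc = tok.encode("Fast tokenizers, written in Rust.")
print(enc.tokens)  # subword strings
print(enc.ids)     # corresponding vocabulary ids
```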
📄 Configuration files that enhance Cursor AI editor experience with custom rules and behaviors
Enhanced MCP server for interactive user feedback and command execution in AI-assisted development, featuring dual interface support (Web UI and Desktop Application) with intelligent environment de…
Rules and Knowledge to work better with agents such as Claude Code or Cursor
A Lucene codec for vector search and clustering on the GPU
XLA Launcher is a high-performance, lightweight C++ library designed to provide a simple interface for loading and executing computation graphs represented in the StableHLO format.
Dockerfile templates for creating RAPIDS Docker Images
Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ
A distributed high-performance dynamic lookup-table-style Embedding designed for recommendation, search, CTR and advertising systems. Supports GPU, CPU, remote distributed KV (such as Redis), SSD, a…
A std::execution-style runtime context and high-performance RPC transport built on OpenUCX, including CUDA/ROCM/... devices with RDMA.
A minimal GPU design in Verilog to learn how GPUs work from the ground up
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…
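Recent releases document a high-level LLM API; a sketch assuming tensorrt_llm.LLM and SamplingParams as described there, with an arbitrary model id:

```python
from tensorrt_llm import LLM, SamplingParams

# The LLM class builds or loads a TensorRT engine for the model on first use.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(temperature=0.8, max_tokens=64)

for out in llm.generate(["Explain KV caching in one sentence."], params):
    print(out.outputs[0].text)
```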
Implementation of bitonic mergesort for the GPU
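For illustration, the same compare-and-swap network in pure Python (the repo itself targets the GPU); every stage is data-independent, which is why bitonic sort parallelizes so well. The list length must be a power of two:

```python
def bitonic_sort(a):
    """In-place bitonic mergesort; len(a) must be a power of two."""
    n = len(a)
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:       # compare distance within each merge stage
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

assert bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4]) == [1, 2, 3, 4, 5, 6, 7, 8]
```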
Distributed transactional key-value database, originally created to complement TiDB
A Datacenter Scale Distributed Inference Serving Framework
NeMo Retriever extraction is a scalable, performance-oriented document content and metadata extraction microservice. NeMo Retriever extraction uses specialized NVIDIA NIM microservices to find, con…
A probabilistic programming library for Bayesian deep learning and generative models, built on TensorFlow
SGLang is a high-performance serving framework for large language models and multimodal models.
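SGLang exposes an OpenAI-compatible HTTP API; a sketch assuming a server was launched locally with `python -m sglang.launch_server --model-path <model>` (port 30000 is its documented default):

```python
import openai

# Point the standard OpenAI client at the local SGLang server.
client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="default",  # SGLang serves the launched model under this name
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)
```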
Milvus is a high-performance, cloud-native vector database built for scalable vector ANN search
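A sketch of the MilvusClient quickstart flow from pymilvus (collection name, dimension, and the local Milvus Lite file are arbitrary examples):

```python
import random
from pymilvus import MilvusClient

# MilvusClient can target a local Milvus Lite file or a server URI.
client = MilvusClient("milvus_demo.db")
client.create_collection(collection_name="demo", dimension=8)

vecs = [{"id": i, "vector": [random.random() for _ in range(8)]} for i in range(100)]
client.insert(collection_name="demo", data=vecs)

hits = client.search(
    collection_name="demo",
    data=[[random.random() for _ in range(8)]],  # one query vector
    limit=3,
)
print(hits)
```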