
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 362 31 Updated Oct 16, 2025

gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI

Python 18,991 1,886 Updated Oct 23, 2025

A Quirky Assortment of CuTe Kernels

Python 640 53 Updated Oct 11, 2025

Simple high-throughput inference library

Python 149 10 Updated May 14, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,667 969 Updated Oct 29, 2025

React Native module for AppZung CodePush

C 101 14 Updated Oct 14, 2025

Tool to help tax declaration of RSUs (French Tax Code)

Python 8 1 Updated Sep 1, 2025

Efficient 2:4 sparse training algorithms and implementations

Python 57 1 Updated Dec 8, 2024

Video stabilization using gyroscope data

Rust 7,897 363 Updated Oct 29, 2025

A library for unit scaling in PyTorch

Jupyter Notebook 132 12 Updated Jul 11, 2025

Tile primitives for speedy kernels

Cuda 2,851 190 Updated Oct 24, 2025

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…

Python 2,860 533 Updated Oct 28, 2025
Python 1,470 216 Updated Jun 26, 2025

Hackable and optimized Transformers building blocks, supporting a composable construction.

Python 10,042 729 Updated Oct 28, 2025

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python 922 77 Updated Sep 4, 2024

High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.

Cuda 116 7 Updated Jul 13, 2024

Mamba SSM architecture

Python 16,241 1,475 Updated Oct 10, 2025

Zero Bubble Pipeline Parallelism

Python 433 32 Updated May 7, 2025

David Attenborough narrates your life

Python 4,408 546 Updated Oct 2, 2025

PyTorch code and models for the DINOv2 self-supervised learning method.

Jupyter Notebook 11,792 1,109 Updated Aug 17, 2025

Hackable and optimized Transformers building blocks, supporting a composable construction.

Python 8 Updated Feb 1, 2023

AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

Python 4,688 383 Updated Oct 27, 2025

🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.

Python 31,424 6,452 Updated Oct 29, 2025

Submit stacked diffs to GitHub on the command line

Python 842 69 Updated Oct 2, 2025

Development repository for the Triton language and compiler

MLIR 17,394 2,346 Updated Oct 29, 2025

AIStore: scalable storage for AI applications

Go 1,613 221 Updated Oct 28, 2025

Transformer related optimization, including BERT, GPT

C++ 6,335 921 Updated Mar 27, 2024

Sparsity-aware deep learning inference runtime for CPUs

Python 3,156 192 Updated Jun 2, 2025

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models

Python 2,146 157 Updated Jun 2, 2025