35 stars written in Cuda

LLM training in simple, raw C/CUDA

Cuda 28,968 3,404 Updated Jun 26, 2025

Instant neural graphics primitives: lightning fast NeRF and more

Cuda 17,285 2,054 Updated Feb 2, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 9,000 1,108 Updated Feb 9, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 6,197 825 Updated Feb 25, 2026

Tile primitives for speedy kernels

Cuda 3,193 246 Updated Feb 24, 2026

This package contains the original 2012 AlexNet code.

Cuda 2,834 366 Updated Mar 12, 2025

How to optimize some algorithms in CUDA.

Cuda 2,825 257 Updated Feb 15, 2026

Sample codes for my CUDA programming book

Cuda 2,010 383 Updated Dec 14, 2025

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 1,084 110 Updated Dec 30, 2024

Fast CUDA matrix multiplication from scratch

Cuda 1,065 162 Updated Sep 2, 2025

Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)

Cuda 942 349 Updated Aug 19, 2024

Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch

Cuda 942 179 Updated Jul 19, 2023

A CUDA implementation of SIFT for NVIDIA GPUs (1.2 ms on a GTX 1060)

Cuda 935 298 Updated Oct 1, 2025

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Cuda 869 147 Updated Sep 26, 2025

GPU accelerated decision optimization

Cuda 721 127 Updated Feb 27, 2026

UNet diffusion model in pure CUDA

Cuda 657 31 Updated Jun 28, 2024

CUDA Learning guide

Cuda 531 63 Updated Jun 20, 2024
Cuda 454 79 Updated Dec 18, 2025

Learnings and programs related to CUDA

Cuda 433 20 Updated Jun 29, 2025

Step-by-step optimization of CUDA SGEMM

Cuda 433 57 Updated Mar 30, 2022

EGGROLL in C, integer-first training

Cuda 343 31 Updated Dec 22, 2025

CUDA Matrix Multiplication Optimization

Cuda 261 24 Updated Jul 19, 2024

Parrot is a C++ library for fused array operations using CUDA/Thrust. It provides efficient GPU-accelerated operations with lazy evaluation semantics, allowing for chaining of operations without un…

Cuda 248 15 Updated Jan 29, 2026

A CUDA reimplementation of the line/plane odometry of LIO-SAM. A point cloud hash map (inspired by iVox of Faster-LIO) on GPU is used to accelerate 5-neighbour KNN search. Run on Jetson Orin NX 8GB.

Cuda 181 23 Updated Aug 24, 2025

Some CUDA example code with READMEs.

Cuda 179 27 Updated Nov 11, 2025

High-Performance FP32 GEMM on CUDA devices

Cuda 117 8 Updated Jan 21, 2025