🤖FFPA: extends FlashAttention-2 with Split-D, achieving ~O(1) SRAM complexity for large headdim; 1.8x~3x↑🎉 vs SDPA EA.
Updated Feb 13, 2026 · CUDA
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak⚡️ performance.
General Matrix Multiplication using NVIDIA Tensor Cores
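As a flavor of what these Tensor Core GEMM projects involve, here is a minimal, hedged sketch of a half-precision GEMM tile using CUDA's WMMA API. It is illustrative only (the kernel name `wmma_hgemm_16x16` and the assumed memory layouts are not taken from any repo above); it assumes an sm_70+ GPU, dimensions that are multiples of 16, A stored row-major (M×K), and B stored column-major (K×N, i.e. `B[k + n*K]`).

```cuda
#include <mma.h>
using namespace nvcuda;

// Hypothetical minimal HGEMM kernel: each warp computes one 16x16 output
// tile of C = A * B, accumulating in float for accuracy.
// A: half, row-major, M x K.  B: half, column-major, K x N.  C: float, row-major.
__global__ void wmma_hgemm_16x16(const half *A, const half *B, float *C,
                                 int M, int N, int K) {
    // Map each warp to one 16x16 tile of C.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;
    int cRow = warpM * 16, cCol = warpN * 16;
    if (cRow >= M || cCol >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // March along K in 16-wide steps; mma_sync runs on the Tensor Cores.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + cRow * K + k, K);
        wmma::load_matrix_sync(b_frag, B + cCol * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + cRow * N + cCol, c_frag, N, wmma::mem_row_major);
}
```

Production kernels in the repos above add shared-memory staging, swizzling to avoid bank conflicts, and multi-tile warps on top of this basic fragment pipeline.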
A benchmarking framework for correlators of FX telescope arrays
High-performance CUDA kernels with step-by-step optimization, profiling, and analysis. A growing collection of GPU solutions demonstrating warp-level tuning, memory optimization, and Tensor Core acceleration.
The MNIST classification problem is a fundamental machine learning task: recognizing handwritten digits (0–9) from a dataset of 70,000 grayscale images (28x28 pixels each). It serves as a benchmark for evaluating machine learning models, particularly neural networks.
INT8 Sparse Tensor Core GEMM for PyTorch — built for Windows