- San Francisco, CA
- 23:07 (UTC -08:00)
- https://miguelvf.com
- in/miguel-vf
- @kaweees
Stars
Instant neural graphics primitives: lightning fast NeRF and more
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Tile primitives for speedy kernels
This package contains the original 2012 AlexNet code.
How to optimize some algorithms in CUDA.
Sample codes for my CUDA programming book
Flash Attention in ~100 lines of CUDA (forward pass only)
Fast CUDA matrix multiplication from scratch
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch
A CUDA implementation of SIFT for NVIDIA GPUs (1.2 ms on a GTX 1060)
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
Step-by-step optimization of CUDA SGEMM
CUDA Matrix Multiplication Optimization
Parrot is a C++ library for fused array operations using CUDA/Thrust. It provides efficient GPU-accelerated operations with lazy evaluation semantics, allowing for chaining of operations without un…
A CUDA reimplementation of the line/plane odometry of LIO-SAM. A point cloud hash map (inspired by iVox of Faster-LIO) on GPU is used to accelerate 5-neighbour KNN search. Run on Jetson Orin NX 8GB.