Multi-word modular arithmetic (MoMA) decomposes large bit-width integer arithmetic into machine-word-based operations. We implemented MoMA as a rewrite system in SPIRAL, as an extension of the SPIRAL NTTX package. For more details, please view the full paper here.
*not required if you plan to use only pre-generated code
Install SPIRAL following the instructions in this guide and ensure it passes all basic tests on your platform by running `make test` in the build directory.
In the `spiral-software/namespaces/packages/` subdirectory of your SPIRAL installation tree, clone this repository and its dependencies using the following commands:

```bash
git clone https://github.com/Naifeng/moma.git nttx
git clone -b develop https://github.com/Naifeng/spiral-package-fftx.git fftx
git clone -b develop https://github.com/Naifeng/spiral-package-simt.git simt
```
The `packages` subdirectory should then look like this:

```
packages
└────fftx
└────simt
└────nttx (this repository)
     │   README.md (this file)
     │   ...
```
Once SPIRAL is installed and linked with this repository, run the following command in the `nttx/cuda/cuda-test/` directory to benchmark MoMA-based number theoretic transforms (NTTs) for a specific input bit-width:

```bash
bash ./benchmark.sh -d <input_bit_width>
```

Supported input bit-widths are 128, 256, 384, and 768.
If you are running the code in this repository on one of the three platforms evaluated in the paper (H100, V100, RTX 4090), you can enable platform-specific performance tuning with the `-p` option. For example, to benchmark MoMA-based NTTs on an H100, run:

```bash
bash ./benchmark.sh -d 128 -p h100
```

Run `bash ./benchmark.sh -h` for a detailed description of the available options. You can also check the correctness and profiling information of the NTT code by inspecting `cuda/cuda-test/log.txt` during the benchmarking process. Note that `log.txt` is overwritten for each batch size.
Sample output:

The output will be displayed in the terminal window after the benchmark script finishes running. The following is a sample output obtained on an H100 by running the command above:

```
================================================================================
                                    Results
================================================================================
NTT size [log2]    Runtime per butterfly [ns]    Runtime per NTT [ns]
 8                 0.010                         11
 9                 0.011                         25
10                 0.012                         60
11                 0.023                         256
12                 0.015                         372
13                 0.017                         880
14                 0.014                         1623
15                 0.013                         3220
16                 0.013                         6806
17                 0.013                         14645
18                 0.014                         31897
19                 0.014                         68754
20                 0.015                         155018
21                 0.015                         324608
22                 0.014                         663467
```
For optimal NTT performance, platform-specific tuning can be performed within the code generation pass. We have incorporated known performance tuning information for V100, H100, and RTX 4090 into the SPIRAL code generation pass as a knowledge base. Users can enable this tuning using the `-p` option of `benchmark.sh`. For other platforms, the `-p` option can be omitted, which defaults to `-p general`.
In the future, we plan to automate the construction of the knowledge base, allowing SPIRAL to profile the target platform and automatically derive the performance tuning information. This can be achieved by integrating MoMA with the SPIRAL profiler. Notably, even without platform-specific tuning, MoMA-based NTT achieves near-ASIC performance on commodity GPUs.
Some manual editing is required to benchmark BLAS operations, in `cuda/cuda-test/`:

- Set `benchmark_blas` to `true` in `benchmark.sh`.
- Set `blas_op` to one of `vvadd`, `vvsub`, `vvmul`, and `axpy` in `benchmark.sh`.
You can now benchmark any supported BLAS operation by running the following command in the `cuda/cuda-test` directory:

```bash
bash ./benchmark.sh -d <input_bit_width>
```

Supported input bit-widths are 128, 256, 512, and 1024.
For example, to benchmark a 512-bit `axpy` operation using the above command, set `blas_op` to `axpy` in `benchmark.sh` and run:

```bash
bash ./benchmark.sh -d 512
```

Remember to set `benchmark_blas` back to `false` in `benchmark.sh` before benchmarking NTTs.
Sample output:

The sample output on a V100 is as follows:

```
================================================================================
                                    Results
================================================================================
Vector size [log2]    Runtime per element [ns]    Runtime per operation [ns]
 8                    0.805                       206
```
This repository includes select pieces of SPIRAL-generated code, allowing you to evaluate MoMA-based NTTs and BLAS operations even without SPIRAL installed. Please refer to the `README` files in `cuda/cuda-test/ntt_code` and `cuda/cuda-test/blas_code` for more information.
```
nttx (this repository)
│   README.md (this file)
│   opts.gi
│   ...
│
└────examples
│    │   mp-cuda-batch.g
│    │   mp-py.g
│
└────cuda
     │   init.g
     │   ...
     │
     └────cuda-test
          │   benchmark.sh
          │   ...
          │
          └────blas_code
          └────ntt_code
```
Distributed under the SPIRAL License. For more details, please refer to the `LICENSE` file.