Multi-word modular arithmetic (MoMA) decomposes large bit-width integer arithmetic into machine-word-based operations. We implemented MoMA as a rewrite system using SPIRAL as an extension of the SPIRAL NTTX package. For more details, please view the full paper here.
*not required if you plan to use only pre-generated code
Install SPIRAL following the instructions in this guide and ensure it passes all basic tests on your platform by running make test in the build directory.
In the spiral-software/namespaces/packages/ subdirectory of your SPIRAL installation tree, clone this repository and its dependencies using the following commands:
git clone https://github.com/Naifeng/moma.git nttx
git clone -b develop https://github.com/Naifeng/spiral-package-fftx.git fftx
git clone -b develop https://github.com/Naifeng/spiral-package-simt.git simt
The packages subdirectory should then look like this:
packages
└────fftx
└────simt
└────nttx (this repository)
│ README.md (this file)
│ ...
Once SPIRAL is installed and linked with this repository, run the following command in the nttx/cuda/cuda-test/ directory to benchmark MoMA-based number theoretic transforms (NTTs) for a specific input bit-width:
bash ./benchmark.sh -d <input_bit_width>
Supported input bit-widths are 128, 256, 384, and 768.
If you are running the code in this repository on one of the three platforms detailed in the paper (H100, V100, RTX 4090), you can enable platform-specific performance tuning by using the -p option. For example, to benchmark MoMA-based NTTs on an H100, run:
bash ./benchmark.sh -d 128 -p h100
You can type bash ./benchmark.sh -h for a detailed description of the available options. You can also check the correctness and profiling information of the NTT code by inspecting cuda/cuda-test/log.txt during the benchmarking process. Note that for each batch size, the log.txt file will overwrite the previous one.
Sample output
-
The output will be displayed in the terminal window after the benchmark script finishes running. The following is a sample output obtained on an H100 by running the command above:
================================================================================
Results
================================================================================
NTT size [log2] Runtime per butterfly [ns] Runtime per NTT [ns]
8 0.010 11
9 0.011 25
10 0.012 60
11 0.023 256
12 0.015 372
13 0.017 880
14 0.014 1623
15 0.013 3220
16 0.013 6806
17 0.013 14645
18 0.014 31897
19 0.014 68754
20 0.015 155018
21 0.015 324608
22 0.014 663467
For optimal NTT performance, platform-specific tuning can be performed within the code generation pass. We have incorporated known performance tuning information for V100, H100, and RTX 4090 into the SPIRAL code generation pass as a knowledge base. Users can enable this tuning using the -p option with benchmark.sh. For other platforms, the -p option can be omitted, which will default to -p general.
In the future, we plan to automate the construction of the knowledge base, allowing SPIRAL to profile the target platform and automatically derive the performance tuning information. This can be achieved by integrating MoMA with the SPIRAL profiler. It is noteworthy that even without platform-specific tuning, MoMA-based NTT achieves near-ASIC performance on commodity GPUs.
Some manual editing is required to benchmark BLAS operations, in cuda/cuda-test/:
- Set
benchmark_blastotrueinbenchmark.sh. - Set
blas_opto one ofvvadd,vvsub,vvmul, andaxpyinbenchmark.sh.
You can now benchmark any supported BLAS operation by running the following command in the cuda/cuda-test directory:
bash ./benchmark.sh -d <input_bit_width>
Supported input bit-widths are 128, 256, 512, and 1024.
For example, to benchmark a 512-bit axpy operation using the above command, set blas_op to axpy in benchmark.sh and run:
bash ./benchmark.sh -d 512
Remember to set benchmark_blas to false in benchmark.sh before benchmarking NTTs.
Sample output
-
The sample output on a V100 is as follows:
================================================================================
Results
================================================================================
Vector size [log2] Runtime per element [ns] Runtime per operation [ns]
8 0.805 206
This repository includes select pieces of SPIRAL-generated code, allowing you to evaluate MoMA-based NTTs and BLAS operations even without SPIRAL installed. Please refer to the README files in cuda/cuda-test/ntt_code and cuda/cuda-test/blas_code for more information.
nttx (this repository)
│ README.md (this file)
│ opts.gi
│ ...
│
└────examples
│ │ mp-cuda-batch.g
│ │ mp-py.g
│
└────cuda
│ init.g
│ ...
│
└────cuda-test
│ benchmark.sh
│ ...
│
└────blas_code
└────ntt_code
Distributed under the SPIRAL License. For more details, please refer to the LICENSE file.