Multi-word modular arithmetic (MoMA) decomposes large bit-width integer arithmetic into machine-word-based operations. We implemented MoMA as a rewrite system in SPIRAL, as an extension of the SPIRAL NTTX package. For more details, please view the full paper here.
*not required if you plan to use only pre-generated code
Install SPIRAL following the instructions in this guide and ensure it passes all basic tests on your platform by running `make test` in the build directory.
In the `spiral-software/namespaces/packages/` subdirectory of your SPIRAL installation tree, clone this repository and its dependencies using the following commands:

```bash
git clone https://github.com/Naifeng/moma.git nttx
git clone -b develop https://github.com/Naifeng/spiral-package-fftx.git fftx
git clone -b develop https://github.com/Naifeng/spiral-package-simt.git simt
```
The `packages` subdirectory should then look like this:

```
packages
└────fftx
└────simt
└────nttx (this repository)
     │   README.md (this file)
     │   ...
```
Once SPIRAL is installed and linked with this repository, run the following command in the `nttx/cuda/cuda-test/` directory to benchmark MoMA-based number theoretic transforms (NTTs) for a specific input bit-width:

```bash
bash ./benchmark.sh -d <input_bit_width>
```

Supported input bit-widths are 128, 256, 384, and 768.
If you are running the code in this repository on one of the three platforms evaluated in the paper (H100, V100, RTX 4090), you can enable platform-specific performance tuning with the `-p` option. For example, to benchmark MoMA-based NTTs on an H100, run:

```bash
bash ./benchmark.sh -d 128 -p h100
```

Run `bash ./benchmark.sh -h` for a detailed description of the available options. You can also check the correctness and profiling information of the NTT code by inspecting `cuda/cuda-test/log.txt` during the benchmarking process. Note that `log.txt` is overwritten for each batch size.
Sample output:

The output will be displayed in the terminal window after the benchmark script finishes running. The following is a sample output obtained on an H100 by running the command above:

```
================================================================================
                                    Results
================================================================================
NTT size [log2]    Runtime per butterfly [ns]    Runtime per NTT [ns]
 8                 0.010                         11
 9                 0.011                         25
10                 0.012                         60
11                 0.023                         256
12                 0.015                         372
13                 0.017                         880
14                 0.014                         1623
15                 0.013                         3220
16                 0.013                         6806
17                 0.013                         14645
18                 0.014                         31897
19                 0.014                         68754
20                 0.015                         155018
21                 0.015                         324608
22                 0.014                         663467
```
For optimal NTT performance, platform-specific tuning can be performed within the code generation pass. We have incorporated known performance tuning information for V100, H100, and RTX 4090 into the SPIRAL code generation pass as a knowledge base. Users can enable this tuning using the `-p` option of `benchmark.sh`. For other platforms, the `-p` option can be omitted, which defaults to `-p general`.
In the future, we plan to automate the construction of the knowledge base, allowing SPIRAL to profile the target platform and automatically derive the performance tuning information. This can be achieved by integrating MoMA with the SPIRAL profiler. Notably, even without platform-specific tuning, MoMA-based NTT achieves near-ASIC performance on commodity GPUs.
Some manual editing is required to benchmark BLAS operations, in `cuda/cuda-test/`:

- Set `benchmark_blas` to `true` in `benchmark.sh`.
- Set `blas_op` to one of `vvadd`, `vvsub`, `vvmul`, and `axpy` in `benchmark.sh`.
You can now benchmark any supported BLAS operation by running the following command in the `cuda/cuda-test` directory:

```bash
bash ./benchmark.sh -d <input_bit_width>
```

Supported input bit-widths are 128, 256, 512, and 1024.
For example, to benchmark a 512-bit `axpy` operation using the above command, set `blas_op` to `axpy` in `benchmark.sh` and run:

```bash
bash ./benchmark.sh -d 512
```

Remember to set `benchmark_blas` back to `false` in `benchmark.sh` before benchmarking NTTs.
Sample output:

The sample output on a V100 is as follows:

```
================================================================================
                                    Results
================================================================================
Vector size [log2]    Runtime per element [ns]    Runtime per operation [ns]
 8                    0.805                       206
```
This repository includes select pieces of SPIRAL-generated code, allowing you to evaluate MoMA-based NTTs and BLAS operations even without SPIRAL installed. Please refer to the `README` files in `cuda/cuda-test/ntt_code` and `cuda/cuda-test/blas_code` for more information.
```
nttx (this repository)
│   README.md (this file)
│   opts.gi
│   ...
│
└────examples
│    │   mp-cuda-batch.g
│    │   mp-py.g
│
└────cuda
     │   init.g
     │   ...
     │
     └────cuda-test
          │   benchmark.sh
          │   ...
          │
          └────blas_code
          └────ntt_code
```
Distributed under the SPIRAL License. For more details, please refer to the `LICENSE` file.