A project under the Research Experiences in Computational Science, Engineering, and Mathematics (RECSEM) program, 2018. We designed and implemented a mixed-precision algorithm that uses tensor cores to accelerate the fast Fourier transform (FFT).
Contains some C code implementing a matrix data structure and a test program. It was used to simplify operations in earlier versions and was removed from our final implementation.
Contains most of the project code. For the majority of programs, compile with the command: nvcc -o output filename -lcufft -lcublas. Make sure you are using CUDA 9.2 or above. Some programs may need modification if the header files have been moved.
- The files in alternative/ are experimental programs that try to optimize the implementation by using the static splitting method or by calling the MAGMA library. These trials were not adopted in our final implementation.
- cublas_gemm_batchstride_test.cu is a program that tests cublasGemmStridedBatchedEx; we wrote it before adding the call to our main implementation.
- cuFFT_2D_test.cu tests 2D FFT in single and half precision using the cuFFT library.
- cuFFT_32_16_compare_test.cu tests 1D FFT in single and half precision using the cuFFT library.
- Filenames starting with debug_ indicate temporary files for debugging.
- fft2_fp32_multiple.cu is an attempt to implement radix-2 FFT. It is not used in the final implementation.
- The series of files starting with fft4_ is a step-by-step implementation of the base case of radix-4 FFT. We first implemented the single-precision, single-vector version (fft_fp32_vector.cu), then added batch execution support (fft4_fp32_multiple.cu). After that we included the splitting and performed the multiplication in half precision (fft4_fp16.cu). Later we added fft4_improved_fp16.cu, which uses unified memory.
- The files with names starting with gfft_ implement FFT for general input (the complete FFT instead of only the base case). The first runnable implementation is gfft_using_fft4.cu. We then merged it with the splitting and fft4 files to get gfft_combine.cu. Several optimizations followed: we parallelized accumulation, splitting, and twiddle multiplication, and we wrote a customized transpose kernel. We also wrote a version using the MAGMA transpose API, but did not use it in the final evaluation. GPU_run_test.cu and CPU_run_test_32.cu are the final versions to be tested with nvprof.
- helper/ contains matrix, vector, and constant definitions. It also contains two programs demonstrating how to use the matrix and vector types. However, we eliminated the matrix and vector wrappers in the final implementation.
- minimal_example_cublasgemmstridebatched.cu is a complete and minimal example that shows the batch size limit of the cublasgemmstridebatched function. We found that, under our settings, cublasgemmstridebatched returns an error if the batch size is greater than 65534.
- There are two files in nvidia_helper/, both copied from NVIDIA sample code. checkCudaErrors.h defines checkCudaErrors(), which is used when calling most CUDA functions. helper_string.h allows us to read command-line parameters.
- The files with names starting with split_ are a step-by-step implementation of the dynamic splitting algorithm. The pseudocode can be found in our extended abstract. The order is: split_one_number.cu -> split_one_vector.cu -> split_one_vector_using_vector.cu -> split_multiple_vectors_using_matrix.cu. These are CPU code and differ from our final GPU kernel implementation.
- test_performance/ contains programs that compare the speed and accuracy of the implementations. Most names are self-explanatory. The results presented in our paper were generated by improved_gfft_and_cufft.cu. The results in the 2D FFT poster are from 2d_gfft_and_cufft.cu.
- timing_my_transpose.cu is a complete gfft program that also calculates and prints the time spent in our self-implemented transpose kernel at each level.
- util/ contains header files written by us that are included in the testing programs. my_include_combined includes general header files such as stdio and the CUDA runtime. my_include is an earlier version that also includes the splitting header, the fft4 header, etc. Most of the code is converted from the .cu implementations. 32_gfft, which implements the Cooley–Tukey algorithm without splitting, is used to evaluate the benefits and overhead of splitting. The other files are explained by their names.
Contains code that generates the experimental results for the paper. checkCudaErrors.h and helper_string.h contain helper functions. cufft_fail_to_speedup/ contains experiment code and results showing that cuFFT does not attain the same level of acceleration as cuBLAS.
Contains MATLAB simulation programs. We wrote MATLAB code to make sure the algorithm was correct before starting the CUDA implementation. Most of the code was provided by Dr. D'Azevedo.
Contains accuracy and speed test results. Filenames starting with nvprof_ indicate files generated by the nvprof profiler. The names generally indicate the corresponding programs in ./CUDA, and each file lists its test configuration before the performance results.
event_timing_my_transpose.txt records the time spent in our self-implemented transpose kernel. The other two files, whose names end with _wrong, contain faulty results because the error values were not initialized to 0.
Contains useful CUDA code provided by NVIDIA. cuda_fp16 is the definition of the half-precision type. helper_cuda contains useful helper functions and is included in our implementation programs. The other directories are copied from the CUDA samples, which helped us get familiar with CUDA programming.
Contains some results printed by programs in ./CUDA. The filename usually indicates the corresponding program. We compared these results with the output of MATLAB's fft to verify the correctness and accuracy of the computation.