Accelerating Fast Fourier Transform

A project under the Research Experiences in Computational Science, Engineering, and Mathematics (RECSEM) program, 2018. We designed and implemented a mixed-precision algorithm that uses GPU tensor cores to accelerate the fast Fourier transform (FFT).

./C

Contains C code implementing a matrix data structure and a test program. It was used to simplify operations in earlier versions and was removed from our final implementation.

./CUDA

Contains most of the project code. Most programs can be compiled with the command: nvcc -o output filename -lcufft -lcublas. Make sure you are using CUDA 9.2 or above. Some programs may need their include paths modified if the header files have been moved.

The files in alternative/ are experimental programs that try to optimize the implementation by using the static splitting method or by calling the MAGMA library. These trials were not adopted in our final implementation.
cublas_gemm_batchstride_test.cu tests cublasGemmStridedBatchedEx; we wrote it before adding that call to our main implementation.
cuFFT_2D_test.cu tests 2D FFT in single and half precision using the cuFFT library.
cuFFT_32_16_compare_test.cu tests 1D FFT in single and half precision using the cuFFT library.
Files whose names start with debug_ are temporary files for debugging.
fft2_fp32_multiple.cu is an attempt to implement radix-2 FFT. It is not used in the final implementation.
The files starting with fft4_ are step-by-step implementations of the base case of radix-4 FFT. We first implemented the single-precision, single-vector version (fft_fp32_vector.cu), then added batch execution support (fft4_fp32_multiple.cu). After that, we included the splitting and performed the multiplication in half precision (fft4_fp16.cu). Later we added fft4_improved_fp16.cu, which uses unified memory.
The files whose names start with gfft_ implement FFT for general input (the complete FFT rather than only the base case). The first runnable implementation is gfft_using_fft4.cu. We then merged it with the splitting and fft4 files to get gfft_combine.cu. Several optimizations followed: we parallelized the accumulation, splitting, and twiddle multiplication, and we wrote a customized transpose kernel. We also wrote a version that uses the MAGMA transpose API, but did not use it in the final evaluation. GPU_run_test.cu and CPU_run_test_32.cu are the final versions to be tested with nvprof.
helper/ contains the matrix, vector, and constant definitions. It also contains two programs demonstrating how to use the matrix and vector types. However, we eliminated the matrix and vector wrappers in the final implementation.
minimal_example_cublasgemmstridebatched.cu is a complete and minimal example that shows the batch size limit of the cuBLAS strided batched GEMM function. We found that, under our settings, the call returns an error if the batch size is greater than 65534.
There are two files in nvidia_helper/, both copied from NVIDIA sample code. checkCudaErrors.h defines checkCudaErrors(), which we use when calling most CUDA functions. helper_string.h lets us read command-line parameters.
The files whose names start with split_ are step-by-step implementations of the dynamic splitting algorithm. The pseudocode can be found in our extended abstract. The order is: split_one_number.cu -> split_one_vector.cu -> split_one_vector_using_vector.cu -> split_multiple_vectors_using_matrix.cu. They are CPU code and differ from our final GPU kernel implementation.
test_performance/ contains programs that compare the speed and accuracy of the implementations. Most names are self-explanatory. The results presented in our paper were generated by improved_gfft_and_cufft.cu. The results in the 2D FFT poster are from 2d_gfft_and_cufft.cu.
timing_my_transpose.cu is a complete gfft program that also measures and prints the time spent in our own transpose kernel at each level.
util/ contains header files written by us that are included in the testing programs. my_include_combined includes general header files such as stdio and the CUDA runtime. my_include is an earlier version that also includes the splitting header, the fft4 header, etc. Most of the code is converted from the .cu implementations. 32_gfft, which implements the Cooley–Tukey algorithm without splitting, is used to evaluate the benefits and overhead of splitting. The other files are explained by their names.
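
The radix-4 base case that the fft4_ files build up can be illustrated on the CPU. The following is a minimal sketch in plain C, not the project's code: it assumes the base case is a 4-point DFT applied to every vector in a batch, written as a multiplication by the 4x4 DFT matrix (whose entries are all ±1 or ±i). The name fft4_batch and the contiguous batch layout are hypothetical, and the sketch stays in single precision with no splitting and no tensor cores.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper (not from the repository): apply a 4-point DFT to each
 * of `batch` vectors stored contiguously, 4 complex values per vector.
 * `in` and `out` must not overlap. */
void fft4_batch(const float complex *in, float complex *out, size_t batch)
{
    /* 4-point DFT matrix: F4[j][k] = exp(-2*pi*i*j*k/4) = (-i)^(j*k). */
    const float complex F4[4][4] = {
        {1,  1,  1,  1},
        {1, -I, -1,  I},
        {1, -1,  1, -1},
        {1,  I, -1, -I},
    };

    for (size_t b = 0; b < batch; ++b) {
        for (size_t j = 0; j < 4; ++j) {
            /* Row j of F4 times the b-th input vector. */
            float complex sum = 0;
            for (size_t k = 0; k < 4; ++k)
                sum += F4[j][k] * in[4 * b + k];
            out[4 * b + j] = sum;
        }
    }
}
```

For the input vector (1, 2, 3, 4) this gives (10, -2+2i, -2, -2-2i). Writing the base case as a small matrix product is what lets a batched GEMM routine take over the work; the loops here only stand in for that call.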

./Experiment

Contains code that generates experimental results for the paper. checkCudaErrors.h and helper_string.h contain helper functions. cufft_fail_to_speedup/ contains experiment code and results showing that cuFFT does not attain the same level of acceleration as cuBLAS.

./matlab

Contains MATLAB simulation programs. We wrote MATLAB code to make sure the algorithm was correct before starting the CUDA implementation. Most of the code was provided by Dr. D'Azevedo.

./performance_result

Contains test results for accuracy and speed. Files whose names start with nvprof_ were generated by the nvprof profiler. The names generally indicate the corresponding programs in ./CUDA, and the test configurations are listed before the performance results.

event_timing_my_transpose.txt records the time spent in our own transpose kernel. The other two files, whose names end with _wrong, contain faulty results because the error values were not initialized to 0.

./reference

Contains useful CUDA code provided by NVIDIA. cuda_fp16 is the definition of the half-precision type. helper_cuda contains useful helper functions and is included in our implementation programs. The other directories are copied from the CUDA samples, which helped us get familiar with CUDA programming.

./result

Contains some results printed by programs in ./CUDA. The filename usually indicates the corresponding program. We compared these results with MATLAB fft output to verify the correctness and accuracy of the computation.
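
One way to quantify such a comparison is a relative error against the reference output. The sketch below is an assumption about how this could be done, not necessarily the metric used in this project: relative_error_sq is a hypothetical helper, and the ref array stands in for values exported from MATLAB.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper (not from the repository): squared relative L2 error
 * ||x - ref||^2 / ||ref||^2 between a computed result and a reference.
 * If the reference is all zeros, the squared absolute error is returned. */
double relative_error_sq(const float complex *x, const float complex *ref,
                         size_t n)
{
    /* Both accumulators must start at 0; otherwise the sums are garbage. */
    double diff_sq = 0.0;
    double ref_sq = 0.0;

    for (size_t i = 0; i < n; ++i) {
        double dr = (double)crealf(x[i]) - (double)crealf(ref[i]);
        double di = (double)cimagf(x[i]) - (double)cimagf(ref[i]);
        double rr = crealf(ref[i]);
        double ri = cimagf(ref[i]);
        diff_sq += dr * dr + di * di;   /* |x[i] - ref[i]|^2 */
        ref_sq += rr * rr + ri * ri;    /* |ref[i]|^2 */
    }
    return ref_sq > 0.0 ? diff_sq / ref_sq : diff_sq;
}
```

Take the square root of the return value (for example with sqrt from math.h) to get the relative error itself. Accumulating in double keeps the rounding of the comparison well below the single- and half-precision errors being measured.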

xcheng98/REU_FFT
