Scalable distributed FFT implementation for heterogeneous CPU/GPU systems, built on Dagger.jl
Figure: Task-scheduled 3D FFT implementation using pencil decomposition, asynchronous transforms, and data movement.
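The pencil decomposition named in the figure caption relies on the fact that a multidimensional FFT can be computed as a sequence of 1D FFTs, one dimension at a time. The sketch below illustrates only that property, on a single process with plain FFTW.jl. It is not DaggerFFT's implementation; the array size and the `stage*` and `matches` names are illustrative assumptions.

```julia
using FFTW

# Small complex test array; the sizes are arbitrary and differ per axis.
A = rand(ComplexF64, 8, 6, 4)

# Transform one dimension at a time. In a distributed pencil decomposition
# each stage operates on data that is local along the transformed axis,
# with a redistribution between stages; here everything is in one array.
stage1 = fft(A, 1)
stage2 = fft(stage1, 2)
stage3 = fft(stage2, 3)

# The staged result agrees with the direct 3D transform up to rounding error.
matches = stage3 ≈ fft(A, (1, 2, 3))
```

In a distributed setting the costly part is the data movement between stages, which is why the caption lists data movement alongside the transforms.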
using Pkg
Pkg.add("DaggerFFT")

Or from the Julia REPL:
] add DaggerFFT

# Complex-to-complex 3D FFT with pencil decomposition
using DaggerFFT
A = rand(ComplexF64, 128, 128, 128)
F = fft(A; decomp=Pencil(), dims=(1,2,3))
A_recon = ifft(F; decomp=Pencil(), dims=(1,2,3))

# Real-to-complex 3D FFT with pencil decomposition
using DaggerFFT
A = rand(256, 256, 256)
F = rfft(A; decomp=Pencil(), dims=(1,2,3))
A_recon = irfft(F, size(A, 1); decomp=Pencil(), dims=(1,2,3))

# Complex-to-complex 3D FFT on a CUDA GPU with slab decomposition
using DaggerFFT
using CUDA
A = CUDA.rand(ComplexF64, 256, 256, 256)
F = fft(A; decomp=Slab(), dims=(1,2,3))
A_recon = ifft(F; decomp=Slab(), dims=(1,2,3))

# 2D real-to-real transform (DCT-II forward, DCT-III inverse) with slab decomposition
using DaggerFFT
using FFTW
A = rand(256, 256)
F = fft(A; decomp=Slab(), transforms=(R2R((FFTW.REDFT10, FFTW.REDFT10)),), dims=(1,2))
A_recon = ifft(F; decomp=Slab(), transforms=(R2R((FFTW.REDFT01, FFTW.REDFT01)),), dims=(1,2))