Auto-tuning of Fast Fourier Transform on Graphics Processors
Publication Year: 2011
  Yuri Dotsenko, Sara Sadeghi Baghsorkhi, Brandon Lloyd, Naga Govindaraju
  Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Feb. 2011

We present an auto-tuning framework for FFTs on graphics processors (GPUs). Due to the complex design of the memory and compute subsystems on GPUs, the performance of FFT kernels can vary widely over the range of possible input parameters. For each component of the FFT kernel, we generate several variants that are likely to perform well in different cases. Our auto-tuner composes these variants to generate kernels and selects the best ones. We present heuristics that prune the search space so that only a small fraction of all possible kernels needs to be profiled. We then compose the optimized kernels to improve the performance of larger FFT computations. We implement the system using the NVIDIA CUDA API and compare its performance to state-of-the-art FFT libraries. On a range of NVIDIA GPUs and input sizes, our auto-tuned FFTs outperform the NVIDIA CUFFT 3.0 library by up to 38× and deliver up to 3× higher performance than a manually tuned FFT.
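
The compose, prune, and profile loop described in the abstract can be pictured with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the component names (`radix`, `threads_per_block`, `use_shared_memory`), the pruning rule in `passes_heuristics`, and the synthetic `toy_cost` function, which stands in for timing a real CUDA kernel.

```python
import itertools

# Hypothetical variant choices for each kernel component (illustrative only).
COMPONENTS = {
    "radix": [2, 4, 8],
    "threads_per_block": [64, 128, 256],
    "use_shared_memory": [False, True],
}


def enumerate_candidates(components):
    """Compose one variant per component into every possible kernel config."""
    keys = list(components)
    for values in itertools.product(*(components[k] for k in keys)):
        yield dict(zip(keys, values))


def passes_heuristics(cfg, n):
    """Assumed pruning rule: the radix must divide the FFT size, and a thread
    block must not be larger than the number of radix-sized butterflies."""
    if n % cfg["radix"] != 0:
        return False
    return cfg["threads_per_block"] <= n // cfg["radix"]


def toy_cost(cfg, n):
    """Synthetic cost model standing in for a measured kernel run time."""
    work = n / (cfg["radix"] * cfg["threads_per_block"])
    memory_factor = 0.5 if cfg["use_shared_memory"] else 1.0
    return work * memory_factor + 0.01 * cfg["radix"]


def autotune(n, measure, components=COMPONENTS):
    """Profile only the configs that survive pruning and keep the cheapest."""
    best_cfg, best_cost, profiled = None, float("inf"), 0
    for cfg in enumerate_candidates(components):
        if not passes_heuristics(cfg, n):
            continue  # pruned: never profiled
        cost = measure(cfg, n)
        profiled += 1
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost, profiled


if __name__ == "__main__":
    cfg, cost, profiled = autotune(1024, toy_cost)
    total = sum(1 for _ in enumerate_candidates(COMPONENTS))
    print(f"profiled {profiled} of {total} candidates, best: {cfg}")
```

In a real tuner, `measure` would launch and time the generated kernel on the target GPU. Passing it in as a callable keeps the search logic separate from the measurement method.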