High performance discrete Fourier transforms on graphics processors

1 November 2008

conference paper
conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 1-12
https://doi.org/10.1109/sc.2008.5213922

Abstract

We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU-implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4times over CUFFT and 8-40times improvement over MKL for large sizes.

Keywords

This publication has 6 references indexed in Scilit:

FFTC: Fastest Fourier Transform for the IBM Cell Broadband Engine
Published by Springer Science and Business Media LLC ,2008
A Survey of General‐Purpose Computation on Graphics Hardware
Computer Graphics Forum, 2007
The potential of the cell processor for scientific computing
Published by Association for Computing Machinery (ACM) ,2006
Memory---A memory model for scientific algorithms on graphics processors
Published by Association for Computing Machinery (ACM) ,2006
Computational Frameworks for the Fast Fourier Transform
Published by Society for Industrial & Applied Mathematics (SIAM) ,1992
FFTs in external or hierarchical memory
The Journal of Supercomputing, 1990

Cited by 134 articles