Exploring Data Streaming to Improve 3D FFT Implementation on Multiple GPUs

2010 
FFT is a well known and widely used algorithm in many scientific and engineering applications. However, FFT is a memory-bound problem that still presents performance challenges to new generations of computer architectures due to its relatively low ratio of computation per memory access. For GPU architectures, where the data transfers between the host CPU memory and the device memory is very expensive, the memory overhead can become a huge bottleneck for large size problems. In this work, we propose an efficient parallel implementation of FFT on multiple GPUs that tackles the overhead of host memory access, by implementing a streaming scheme that hides the data transfer latency. The idea is to divide the problem into smaller ones, generating several lighter and asynchronous memory transfers from host to device enabling the computation for those data simultaneously. We obtained an acceleration of approximately 60% over the non streamed GPU implementation.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    10
    References
    7
    Citations
    NaN
    KQI
    []