    Nuclear Fusion Simulation Code Optimization on GPU Clusters
    Abstract:
    GT5D is a nuclear fusion simulation program which aims to analyze turbulence phenomena in tokamak plasma. In this research, we optimize it for GPU clusters with multiple GPUs per node. Based on profiling results of GT5D on a CPU node, we decided to offload the whole time-development part of the program to GPUs, except for the MPI communication. We achieved up to 3.37 times faster performance in function-level evaluation, and 2.03 times faster performance in total, than CPU-only execution, both measured on the high-density GPU cluster HA-PACS, where each compute node consists of four NVIDIA M2090 GPUs and two Intel Xeon E5-2670 (Sandy Bridge) CPUs providing 16 cores in total. These improvements compare a single GPU against four CPU cores, not against a single CPU core. They include a 53% performance gain obtained by overlapping MPI communication with GPU computation (a sketch of this overlap pattern is given after the keywords below).
    Keywords:
    Xeon
    GPU cluster
    Xeon Phi
    Code (set theory)
    Multi-core processor
    CPU shielding
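    A minimal sketch of the communication/computation overlap pattern referred to in the abstract, assuming a generic 1-D halo exchange with one GPU per MPI process; the field layout, stencil, kernels, and names below are illustrative, not taken from GT5D. left and right are neighbor ranks (MPI_PROC_NULL at the domain ends), and h_send/h_recv are pinned host staging buffers.

        #include <mpi.h>
        #include <cuda_runtime.h>

        // Interior cells need no remote data and can be updated while halos travel.
        __global__ void update_interior(const double *f, double *f_new, int nx, int halo) {
            int i = blockIdx.x * blockDim.x + threadIdx.x + 2 * halo;
            if (i < nx - 2 * halo)
                f_new[i] = 0.5 * (f[i - 1] + f[i + 1]);   // placeholder stencil
        }

        // Cells adjacent to the halos are finished once the exchange has completed.
        __global__ void update_boundary(const double *f, double *f_new, int nx, int halo) {
            int i = threadIdx.x;
            if (i < halo) {
                f_new[halo + i]          = 0.5 * (f[halo + i - 1] + f[halo + i + 1]);
                f_new[nx - 2 * halo + i] = 0.5 * (f[nx - 2 * halo + i - 1] + f[nx - 2 * halo + i + 1]);
            }
        }

        // One time step: start the interior kernel, then run the MPI halo exchange
        // (staged through the host buffers) while that kernel is still executing.
        void step(double *d_f, double *d_fnew, double *h_send, double *h_recv,
                  int nx, int halo, int left, int right, MPI_Comm comm,
                  cudaStream_t s_compute, cudaStream_t s_copy)
        {
            int n_inner = nx - 4 * halo;
            update_interior<<<(n_inner + 255) / 256, 256, 0, s_compute>>>(d_f, d_fnew, nx, halo);

            // Stage the outgoing boundary slices to the host on a second stream.
            cudaMemcpyAsync(h_send,        d_f + halo,          halo * sizeof(double),
                            cudaMemcpyDeviceToHost, s_copy);
            cudaMemcpyAsync(h_send + halo, d_f + nx - 2 * halo, halo * sizeof(double),
                            cudaMemcpyDeviceToHost, s_copy);
            cudaStreamSynchronize(s_copy);

            // Non-blocking MPI exchange overlaps with the interior kernel.
            MPI_Request req[4];
            MPI_Irecv(h_recv,        halo, MPI_DOUBLE, left,  0, comm, &req[0]);
            MPI_Irecv(h_recv + halo, halo, MPI_DOUBLE, right, 1, comm, &req[1]);
            MPI_Isend(h_send,        halo, MPI_DOUBLE, left,  1, comm, &req[2]);
            MPI_Isend(h_send + halo, halo, MPI_DOUBLE, right, 0, comm, &req[3]);
            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

            // Push the received halos back to the device and finish the boundary cells.
            cudaMemcpyAsync(d_f,             h_recv,        halo * sizeof(double),
                            cudaMemcpyHostToDevice, s_copy);
            cudaMemcpyAsync(d_f + nx - halo, h_recv + halo, halo * sizeof(double),
                            cudaMemcpyHostToDevice, s_copy);
            cudaStreamSynchronize(s_copy);

            update_boundary<<<1, 256, 0, s_compute>>>(d_f, d_fnew, nx, halo);
            cudaDeviceSynchronize();
        }

    With pinned buffers and a CUDA-aware MPI library the two staging copies could be dropped, but the basic structure of hiding the exchange behind the interior kernel stays the same.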
    Graphics processing units (GPUs) have strong floating-point capability and high memory bandwidth for data-parallel work and have been widely used in high-performance computing (HPC). Compute Unified Device Architecture (CUDA) is used as the parallel computing platform and programming model for the GPU to reduce the complexity of programming. Programmable GPUs are becoming popular in computational fluid dynamics (CFD) applications. In this work, we propose a hybrid parallel algorithm combining the Message Passing Interface (MPI) and CUDA for CFD applications on multi-GPU HPC clusters. The AUSM+-up upwind scheme and the three-step Runge-Kutta method are used for spatial discretization and time discretization, respectively. Turbulence is modeled with the k-ω SST two-equation model. The CPU only manages GPU execution and communication, and the GPU is responsible for data processing. Parallel execution and memory access optimizations are used to optimize the GPU-based CFD codes. We propose a nonblocking communication method to fully overlap GPU computing, CPU-CPU communication, and CPU-GPU data transfer by creating two CUDA streams. Furthermore, the one-dimensional domain decomposition method is used to balance the workload among GPUs (a sketch follows this entry). Finally, we evaluate the hybrid parallel algorithm with the compressible turbulent flow over a flat plate. The performance of a single-GPU implementation and the scalability of multi-GPU clusters are discussed. Performance measurements show that multi-GPU parallelization can achieve a speedup of more than 36 times with respect to CPU-based parallel computing, and the parallel algorithm has good scalability.
    GPU cluster
    Graphics processing unit
    Speedup
    Benchmark (surveying)
    Citations (30)
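    The one-dimensional decomposition mentioned above reduces to splitting the decomposed axis into nearly equal slabs, one per GPU. A minimal sketch, assuming a uniform cost per cell (all names are illustrative):

        #include <cstdio>

        struct Slab { int begin, end; };      // [begin, end) index range along the decomposed axis

        // Split n_cells among n_gpus so that slab sizes differ by at most one cell.
        Slab slab_for_gpu(int n_cells, int n_gpus, int gpu) {
            int base  = n_cells / n_gpus;
            int rem   = n_cells % n_gpus;     // the first `rem` GPUs get one extra cell
            int begin = gpu * base + (gpu < rem ? gpu : rem);
            int size  = base + (gpu < rem ? 1 : 0);
            return { begin, begin + size };
        }

        int main() {
            const int n_cells = 1000, n_gpus = 3;
            for (int g = 0; g < n_gpus; ++g) {
                Slab s = slab_for_gpu(n_cells, n_gpus, g);
                std::printf("GPU %d: cells [%d, %d) -> %d cells\n", g, s.begin, s.end, s.end - s.begin);
            }
            return 0;
        }

    For non-uniform per-cell costs, the same interface can be kept while the slab boundaries are instead chosen from a prefix sum of per-cell weights.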
    We present direct astrophysical N-body simulations with up to six million bodies using our parallel MPI/CUDA code on large GPU clusters in China, with different kinds of GPU hardware. These clusters are directly linked under the Chinese Academy of Sciences special GPU cluster program. We reach about one third of the peak GPU performance for this code in a real application scenario, with individual hierarchical block time-steps, high-order (4th, 6th and 8th) Hermite integration schemes, and a realistic core-halo density structure of the modeled stellar systems.
    GPU cluster
    Graphics processing unit
    Code (set theory)
    Citations (13)
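    A minimal CUDA sketch of the predictor stage of the 4th-order Hermite scheme used in such codes; the array names and layout are illustrative:

        #include <cuda_runtime.h>

        // Predict positions and velocities to time t + dt using the stored
        // accelerations and jerks (their time derivatives).
        __global__ void hermite_predict(const double3 *pos, const double3 *vel,
                                        const double3 *acc, const double3 *jerk,
                                        double3 *pos_p, double3 *vel_p,
                                        double dt, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;

            double dt2 = dt * dt / 2.0;
            double dt3 = dt * dt * dt / 6.0;

            pos_p[i].x = pos[i].x + vel[i].x * dt + acc[i].x * dt2 + jerk[i].x * dt3;
            pos_p[i].y = pos[i].y + vel[i].y * dt + acc[i].y * dt2 + jerk[i].y * dt3;
            pos_p[i].z = pos[i].z + vel[i].z * dt + acc[i].z * dt2 + jerk[i].z * dt3;

            vel_p[i].x = vel[i].x + acc[i].x * dt + jerk[i].x * dt2;
            vel_p[i].y = vel[i].y + acc[i].y * dt + jerk[i].y * dt2;
            vel_p[i].z = vel[i].z + acc[i].z * dt + jerk[i].z * dt2;
        }

    In a hierarchical block time-step scheme, all bodies are predicted to the current block time as above, but only the active block has its forces recomputed and is corrected.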
    We present a GPU implementation of LAMMPS, a widely-used parallel molecular dynamics (MD) software package, and show 5x to 13x single node speedups versus the CPU-only version of LAMMPS. This new CUDA package for LAMMPS also enables multi-GPU simulation on hybrid heterogeneous clusters, using MPI for inter-node communication, CUDA kernels on the GPU for all methods working with particle data, and standard LAMMPS C++ code for CPU execution. Cell and neighbor list approaches are compared for best performance on GPUs, with thread-per-atom and block-per-atom neighbor list variants showing best performance at low and high neighbor counts, respectively. Computational performance results of GPU-enabled LAMMPS are presented for a variety of materials classes (e.g. biomolecules, polymers, metals, semiconductors), along with a speed comparison versus other available GPU-enabled MD software. Finally, we show strong and weak scaling performance on a CPU/GPU cluster using up to 128 dual GPU nodes.
    GPU cluster
    Citations (1)
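    A minimal CUDA sketch of the thread-per-atom neighbor-list variant described in the preceding entry, using a Lennard-Jones pair force; the flat neighbor-list layout and names are illustrative, not the actual LAMMPS CUDA package code:

        #include <cuda_runtime.h>

        // One thread per atom i: loop over i's neighbor list and accumulate the
        // Lennard-Jones force.  neigh is a flat [n_atoms * max_neigh] array and
        // num_neigh holds the per-atom neighbor counts.
        __global__ void lj_force_thread_per_atom(const float4 *pos,          // x, y, z, (w unused)
                                                 const int *neigh, const int *num_neigh,
                                                 int max_neigh, int n_atoms,
                                                 float epsilon, float sigma, float cutoff_sq,
                                                 float4 *force)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n_atoms) return;

            float4 pi = pos[i];
            float fx = 0.f, fy = 0.f, fz = 0.f;
            float sigma6 = sigma * sigma * sigma * sigma * sigma * sigma;

            for (int k = 0; k < num_neigh[i]; ++k) {
                int j = neigh[i * max_neigh + k];
                float dx = pi.x - pos[j].x;
                float dy = pi.y - pos[j].y;
                float dz = pi.z - pos[j].z;
                float r2 = dx * dx + dy * dy + dz * dz;
                if (r2 >= cutoff_sq) continue;

                float inv_r2 = 1.f / r2;
                float sr6 = sigma6 * inv_r2 * inv_r2 * inv_r2;             // (sigma/r)^6
                // F/r = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2
                float f_over_r = 24.f * epsilon * (2.f * sr6 * sr6 - sr6) * inv_r2;
                fx += f_over_r * dx;
                fy += f_over_r * dy;
                fz += f_over_r * dz;
            }
            force[i] = make_float4(fx, fy, fz, 0.f);
        }

    A block-per-atom variant instead assigns a whole thread block to one atom's neighbor list and reduces the partial forces in shared memory, which is the regime the entry reports as faster at high neighbor counts.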
    This paper presents benchmarking and scaling studies of a GPU-accelerated three-dimensional compressible magnetohydrodynamic code. The code is developed with a view to explaining large- and intermediate-scale magnetic field generation in the cosmos, as well as in nuclear fusion reactors, in the light of the theory given by Eugene Newman Parker. Spatial derivatives are computed with a pseudo-spectral method and the time solvers are explicit. GPU acceleration is achieved with minimal code changes through OpenACC parallelization and use of the NVIDIA CUDA Fast Fourier Transform library (cuFFT). NVIDIA's unified memory is leveraged to enable oversubscription of the GPU device memory for seamless out-of-core processing of large grids. Our experimental results indicate that the GPU-accelerated code is able to achieve up to two orders of magnitude speedup over a corresponding OpenMP-parallel, FFTW-based code on an NVIDIA Tesla P100 GPU. For large grids that require out-of-core processing on the GPU, we see a 7x speedup over the OpenMP/FFTW-based code on the Tesla P100 GPU. We also present performance analysis of the GPU-accelerated code on different GPU architectures: Kepler, Pascal and Volta.
    Speedup
    Code (set theory)
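    A minimal sketch of the cuFFT-plus-unified-memory combination described above: cudaMallocManaged lets the working set exceed device memory (oversubscription, supported on Pascal-class GPUs and later) while the driver pages data in and out. The grid size and field name are illustrative.

        // build: nvcc -o fft_demo fft_demo.cu -lcufft
        #include <cufft.h>
        #include <cuda_runtime.h>
        #include <cstdio>

        int main() {
            const int nx = 256, ny = 256, nz = 256;          // illustrative grid
            size_t n = (size_t)nx * ny * nz;

            // Managed memory: the same pointer is valid on host and device, and the
            // allocation may exceed the GPU's physical memory.
            cufftDoubleComplex *field;
            cudaMallocManaged(&field, n * sizeof(cufftDoubleComplex));
            for (size_t i = 0; i < n; ++i) { field[i].x = 1.0; field[i].y = 0.0; }

            cufftHandle plan;
            cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);

            // Forward transform in place; a pseudo-spectral step would multiply the
            // modes by ik here to obtain spatial derivatives, then transform back.
            cufftExecZ2Z(plan, field, field, CUFFT_FORWARD);
            cufftExecZ2Z(plan, field, field, CUFFT_INVERSE);
            cudaDeviceSynchronize();

            std::printf("field[0] = (%f, %f) after forward+inverse (unnormalized)\n",
                        field[0].x, field[0].y);

            cufftDestroy(plan);
            cudaFree(field);
            return 0;
        }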
    Hard-core interacting particle methods are of increasing importance for simulations and game applications as well as a tool supporting animations. We develop a high-accuracy numerical integration technique for managing hard-core colliding particles of various physical properties, such as differing interaction species and hard-core radii, using multiple Graphics Processing Unit (m-GPU) computing techniques. We report on the performance tradeoffs between communications and computations for various model parameters and for a range of individual GPU models and multiple-GPU combinations. We explore uses of the GPUDirect communications mechanisms between multiple GPUs accelerating the same CPU host and show that m-GPU multi-level parallelism is a powerful approach for complex N-body simulations that will deploy well on commodity systems.
    Multi-core processor
    GPU cluster
    Citations (2)
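    A minimal sketch of the peer-to-peer path (GPUDirect P2P) between two GPUs attached to the same host, which is the communication mechanism the preceding entry exploits; buffer names and sizes are illustrative:

        #include <cuda_runtime.h>
        #include <cstdio>

        int main() {
            int n_gpus = 0;
            cudaGetDeviceCount(&n_gpus);
            if (n_gpus < 2) { std::printf("need at least two GPUs\n"); return 0; }

            const size_t n = 1 << 20;
            float *buf0, *buf1;

            cudaSetDevice(0);
            cudaMalloc(&buf0, n * sizeof(float));
            cudaSetDevice(1);
            cudaMalloc(&buf1, n * sizeof(float));

            // Enable direct access so copies (and kernel loads/stores) between the
            // two devices bypass host memory when the hardware topology allows it.
            int can01 = 0, can10 = 0;
            cudaDeviceCanAccessPeer(&can01, 0, 1);
            cudaDeviceCanAccessPeer(&can10, 1, 0);
            if (can01 && can10) {
                cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
                cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
            }

            // Copy particle data from GPU 0 to GPU 1; this falls back to a staged
            // copy through the host automatically if peer access is unavailable.
            cudaMemcpyPeer(buf1, 1, buf0, 0, n * sizeof(float));
            cudaDeviceSynchronize();

            std::printf("peer access 0<->1: %d/%d\n", can01, can10);
            cudaFree(buf1);
            cudaSetDevice(0);
            cudaFree(buf0);
            return 0;
        }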
    Synergia is a parallel, 3-dimensional space-charge particle-in-cell accelerator modeling code. We present our work porting the purely MPI-based version of the code to a hybrid of CPU and GPU computing kernels. The hybrid code uses the CUDA platform in the same framework as the pure MPI solution. We have implemented a lock-free collaborative charge-deposition algorithm for the GPU, as well as other optimizations, including local communication avoidance for GPUs, a customized FFT, and fine-tuned memory access patterns. On a small GPU cluster (up to 4 Tesla C1070 GPUs), our benchmarks exhibit both superior peak performance and better scaling than a CPU cluster with 16 nodes and 128 cores. We also compare the code performance on different GPU architectures, including C1070 Tesla and K20 Kepler.
    Porting
    GPU cluster
    Code (set theory)
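    A minimal CUDA sketch in the spirit of the lock-free charge deposition mentioned above, reduced to one-dimensional cloud-in-cell weighting with atomicAdd resolving concurrent writes; the actual Synergia deposition works on a 3-D grid and is more elaborate:

        #include <cuda_runtime.h>

        // Each thread deposits one macro-particle's charge onto the two nearest
        // grid nodes.  atomicAdd makes concurrent deposits from different threads
        // safe without any locks.
        __global__ void deposit_cic_1d(const float *x, const float *q, int n_particles,
                                       float *rho, int n_grid, float dx)
        {
            int p = blockIdx.x * blockDim.x + threadIdx.x;
            if (p >= n_particles) return;

            float xi = x[p] / dx;                 // position in grid units
            int   i  = (int)floorf(xi);
            float w  = xi - i;                    // linear (cloud-in-cell) weight

            if (i >= 0 && i + 1 < n_grid) {
                atomicAdd(&rho[i],     q[p] * (1.0f - w));
                atomicAdd(&rho[i + 1], q[p] * w);
            }
        }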
    While GPUs are becoming a compelling acceleration solution for a range of scientific applications, most existing work on climate models has achieved only limited speedup. This is due to partial porting of the huge code base and the memory-bound nature of these models. In this work, we design and implement a customized GPU-based acceleration of the Princeton Ocean Model (gpuPOM) based on mpiPOM, which is one of the parallel versions of the Princeton Ocean Model. Based on NVIDIA's state-of-the-art GPU architectures (K20X and K40m), we rewrite the full mpiPOM model from the original Fortran version into a CUDA-C version. We present the GPU acceleration methods used in gpuPOM, especially the techniques to ease its memory-bound problem through better use of the GPU's memory hierarchy. The experimental results indicate that gpuPOM with one K40m GPU achieves from 6.3-fold to 16.7-fold speedup over different Intel multi-core CPUs, and one K20X GPU achieves from 5.8-fold to 15.5-fold speedup.
    Speedup
    Porting
    Memory hierarchy
    Fortran
    Citations (12)
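    A minimal CUDA sketch of shared-memory tiling, one of the standard memory-hierarchy techniques for memory-bound stencil codes of the kind described above; the 5-point stencil and names are illustrative, not the actual gpuPOM kernels:

        #include <cuda_runtime.h>

        #define TILE 16

        // 5-point Laplacian on a 2-D field.  Each block stages a (TILE+2)^2 tile,
        // including a one-cell halo, into shared memory so each interior value is
        // read from global memory once instead of up to five times.
        // Launch with dim3 block(TILE, TILE), dim3 grid((nx+TILE-1)/TILE, (ny+TILE-1)/TILE).
        __global__ void laplacian_tiled(const float *in, float *out, int nx, int ny)
        {
            __shared__ float tile[TILE + 2][TILE + 2];

            int gx = blockIdx.x * TILE + threadIdx.x;   // global coordinates
            int gy = blockIdx.y * TILE + threadIdx.y;
            int lx = threadIdx.x + 1;                   // local coordinates inside the tile
            int ly = threadIdx.y + 1;

            // Load the centre of the tile (out-of-range threads load a dummy 0).
            tile[ly][lx] = (gx < nx && gy < ny) ? in[gy * nx + gx] : 0.0f;

            // Edge threads also load the one-cell halo around the tile.
            if (threadIdx.x == 0)
                tile[ly][0]        = (gx > 0      && gy < ny) ? in[gy * nx + gx - 1] : 0.0f;
            if (threadIdx.x == TILE - 1)
                tile[ly][TILE + 1] = (gx + 1 < nx && gy < ny) ? in[gy * nx + gx + 1] : 0.0f;
            if (threadIdx.y == 0)
                tile[0][lx]        = (gy > 0      && gx < nx) ? in[(gy - 1) * nx + gx] : 0.0f;
            if (threadIdx.y == TILE - 1)
                tile[TILE + 1][lx] = (gy + 1 < ny && gx < nx) ? in[(gy + 1) * nx + gx] : 0.0f;

            __syncthreads();

            // Interior points only; domain boundaries are left untouched in this sketch.
            if (gx > 0 && gx < nx - 1 && gy > 0 && gy < ny - 1)
                out[gy * nx + gx] = tile[ly - 1][lx] + tile[ly + 1][lx]
                                  + tile[ly][lx - 1] + tile[ly][lx + 1]
                                  - 4.0f * tile[ly][lx];
        }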
    We present a gravitational hierarchical N-body code that is designed to run efficiently on Graphics Processing Units (GPUs). All parts of the algorithm are executed on the GPU, which eliminates the need for data transfer between the Central Processing Unit (CPU) and the GPU. Our tests indicate that the gravitational tree-code outperforms tuned CPU code for all parts of the algorithm and show an overall performance improvement of more than a factor of 20, resulting in a processing rate of more than 2.8 million particles per second.
    Graphics processing unit
    Code (set theory)
    Tree (set theory)
    Factor (programming language)
    Citations (2)
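    A minimal CUDA sketch of the innermost building block of such tree-codes: accumulating softened gravitational accelerations on each body from a per-body interaction list produced by the tree walk. The tree construction and traversal themselves are not shown, and all names are illustrative.

        #include <cuda_runtime.h>

        // Accumulate the softened gravitational acceleration on body i from its
        // interaction list (direct neighbours or cell multipoles treated as point
        // masses).  G = 1 units; eps2 is the Plummer softening squared.
        __global__ void accumulate_gravity(const float4 *body,        // x, y, z, mass
                                           const int *list, const int *list_len,
                                           int max_len, int n_bodies, float eps2,
                                           float4 *accel)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n_bodies) return;

            float4 bi = body[i];
            float ax = 0.f, ay = 0.f, az = 0.f;

            for (int k = 0; k < list_len[i]; ++k) {
                float4 bj = body[list[i * max_len + k]];
                float dx = bj.x - bi.x, dy = bj.y - bi.y, dz = bj.z - bi.z;
                float r2 = dx * dx + dy * dy + dz * dz + eps2;
                float inv_r  = rsqrtf(r2);
                float inv_r3 = inv_r * inv_r * inv_r;
                ax += bj.w * inv_r3 * dx;
                ay += bj.w * inv_r3 * dy;
                az += bj.w * inv_r3 * dz;
            }
            accel[i] = make_float4(ax, ay, az, 0.f);
        }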
    We discuss the CUDA approach to the simulation of pure gauge lattice SU(2). CUDA is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU in single precision. Analyses with single and multiple GPUs, using CUDA and OpenMP, are also presented. In order to obtain high performance, the code must be optimized for the GPU architecture, i.e., an implementation that exploits the memory hierarchy of the CUDA programming model. Using GPU texture memory and minimizing the data transfers between CPU and GPU, we achieve a speedup of 200 using 2 NVIDIA GTX 295 cards relative to a serial CPU, which demonstrates that GPUs can serve as an efficient platform for scientific computing. With multiple GPUs we are able, in one day of computation, to generate 1,000,000 gauge configurations on a 48⁴ lattice at β = 6.0 and calculate the mean average plaquette. We present results for the mean average plaquette for several lattice sizes and different β. Finally, we present results for the mean average Polyakov loop at finite temperature.
    Speedup
    Memory hierarchy
    Double-precision floating-point format
    Citations (0)
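    A minimal sketch of the CUDA-plus-OpenMP multi-GPU pattern referred to in the preceding entry: one OpenMP thread drives each GPU, and each GPU runs its own independent stream of sweeps. The kernel is a placeholder, not the actual SU(2) link update or plaquette measurement.

        // build: nvcc -Xcompiler -fopenmp -o multi_gpu multi_gpu.cu
        #include <cuda_runtime.h>
        #include <omp.h>
        #include <cstdio>

        // Placeholder device work standing in for the lattice update kernels.
        __global__ void dummy_update(float *lattice, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) lattice[i] += 1.0f;
        }

        int main() {
            int n_gpus = 0;
            cudaGetDeviceCount(&n_gpus);

            const int n = 1 << 20;                        // lattice sites per GPU (illustrative)

            // One host thread per GPU: cudaSetDevice binds all subsequent CUDA
            // calls made by that thread to its device.
            #pragma omp parallel num_threads(n_gpus)
            {
                int dev = omp_get_thread_num();
                cudaSetDevice(dev);

                float *lattice;
                cudaMalloc(&lattice, n * sizeof(float));
                cudaMemset(lattice, 0, n * sizeof(float));

                for (int sweep = 0; sweep < 100; ++sweep)
                    dummy_update<<<(n + 255) / 256, 256>>>(lattice, n);
                cudaDeviceSynchronize();

                #pragma omp critical
                std::printf("GPU %d finished its sweeps\n", dev);

                cudaFree(lattice);
            }
            return 0;
        }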