    Classical Mechanical Hard-Core Particles Simulated in a Rigid Enclosure using Multi-GPU Systems
    Abstract:
    Hard-core interacting particle methods are of increasing importance for simulations and game applications as well as a tool supporting animations. We develop a high accuracy numerical integration technique for managing hard-core colliding particles of various physical properties such as differing interaction species and hard-core radii using multiple Graphical Processing Unit (m-GPU) computing techniques. We report on the performance tradeoffs between communications and computations for various model parameters and for a range of individual GPU models and multiple-GPU combinations. We explore uses of the GPU Direct communications mechanisms between multiple GPUs accelerating the same CPU host and show that m-GPU multi-level parallelism is a powerful approach for complex N-Body simulations that will deploy well on commodity systems.
    Keywords: Multi-core processor, GPU cluster
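    As a rough illustration of the GPU Direct mechanism referred to in the abstract above, the following CUDA sketch enables peer-to-peer access between two GPUs on one host and copies a halo buffer directly between their device memories without staging through host RAM. This is a minimal sketch under assumed names and sizes (halo0, halo1, nHalo), not the authors' simulation code.

        // Minimal GPU Direct peer-to-peer sketch; buffer names and sizes are assumptions.
        #include <cuda_runtime.h>
        #include <cstdio>

        int main() {
            int nDevices = 0;
            cudaGetDeviceCount(&nDevices);
            if (nDevices < 2) { printf("need at least two GPUs\n"); return 0; }

            // Check and enable peer access in both directions between GPU 0 and GPU 1.
            int can01 = 0, can10 = 0;
            cudaDeviceCanAccessPeer(&can01, 0, 1);
            cudaDeviceCanAccessPeer(&can10, 1, 0);

            const size_t nHalo = 1 << 20;                      // illustrative halo size
            float *halo0 = nullptr, *halo1 = nullptr;

            cudaSetDevice(0);
            if (can01) cudaDeviceEnablePeerAccess(1, 0);
            cudaMalloc(&halo0, nHalo * sizeof(float));

            cudaSetDevice(1);
            if (can10) cudaDeviceEnablePeerAccess(0, 0);
            cudaMalloc(&halo1, nHalo * sizeof(float));

            // Copy halo data from GPU 0 to GPU 1 directly over PCIe/NVLink.
            cudaMemcpyPeer(halo1, 1, halo0, 0, nHalo * sizeof(float));
            cudaDeviceSynchronize();

            cudaFree(halo1);
            cudaSetDevice(0);
            cudaFree(halo0);
            return 0;
        }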
    GT5D is a nuclear fusion simulation program that analyzes turbulence phenomena in tokamak plasma. In this research, we optimize it for GPU clusters with multiple GPUs per node. Based on profiling GT5D on a CPU node, we offload the entire time-development part of the program to GPUs, except for MPI communication. We achieve up to 3.37 times faster performance at the function level and 2.03 times faster performance overall than CPU-only execution, measured on the high-density GPU cluster HA-PACS, where each compute node consists of four NVIDIA M2090 GPUs and two Intel Xeon E5-2670 (Sandy Bridge) CPUs providing 16 cores in total. These improvements compare a single GPU against four CPU cores, not against a single CPU core, and include a 53% performance gain from overlapping the communication between MPI processes with GPU computation.
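    The communication/computation overlap described in that abstract can be sketched generically: launch the GPU kernel for the interior of the domain asynchronously, exchange halo data with non-blocking MPI while it runs, then finish the boundary once the halo has arrived. The kernel, buffer names and data layout below are hypothetical, not the GT5D implementation.

        // Hypothetical overlap pattern (CUDA + MPI); compile with nvcc and link against MPI.
        #include <mpi.h>
        #include <cuda_runtime.h>

        __global__ void compute_interior(float *f, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) f[i] += 1.0f;                           // placeholder interior update
        }

        // d_field is assumed to hold n interior cells followed by nHalo ghost cells.
        void timestep(float *d_field, int n, float *h_send, float *h_recv,
                      int nHalo, int left, int right, cudaStream_t s)
        {
            MPI_Request reqs[2];

            // Interior update runs asynchronously on the GPU...
            compute_interior<<<(n + 255) / 256, 256, 0, s>>>(d_field, n);

            // ...while halo data is exchanged between MPI ranks on the CPU.
            MPI_Irecv(h_recv, nHalo, MPI_FLOAT, left,  0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(h_send, nHalo, MPI_FLOAT, right, 0, MPI_COMM_WORLD, &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

            // Received halo is copied into the ghost cells after the interior kernel.
            cudaMemcpyAsync(d_field + n, h_recv, nHalo * sizeof(float),
                            cudaMemcpyHostToDevice, s);
            cudaStreamSynchronize(s);
        }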
    Graphics processing units (GPUs) have strong floating-point capability and high memory bandwidth for data-parallel work and have been widely used in high-performance computing (HPC). The Compute Unified Device Architecture (CUDA) is used as a parallel computing platform and programming model for the GPU to reduce the complexity of programming. Programmable GPUs are becoming popular in computational fluid dynamics (CFD) applications. In this work, we propose a hybrid parallel algorithm combining the Message Passing Interface (MPI) and CUDA for CFD applications on multi-GPU HPC clusters. The AUSM+UP upwind scheme and the three-step Runge–Kutta method are used for spatial and time discretization, respectively. The turbulent solution is solved with the k-ω SST two-equation model. The CPU only manages GPU execution and communication, while the GPU is responsible for data processing. Parallel execution and memory-access optimizations are used to optimize the GPU-based CFD code. We propose a nonblocking communication method that fully overlaps GPU computing, CPU-CPU communication, and CPU-GPU data transfer by creating two CUDA streams. Furthermore, a one-dimensional domain decomposition is used to balance the workload among GPUs. Finally, we evaluate the hybrid parallel algorithm on compressible turbulent flow over a flat plate. The performance of a single-GPU implementation and the scalability of multi-GPU clusters are discussed. Performance measurements show that multi-GPU parallelization can achieve a speedup of more than 36 times with respect to CPU-based parallel computing, and the parallel algorithm has good scalability.
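    The two-stream overlap described in that abstract can be illustrated with a short standalone CUDA sketch; the kernel and buffer names are placeholders, not the authors' CFD solver. One stream advances the interior cells while a second stream moves boundary data to the device, so the PCIe transfer hides behind the compute.

        // Minimal two-stream compute/transfer overlap sketch; names and sizes are illustrative.
        #include <cuda_runtime.h>

        __global__ void update_cells(float *u, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) u[i] *= 0.5f;                           // placeholder update
        }

        int main() {
            const int nInterior = 1 << 22, nBoundary = 1 << 16;
            float *d_interior, *d_boundary, *h_boundary;
            cudaMalloc(&d_interior, nInterior * sizeof(float));
            cudaMalloc(&d_boundary, nBoundary * sizeof(float));
            cudaMallocHost(&h_boundary, nBoundary * sizeof(float));   // pinned for async copy
            for (int i = 0; i < nBoundary; ++i) h_boundary[i] = 1.0f;

            cudaStream_t compute, transfer;
            cudaStreamCreate(&compute);
            cudaStreamCreate(&transfer);

            // Interior kernel and boundary host-to-device copy run concurrently.
            update_cells<<<(nInterior + 255) / 256, 256, 0, compute>>>(d_interior, nInterior);
            cudaMemcpyAsync(d_boundary, h_boundary, nBoundary * sizeof(float),
                            cudaMemcpyHostToDevice, transfer);

            // Boundary cells are updated only after their data has arrived.
            cudaStreamSynchronize(transfer);
            update_cells<<<(nBoundary + 255) / 256, 256, 0, compute>>>(d_boundary, nBoundary);
            cudaDeviceSynchronize();

            cudaStreamDestroy(compute); cudaStreamDestroy(transfer);
            cudaFree(d_interior); cudaFree(d_boundary); cudaFreeHost(h_boundary);
            return 0;
        }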
    The visualization of molecular orbitals (MOs) is important for analyzing the results of quantum chemistry simulations. The functions describing the MOs are computed on a three-dimensional lattice, and the resulting data can then be used for plotting isocontours or isosurfaces for visualization as well as for other types of analyses. Existing software packages that render MOs perform calculations on the CPU and require runtimes of tens to hundreds of seconds depending on the complexity of the molecular system.
    We present a GPU implementation of LAMMPS, a widely-used parallel molecular dynamics (MD) software package, and show 5x to 13x single node speedups versus the CPU-only version of LAMMPS. This new CUDA package for LAMMPS also enables multi-GPU simulation on hybrid heterogeneous clusters, using MPI for inter-node communication, CUDA kernels on the GPU for all methods working with particle data, and standard LAMMPS C++ code for CPU execution. Cell and neighbor list approaches are compared for best performance on GPUs, with thread-per-atom and block-per-atom neighbor list variants showing best performance at low and high neighbor counts, respectively. Computational performance results of GPU-enabled LAMMPS are presented for a variety of materials classes (e.g. biomolecules, polymers, metals, semiconductors), along with a speed comparison versus other available GPU-enabled MD software. Finally, we show strong and weak scaling performance on a CPU/GPU cluster using up to 128 dual GPU nodes.
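    As a rough illustration of the thread-per-atom neighbour-list traversal mentioned above (not the LAMMPS CUDA package code itself; the data layout and Lennard-Jones parameters are assumptions), each thread owns one atom and walks that atom's neighbour list, while block-per-atom variants instead assign a whole thread block to a single atom's list.

        // Thread-per-atom Lennard-Jones force kernel sketch (epsilon = sigma = 1 assumed).
        __global__ void lj_force_thread_per_atom(const float3 *pos, float3 *force,
                                                 const int *neighbors, const int *numNeigh,
                                                 int maxNeigh, int nAtoms, float cutsq)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= nAtoms) return;

            float3 pi = pos[i];
            float3 fi = make_float3(0.f, 0.f, 0.f);

            // Each thread walks the neighbour list of its own atom.
            for (int k = 0; k < numNeigh[i]; ++k) {
                int j = neighbors[i * maxNeigh + k];
                float dx = pi.x - pos[j].x, dy = pi.y - pos[j].y, dz = pi.z - pos[j].z;
                float r2 = dx * dx + dy * dy + dz * dz;
                if (r2 < cutsq && r2 > 0.f) {
                    float inv2 = 1.f / r2, inv6 = inv2 * inv2 * inv2;
                    float fpair = 24.f * inv2 * inv6 * (2.f * inv6 - 1.f);
                    fi.x += fpair * dx; fi.y += fpair * dy; fi.z += fpair * dz;
                }
            }
            force[i] = fi;
        }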
    Molecular dynamics (MD) simulation is a powerful computational tool for studying the behaviour of macromolecular systems. However, many simulations in this field are limited in spatial or temporal scale by the available computational resources. In recent years, graphics processing units (GPUs) have provided unprecedented computational power for scientific applications, and many MD algorithms suit the multithreaded nature of the GPU. In this paper, MD algorithms for macromolecular systems that run entirely on the GPU are presented. For validation, we have performed MD simulations of polymer crystallisation with our GPU package, GPU_MD-1.0.5; the results agree perfectly with computations on CPUs, while GPU_MD-1.0.5 achieves about 39 times speedup compared with GROMACS-4.0.5 on a single CPU core. Our single-GPU code therefore already provides an inexpensive alternative to traditional CPU clusters for macromolecular simulations and will serve as a basis for developing parallel GPU programs to further speed up the computations. Keywords: macromolecule; molecular dynamics; speedup; GPU; CUDA. Acknowledgements: This work is supported by the National Natural Science Foundation of China under Grants Nos 20874107 and 20821092 and by the Chinese Academy of Sciences under Grants Nos KJCX2-SW-L08 and KGCX2-YW-124.
    This paper presents benchmarking and scaling studies of a GPU-accelerated three-dimensional compressible magnetohydrodynamic code. The code is developed with a view to explaining large- and intermediate-scale magnetic field generation in the cosmos as well as in nuclear fusion reactors, in the light of the theory given by Eugene Newman Parker. The spatial derivatives are computed with a pseudo-spectral method and the time solvers are explicit. GPU acceleration is achieved with minimal code changes through OpenACC parallelization and use of the NVIDIA CUDA Fast Fourier Transform library (cuFFT). NVIDIA's unified memory is leveraged to enable oversubscription of GPU device memory for seamless out-of-core processing of large grids. Our experimental results indicate that the GPU-accelerated code achieves up to two orders of magnitude speedup over a corresponding OpenMP-parallel, FFTW-based code on an NVIDIA Tesla P100 GPU. For large grids that require out-of-core processing on the GPU, we see a 7x speedup over the OpenMP, FFTW-based code on the Tesla P100. We also present performance analysis of the GPU-accelerated code on different GPU architectures: Kepler, Pascal and Volta.
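    The cuFFT-plus-unified-memory approach mentioned in that abstract can be shown with a small standalone example; the grid size here is an arbitrary assumption, not the paper's configuration. Allocating with cudaMallocManaged lets the same pointer be initialised on the host and transformed by cuFFT on the device, and on Pascal-class and newer GPUs it also allows grids larger than device memory to be paged in on demand.

        // Small in-place 3D cuFFT over a unified-memory buffer; compile with nvcc and -lcufft.
        #include <cufft.h>
        #include <cuda_runtime.h>
        #include <cstdio>

        int main() {
            const int nx = 256, ny = 256, nz = 256;            // illustrative grid size
            size_t n = (size_t)nx * ny * nz;

            cufftComplex *field;
            cudaMallocManaged(&field, n * sizeof(cufftComplex));   // unified memory

            for (size_t i = 0; i < n; ++i) { field[i].x = 1.0f; field[i].y = 0.0f; }

            cufftHandle plan;
            cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C);
            cufftExecC2C(plan, field, field, CUFFT_FORWARD);   // in-place forward transform
            cudaDeviceSynchronize();

            printf("DC mode after FFT: (%f, %f)\n", field[0].x, field[0].y);

            cufftDestroy(plan);
            cudaFree(field);
            return 0;
        }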
    Over time, more and more data is being produced, and high-performance computing is needed to process this big data. One solution is parallel processing on a Graphics Processing Unit (GPU). In theory, mathematical computation on a GPU should always be faster than on a CPU, because a GPU has hundreds of Arithmetic Logic Units (ALUs) while a CPU has fewer than 10. However, to process data on the GPU, we must explicitly transfer it from RAM to the GPU's global memory, and this transfer carries a fairly high cost. In this research, we analyze the performance of the GPU compared to the CPU for two mathematical computations: one-dimensional vector addition and two-dimensional matrix multiplication. From the experimental results, we conclude that for one-dimensional vector addition, however large the data, the CPU is preferable to the GPU: the cost of data transfer outweighs the acceleration gained from parallel computation. For two-dimensional matrix multiplication with matrices larger than 96 x 96 floating-point elements, the GPU is preferable to the CPU.
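    The transfer-cost point in that last abstract is easy to reproduce: for one-dimensional vector addition the host-device copies dominate the kernel time. The following timing sketch (vector size and launch configuration are illustrative choices) times the transfers and the kernel separately with CUDA events.

        // Timing sketch: host<->device transfers versus the vector-add kernel itself.
        #include <cuda_runtime.h>
        #include <cstdio>
        #include <vector>

        __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) c[i] = a[i] + b[i];
        }

        int main() {
            const int n = 1 << 24;                             // ~16M floats, illustrative
            std::vector<float> ha(n, 1.f), hb(n, 2.f), hc(n);
            float *da, *db, *dc;
            cudaMalloc(&da, n * sizeof(float));
            cudaMalloc(&db, n * sizeof(float));
            cudaMalloc(&dc, n * sizeof(float));

            cudaEvent_t t0, t1, t2, t3;
            cudaEventCreate(&t0); cudaEventCreate(&t1);
            cudaEventCreate(&t2); cudaEventCreate(&t3);

            cudaEventRecord(t0);
            cudaMemcpy(da, ha.data(), n * sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(db, hb.data(), n * sizeof(float), cudaMemcpyHostToDevice);
            cudaEventRecord(t1);
            vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
            cudaEventRecord(t2);
            cudaMemcpy(hc.data(), dc, n * sizeof(float), cudaMemcpyDeviceToHost);
            cudaEventRecord(t3);
            cudaEventSynchronize(t3);

            float msIn, msKernel, msOut;
            cudaEventElapsedTime(&msIn, t0, t1);
            cudaEventElapsedTime(&msKernel, t1, t2);
            cudaEventElapsedTime(&msOut, t2, t3);
            printf("H2D %.2f ms, kernel %.2f ms, D2H %.2f ms\n", msIn, msKernel, msOut);

            cudaFree(da); cudaFree(db); cudaFree(dc);
            return 0;
        }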