The ICON-A model for direct QBO simulations on GPUs (version icon-cscs:baf28a514)

Geoscientific model development (2022)

M. A. Giorgetta William Sawyer Xavier Lapillonne Panagiotis Adamidis Dmitry Alexeev Valentin Clément Remo Dietlicher Jan Frederik Engels Monika Esch Henning Franke Claudia Frauen Walter M. Hannah B. R. Hillman Luis Kornblueh Philippe Marti Matthew Norman Robert Pincus Sebastian Rast Daniel Reinert Reiner Schnur Uwe Schulzweida Björn Stevens

Abstract:

Classical numerical models for the global atmosphere, as used for numerical weather forecasting or climate research, have been developed for conventional central processing unit (CPU) architectures. This hinders the employment of such models on current top-performing supercomputers, which achieve their computing power with hybrid architectures, mostly using graphics processing units (GPUs). Thus also scientific applications of such models are restricted to the lesser computer power of CPUs. Here we present the development of a GPU-enabled version of the ICON atmosphere model (ICON-A), motivated by a research project on the quasi-biennial oscillation (QBO), a global-scale wind oscillation in the equatorial stratosphere that depends on a broad spectrum of atmospheric waves, which originates from tropical deep convection. Resolving the relevant scales, from a few kilometers to the size of the globe, is a formidable computational problem, which can only be realized now on top-performing supercomputers. This motivated porting ICON-A, in the specific configuration needed for the research project, in a first step to the GPU architecture of the Piz Daint computer at the Swiss National Supercomputing Centre and in a second step to the JUWELS Booster computer at the Forschungszentrum Jülich. On Piz Daint, the ported code achieves a single-node GPU vs. CPU speedup factor of 6.4 and allows for global experiments at a horizontal resolution of 5 km on 1024 computing nodes with 1 GPU per node with a turnover of 48 simulated days per day. On JUWELS Booster, the more modern hardware in combination with an upgraded code base allows for simulations at the same resolution on 128 computing nodes with 4 GPUs per node and a turnover of 133 simulated days per day. Additionally, the code still remains functional on CPUs, as is demonstrated by additional experiments on the Levante compute system at the German Climate Computing Center. While the application shows good weak scaling over the tested 16-fold increase in grid size and node count, making also higher resolved global simulations possible, the strong scaling on GPUs is relatively poor, which limits the options to increase turnover with more nodes. Initial experiments demonstrate that the ICON-A model can simulate downward-propagating QBO jets, which are driven by wave–mean flow interaction.
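The two turnover figures in the abstract can be put on a common footing by dividing by the number of GPUs used. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted above; the function name `sdpd_per_gpu` and the variable names are illustrative and do not come from the paper.

```python
def sdpd_per_gpu(sdpd, nodes, gpus_per_node):
    """Simulated days per day (SDPD) delivered per GPU (illustrative helper)."""
    return sdpd / (nodes * gpus_per_node)

# Figures quoted in the abstract for the 5 km global configuration.
piz_daint = sdpd_per_gpu(sdpd=48, nodes=1024, gpus_per_node=1)
juwels = sdpd_per_gpu(sdpd=133, nodes=128, gpus_per_node=4)

# Ratio of per-GPU throughput between the two systems.
ratio = juwels / piz_daint

print(f"Piz Daint:      {piz_daint:.4f} SDPD per GPU")
print(f"JUWELS Booster: {juwels:.4f} SDPD per GPU")
print(f"Ratio:          {ratio:.2f}")
```

Under these numbers the per-GPU throughput is roughly 0.047 SDPD on Piz Daint and 0.260 SDPD on JUWELS Booster, a ratio of about 5.5. As the abstract notes, that difference reflects both the newer hardware and the upgraded code base, so it should not be read as a pure hardware comparison.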

Keywords:

Porting

Graphics processing unit

Speedup

Icon

Topics:

Meteorological Phenomena and Simulations

Climate variability and models

Atmospheric Ozone and Climate

10.5194/gmd-15-6985-2022

Implementation of a Non-bonded Interaction Calculation Algorithm for the Cell Architecture

Lecture notes in computer science (2009)

Э. С. Фомин Nikolay A. Alemasov

Porting

Speedup

10.1007/978-3-642-03275-2_39

Citations (5)

Porting μC/OS-II based on NE-STR750

Jie Song Xinlu Li

This paper analyses the system architecture and characteristics of the STR750 and of the real-time OS μC/OS-II. The procedure for porting μC/OS-II to the STR750 is presented in detail. Important porting files and some source code are introduced. Finally, the main tasks of porting μC/OS-II to the NE-STR750 development board with the IAR EWARM IDE are described. This work should make subsequent development of future applications more convenient. The μC/OS-II porting procedure can also serve as a good operating-system case study for computer education.

Porting

Code (set theory)

10.1049/ic:20080292

Citations (0)

Roshydromet supercomputer technologies for numerical weather prediction

Russian Meteorology and Hydrology (2017)

A. I. Bedritskii R. M. Vil’fand D. B. Kiktev Г. С. Ривин

Weather prediction

10.3103/s1068373917070019

Citations (4)

A Fast Implementation of IMRT Algorithm by the GPU

Chengjun Gou

In this paper, we study the speedup obtained when a graphics processing unit (GPU) is used in intensity-modulated radiation therapy (IMRT). The pencil-beam dose-response matrix multiplication in the optimization process is implemented both in Compute Unified Device Architecture (CUDA) running on the GPU and in C running on the CPU. The speedup factors are compared and analyzed. Test results show a maximum relative error of 5.822×10⁻⁷ between the CPU and GPU results, a discrepancy that is clinically acceptable, and speedup factors of 9 to 12 with the GPU.

Speedup

Graphics processing unit

Citations (0)

Porting of Embedded Real-time OS μC/OS-II to DSP

Aeronautical Computing Technique (2008)

Song Zhi-gang

This paper describes the features of the embedded real-time OS μC/OS-II and discusses in detail how it was ported to the TMS320C6416 DSP. It also presents the most important and difficult problems in porting μC/OS-II, tests the kernel of the ported system, and tailors the kernel as well. Multiple tasks running on the ported system show that it works steadily and reliably, and many performance parameters meet the basic demands of embedded development.

Porting

Realization (probability)

Real-time operating system

Citations (0)

Shared virtual memory and generalized speedup

Xian–He Sun Jianping Zhu

Generalized speedup is defined as parallel speed over sequential speed. In this paper the generalized speedup and its relation with other existing performance metrics, such as traditional speedup, efficiency, scalability, etc., are carefully studied. In terms of the introduced asymptotic speed, it is shown that the difference between the generalized speedup and the traditional speedup lies in the definition of the efficiency of uniprocessor processing, which is a very important issue in shared virtual memory machines. A scientific application has been implemented on a KSR-1 parallel computer. Experimental and theoretical results show that the generalized speedup is distinct from the traditional speedup and provides a more reasonable measurement. In the study of different speedups, various causes of superlinear speedup are also presented.

Speedup

Uniprocessor system

10.1109/ipps.1994.288237

Citations (13)

Another view on parallel speedup

Conference on High Performance Computing (Supercomputing) (1990)

Xian–He Sun Lionel M. Ni

In this paper three models of parallel speedup are studied. They are fixed-size speedup, fixed-time speedup and memory-bounded speedup. Two sets of speedup formulations are derived for these three models. One set requires more information and gives more accurate estimation. Another set considers a simplified case and provides a clear picture of possible performance gain of parallel processing. The simplified fixed-size speedup is Amdahl's law. The simplified fixed-time speedup is Gustafson's scaled speedup. The simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as its special cases. This study proposes a new metric for performance evaluation and leads to a better understanding of parallel processing.
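The relationship among the three simplified models can be illustrated with a short sketch. It uses the commonly cited textbook forms of Amdahl's law, Gustafson's scaled speedup, and the memory-bounded speedup, not formulas copied from the paper itself. The symbols are assumptions: `p` is the parallelizable fraction of the original workload, `n` the number of processors, and `g(n)` a hypothetical factor by which the parallel workload grows when memory scales with `n`.

```python
import math

def amdahl_speedup(p, n):
    """Fixed-size speedup: the workload stays constant as n grows."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    """Fixed-time speedup: the parallel part grows linearly with n."""
    return (1.0 - p) + p * n

def memory_bounded_speedup(p, n, g):
    """Memory-bounded speedup: the parallel part grows by a factor g(n)."""
    scaled = p * g(n)
    return ((1.0 - p) + scaled) / ((1.0 - p) + scaled / n)

p, n = 0.9, 10

# g(n) = 1 keeps the workload fixed and recovers Amdahl's law.
fixed_size = memory_bounded_speedup(p, n, lambda k: 1)
# g(n) = n scales the workload with n and recovers Gustafson's law.
fixed_time = memory_bounded_speedup(p, n, lambda k: k)

print(math.isclose(fixed_size, amdahl_speedup(p, n)))
print(math.isclose(fixed_time, gustafson_speedup(p, n)))
```

Choosing `g(n) = 1` or `g(n) = n` reduces the memory-bounded form to the other two, which is the sense in which the abstract describes them as special cases.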

Speedup

10.5555/110382.110450

Citations (105)

Gadget3 on GPUs with OpenACC

Advances in parallel computing (2020)

Antonio Ragagnin Klaus Dolag Mathias Wagner C. Gheller Conradin Roffler

We present preliminary results of a GPU porting of all main Gadget3 modules (gravity computation, SPH density computation, SPH hydrodynamic force, and thermal conduction) using OpenACC directives. Here we assign one GPU to each MPI rank and exploit both the host and accelerator capabilities by overlapping computations on the CPUs and GPUs: while GPUs asynchronously compute interactions between particles within their MPI ranks, CPUs perform tree-walks and MPI communications of neighbouring particles. We profile various portions of the code to understand the origin of our speedup, where we find that a peak speedup is not achieved because of time-steps with few active particles. We run a hydrodynamic cosmological simulation from the Magneticum project, with 2·10⁷ particles, where we find a final total speedup of ≈2. We also present the results of an encouraging scaling test of a preliminary gravity-only OpenACC porting, run in the context of the EuroHack17 event, where the prototype of the porting proved to keep a constant speedup up to 1024 GPUs.

Speedup

Porting

10.3233/apc200043

Citations (3)

Gregex: GPU Based High Speed Regular Expression Matching Engine

Lei Wang Shuhui Chen Yong Tang Jinshu Su

A regular expression matching engine is a crucial piece of infrastructure that is widely used in network security systems such as IDS. We propose Gregex, a Graphics Processing Unit (GPU) based regular expression matching engine for deep packet inspection (DPI). Gregex leverages the computational power and high memory bandwidth of GPUs by storing data in the proper GPU memory space and executing a massive number of GPU threads concurrently to process many packets in parallel. Three optimization techniques, ATP, CAB, and CAT, are proposed to significantly improve the performance of Gregex. On a GTX260 GPU, Gregex achieves a regular expression matching throughput of 126.8 Gbps, which is a speedup of 210× over a traditional CPU-based implementation and a speedup of 7.9× over the state-of-the-art GPU-based regular expression engine.

Speedup

Graphics processing unit

Regular expression

Coprocessor

Deep Packet Inspection

Memory bandwidth

High memory

10.1109/imis.2011.107

Citations (31)

Porting of Real-time Operation System μC/OS-II on C8051F120

Mechanical Engineering & Automation (2009)

Tian Juan

The paper introduces the characteristics of the real-time operating system μC/OS-II and discusses the necessity of porting μC/OS-II to 51-series MCUs. It then presents the specific process of porting μC/OS-II to the C8051F120, which is selected as the porting target. The paper ends by designing a test program to demonstrate that the port succeeded.

Porting

Citations (0)