The ICON-A model for direct QBO simulations on GPUs (version icon-cscs:baf28a514)
M. A. Giorgetta, William Sawyer, Xavier Lapillonne, Panagiotis Adamidis, Dmitry Alexeev, Valentin Clément, Remo Dietlicher, Jan Frederik Engels, Monika Esch, Henning Franke, Claudia Frauen, Walter M. Hannah, B. R. Hillman, Luis Kornblueh, Philippe Marti, Matthew Norman, Robert Pincus, Sebastian Rast, Daniel Reinert, Reiner Schnur, Uwe Schulzweida, Björn Stevens
Citations: 27 · References: 32 · Related papers: 10
Abstract:
Classical numerical models for the global atmosphere, as used for numerical weather forecasting or climate research, have been developed for conventional central processing unit (CPU) architectures. This hinders the employment of such models on current top-performing supercomputers, which achieve their computing power with hybrid architectures, mostly using graphics processing units (GPUs). Scientific applications of such models are thus also restricted to the lower computing power of CPUs. Here we present the development of a GPU-enabled version of the ICON atmosphere model (ICON-A), motivated by a research project on the quasi-biennial oscillation (QBO), a global-scale wind oscillation in the equatorial stratosphere that depends on a broad spectrum of atmospheric waves, which originates from tropical deep convection. Resolving the relevant scales, from a few kilometers to the size of the globe, is a formidable computational problem, which can only be realized now on top-performing supercomputers. This motivated porting ICON-A, in the specific configuration needed for the research project, in a first step to the GPU architecture of the Piz Daint computer at the Swiss National Supercomputing Centre and in a second step to the JUWELS Booster computer at the Forschungszentrum Jülich. On Piz Daint, the ported code achieves a single-node GPU vs. CPU speedup factor of 6.4 and allows for global experiments at a horizontal resolution of 5 km on 1024 computing nodes with 1 GPU per node with a turnover of 48 simulated days per day. On JUWELS Booster, the more modern hardware in combination with an upgraded code base allows for simulations at the same resolution on 128 computing nodes with 4 GPUs per node and a turnover of 133 simulated days per day. Additionally, the code remains functional on CPUs, as is demonstrated by additional experiments on the Levante compute system at the German Climate Computing Center.
While the application shows good weak scaling over the tested 16-fold increase in grid size and node count, which also makes higher-resolution global simulations possible, the strong scaling on GPUs is relatively poor, which limits the options to increase turnover with more nodes. Initial experiments demonstrate that the ICON-A model can simulate downward-propagating QBO jets, which are driven by wave–mean flow interaction.
Keywords:
Porting
Graphics processing unit
Speedup
Icon
Porting
Speedup
Citations (5)
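The turnover figures quoted in the ICON-A abstract can be put on a per-GPU basis with a few lines of arithmetic. The sketch below uses only the node counts, GPUs per node, and simulated days per day stated in the abstract; the `Run` class and its field names are illustrative and not part of ICON. Because the two systems differ in both hardware and code base, the resulting ratio should not be read as a pure hardware comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One 5 km experiment as described in the abstract (names are illustrative)."""
    name: str
    nodes: int
    gpus_per_node: int
    sdpd: float  # simulated days per wall-clock day

    @property
    def gpus(self) -> int:
        return self.nodes * self.gpus_per_node

    @property
    def sdpd_per_gpu(self) -> float:
        return self.sdpd / self.gpus

    def wallclock_days(self, simulated_days: float) -> float:
        """Wall-clock days needed to simulate the given number of days."""
        return simulated_days / self.sdpd


# Figures taken from the abstract.
piz_daint = Run("Piz Daint", nodes=1024, gpus_per_node=1, sdpd=48.0)
juwels = Run("JUWELS Booster", nodes=128, gpus_per_node=4, sdpd=133.0)

for run in (piz_daint, juwels):
    print(
        f"{run.name}: {run.gpus} GPUs, "
        f"{run.sdpd_per_gpu:.4f} simulated days/day per GPU, "
        f"{run.wallclock_days(365):.1f} wall-clock days per simulated year"
    )

ratio = juwels.sdpd_per_gpu / piz_daint.sdpd_per_gpu
print(f"Per-GPU turnover ratio (JUWELS Booster / Piz Daint): {ratio:.2f}")
```

Normalizing by GPU count shows that the higher turnover on JUWELS Booster is reached with half as many GPUs in total.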
This paper analyses the system architecture and characteristics of the STR750 and the real-time OS μC/OS-II. The procedure for porting μC/OS-II to the STR750 is presented in detail. Important porting files and some source code are introduced. Finally, the main tasks of porting μC/OS-II to the NE-STR750 development board with the IAR EWARM IDE are introduced. This work should ease subsequent development for future applications. The porting procedure can also serve as a good operating-system study example for computer education.
Porting
Code (set theory)
Citations (0)
Weather prediction
Citations (4)
In this paper, we study the speedup achieved when a graphics processing unit (GPU) is used in intensity modulated radiation therapy (IMRT). The pencil-beam dose-response matrix multiplication in the optimization process is implemented in compute unified device architecture (CUDA) running on the GPU, and in C running on the CPU. The speedup factors are compared and analyzed. Test results show a maximum relative error of 5.822×10⁻⁷ between the CPU and GPU results, a discrepancy that is clinically acceptable, and speedup factors of 9 to 12 with the GPU.
Speedup
Graphics processing unit
Citations (0)
This paper describes the features of the embedded real-time OS μC/OS-II and discusses in detail how to port μC/OS-II to the TMS320C6416 DSP. It also presents the most important and difficult problems in porting μC/OS-II, tests the core of the ported system, and trims it down as well. Running multiple tasks on the ported system shows that the ported program works steadily and reliably, and many performance parameters meet the basic demands of embedded development.
Porting
Realization (probability)
Real-time operating system
Citations (0)
Generalized speedup is defined as parallel speed over sequential speed. In this paper the generalized speedup and its relation with other existing performance metrics, such as traditional speedup, efficiency, scalability, etc., are carefully studied. In terms of the introduced asymptotic speed, it is shown that the difference between the generalized speedup and the traditional speedup lies in the definition of the efficiency of uniprocessor processing, which is a very important issue in shared virtual memory machines. A scientific application has been implemented on a KSR-1 parallel computer. Experimental and theoretical results show that the generalized speedup is distinct from the traditional speedup and provides a more reasonable measurement. In the study of different speedups, various causes of superlinear speedup are also presented.
Speedup
Uniprocessor system
Citations (13)
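The distinction drawn in the generalized-speedup abstract can be illustrated with a small calculation. This is a sketch under assumptions: speed is taken as work divided by time, the sequential speed is measured on a problem small enough to run efficiently on one processor (one reading of the "asymptotic speed" mentioned above), and every number is invented for illustration rather than taken from the paper.

```python
def speed(work: float, seconds: float) -> float:
    """Work completed per unit time."""
    return work / seconds


def traditional_speedup(t_sequential: float, t_parallel: float) -> float:
    """Sequential time over parallel time for the same problem."""
    return t_sequential / t_parallel


def generalized_speedup(parallel_speed: float, sequential_speed: float) -> float:
    """Parallel speed over sequential speed, as defined in the abstract."""
    return parallel_speed / sequential_speed


# Hypothetical measurements (not from the paper).
WORK = 1000.0        # full problem, in arbitrary work units
T_SEQ_FULL = 500.0   # one processor, full problem, slowed by remote memory access
T_PAR_FULL = 50.0    # many processors, full problem
SMALL_WORK = 100.0   # small problem that runs efficiently on one processor
T_SEQ_SMALL = 25.0   # one processor, small problem

trad = traditional_speedup(T_SEQ_FULL, T_PAR_FULL)
gen = generalized_speedup(speed(WORK, T_PAR_FULL), speed(SMALL_WORK, T_SEQ_SMALL))

print(f"traditional speedup: {trad:.1f}")
print(f"generalized speedup: {gen:.1f}")
```

With these invented numbers the traditional speedup is inflated by the inefficient single-processor run on the full problem, which is one way an apparently superlinear speedup can arise.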
In this paper three models of parallel speedup are studied. They are fixed-size speedup, fixed-time speedup and memory-bounded speedup. Two sets of speedup formulations are derived for these three models. One set requires more information and gives more accurate estimation. Another set considers a simplified case and provides a clear picture of possible performance gain of parallel processing. The simplified fixed-size speedup is Amdahl's law. The simplified fixed-time speedup is Gustafson's scaled speedup. The simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as its special cases. This study proposes a new metric for performance evaluation and leads to a better understanding of parallel processing.
Speedup
Citations (105)
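The simplified forms of the three speedup models named in the abstract above can be written out directly. The sketch below assumes the commonly cited formulations: `f` is the parallelizable fraction of the work, `p` the number of processors, and `g(p)` a user-supplied function describing how the parallel work grows when memory scales with `p`. The function names are illustrative, and the abstract itself does not spell out these formulas.

```python
from typing import Callable


def amdahl(f: float, p: int) -> float:
    """Fixed-size speedup: the problem size stays constant."""
    return 1.0 / ((1.0 - f) + f / p)


def gustafson(f: float, p: int) -> float:
    """Fixed-time speedup: the parallel work grows linearly with p."""
    return (1.0 - f) + f * p


def memory_bounded(f: float, p: int, g: Callable[[int], float]) -> float:
    """Memory-bounded speedup: the parallel work grows by a factor g(p)."""
    scaled_parallel_work = f * g(p)
    return ((1.0 - f) + scaled_parallel_work) / (
        (1.0 - f) + scaled_parallel_work / p
    )


if __name__ == "__main__":
    f, p = 0.95, 64
    print(f"Amdahl:         {amdahl(f, p):.2f}")
    print(f"Gustafson:      {gustafson(f, p):.2f}")
    # g(p) = 1 reproduces Amdahl's law; g(p) = p reproduces Gustafson's law.
    print(f"Memory, g=1:    {memory_bounded(f, p, lambda n: 1.0):.2f}")
    print(f"Memory, g=p:    {memory_bounded(f, p, lambda n: float(n)):.2f}")
    # A growth rate faster than linear (hypothetical choice of g).
    print(f"Memory, g=p^1.5: {memory_bounded(f, p, lambda n: n ** 1.5):.2f}")
```

Choosing `g(p) = 1` or `g(p) = p` recovers the two classical laws, which matches the abstract's statement that the memory-bounded model contains both as special cases.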
We present preliminary results of a GPU port of all main Gadget3 modules (gravity computation, SPH density computation, SPH hydrodynamic force, and thermal conduction) using OpenACC directives. Here we assign one GPU to each MPI rank and exploit both the host and accelerator capabilities by overlapping computations on the CPUs and GPUs: while GPUs asynchronously compute interactions between particles within their MPI ranks, CPUs perform tree-walks and MPI communications of neighbouring particles. We profile various portions of the code to understand the origin of our speedup, where we find that a peak speedup is not achieved because of time-steps with few active particles. We run a hydrodynamic cosmological simulation from the Magneticum project, with 2·10⁷ particles, where we find a final total speedup of ≈2. We also present the results of an encouraging scaling test of a preliminary gravity-only OpenACC port, run in the context of the EuroHack17 event, where the prototype of the port proved to keep a constant speedup up to 1024 GPUs.
Speedup
Porting
Citations (3)
A regular expression matching engine is crucial infrastructure that is widely used in network security systems such as IDS. We propose Gregex, a Graphics Processing Unit (GPU) based regular expression matching engine for deep packet inspection (DPI). Gregex leverages the computational power and high memory bandwidth of GPUs by storing data in the proper GPU memory space and executing massive numbers of GPU threads concurrently to process many packets in parallel. Three optimization techniques, ATP, CAB, and CAT, are proposed to significantly improve the performance of Gregex. On a GTX260 GPU, Gregex achieves a regular expression matching throughput of 126.8 Gbps, which is a speedup of 210× over a traditional CPU-based implementation and a speedup of 7.9× over the state-of-the-art GPU-based regular expression engine.
Speedup
Graphics processing unit
Regular expression
Coprocessor
Deep Packet Inspection
Memory bandwidth
High memory
Citations (31)
The paper introduces the characteristics of the real-time operating system μC/OS-II and discusses the necessity of porting μC/OS-II to 51-series MCUs. Then the specific process of porting μC/OS-II to the C8051F120, which is selected as the porting target, is presented. The paper ends with the design of a test program to prove the success of the port.
Porting
Citations (0)