    Abstract:
    A recent trend in scientific computing is the increasingly important role of co-processors, originally built to accelerate graphics rendering and now used for general high-performance computing. The INFN Computing On Knights and Kepler Architectures (COKA) project focuses on assessing the suitability of co-processor boards for scientific computing in a wide range of physics applications, and on studying the best programming methodologies for these systems. Here we present, in a comparative way, our results in porting a Lattice Boltzmann code to two state-of-the-art accelerators: the NVIDIA K20X and the Intel Xeon Phi. We describe our implementations, analyze the results, and compare them with a baseline architecture based on Intel Sandy Bridge CPUs.
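    The abstract above does not show code, so the following is only a minimal sketch, in C with OpenMP, of the kind of kernel such a port revolves around: a Lattice Boltzmann propagate step over a structure-of-arrays layout, with the outer loop threaded and the inner loop vectorized. The lattice model (D2Q9 here), the array sizes, and the function name are illustrative assumptions, not the actual COKA code.

    /* Minimal sketch (assumptions noted above) of a Lattice Boltzmann
     * "propagate" step: one array per population (structure-of-arrays),
     * outer loop threaded, inner loop vectorized. */
    #include <stddef.h>

    #define NPOP 9        /* assumed population count (D2Q9)              */
    #define NX   1024     /* assumed lattice size; halo handling omitted  */
    #define NY   1024

    static const int cx[NPOP] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
    static const int cy[NPOP] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };

    /* f_src and f_dst each hold one contiguous NX*NY plane per population */
    void propagate(const double *restrict f_src, double *restrict f_dst)
    {
        for (int p = 0; p < NPOP; ++p) {
            const double *src = f_src + (size_t)p * NX * NY;
            double       *dst = f_dst + (size_t)p * NX * NY;
            #pragma omp parallel for
            for (int x = 1; x < NX - 1; ++x) {
                #pragma omp simd
                for (int y = 1; y < NY - 1; ++y) {
                    /* pull population p from the upstream neighbour site */
                    dst[x * NY + y] = src[(x - cx[p]) * NY + (y - cy[p])];
                }
            }
        }
    }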
    Keywords:
    Porting
    Xeon Phi
    Implementation
    Geant4-MT is the multi-threaded version of the Geant4 particle transport code. (1, 2) The key goals for the design of Geant4-MT have been a) to reduce the memory footprint of the multi-threaded application compared to the use of separate jobs and processes; b) to allow an easy migration of existing applications; and c) to use many threads or cores efficiently, by scaling up to tens and potentially hundreds of workers. The first public release of a Geant4-MT prototype was made in 2011. We report on the revision of Geant4-MT for inclusion in the production-level release scheduled for the end of 2013. This has involved significant re-engineering of the prototype in order to incorporate it into the main Geant4 development line, and the porting of the Geant4-MT threading code to additional platforms. In order to make the porting of applications as simple as possible, refinements addressed the needs of standalone applications. Further adaptations were created to improve the fit with the frameworks of High Energy Physics (HEP) experiments. We report on performance measurements on Intel Xeon™ and AMD Opteron™ processors, and on the first trials of Geant4-MT on the Intel Many Integrated Core (MIC) architecture, in the form of the Xeon Phi™ co-processor. (3) These indicate near-linear scaling through about 200 threads on 60 cores, when holding fixed the number of events per thread.
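    As a hedged illustration of the scaling measurement quoted above (not the Geant4-MT API, which is C++ and uses its own run manager), the C/OpenMP sketch below keeps the number of events per worker thread fixed, so the total workload grows with the thread count; near-linear scaling then shows up as a roughly constant elapsed time. simulate_event() is a hypothetical stand-in for the per-event transport work.

    #include <omp.h>
    #include <stdio.h>

    #define EVENTS_PER_THREAD 100   /* fixed per-worker workload */

    /* hypothetical stand-in for per-event particle transport work */
    static void simulate_event(int thread_id, int event_id)
    {
        volatile double x = thread_id + event_id + 1.0;
        for (int i = 0; i < 1000000; ++i)
            x = x * 1.0000001;
        (void)x;
    }

    int main(void)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            for (int e = 0; e < EVENTS_PER_THREAD; ++e)
                simulate_event(tid, e);   /* independent events per worker */
        }
        double t1 = omp_get_wtime();
        /* with near-linear scaling, elapsed time stays roughly constant
         * as the thread count grows, since the work per thread is fixed */
        printf("threads=%d  elapsed=%.3f s\n", omp_get_max_threads(), t1 - t0);
        return 0;
    }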
    Porting
    Xeon Phi
    Threading (computing)
    Xeon
    Memory footprint
    Citations (18)
    This paper presents experiences using Intel's KNL MIC platform on hardware that will be available in the Stampede 2 cluster launching in Summer 2017. We focus on 1) the porting of existing scientific software and 2) observing the performance of this software. Additionally, we comment on both the ease of use of KNL and the observed performance of KNL as compared to the previous-generation "Knights Ferry" and "Knights Corner" Xeon Phi MICs [32]. Fortran, C, and C++ applications are chosen from a variety of scientific disciplines including computational fluid dynamics, numerical linear algebra, uncertainty quantification, finite element methods, and computational chemistry.
    Xeon Phi
    Porting
    Xeon
    Fortran
    Linear algebra
    Vectorization
    Citations (4)
    With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogeneous High Performance Computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize the Xeon Phi, Intel's coprocessor based upon the new Many Integrated Core architecture. We discuss the steps taken to offload data to the coprocessor, along with algorithmic modifications to aid faster processing on the many-core architecture and to make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi coprocessors. Finally, performance is compared against results achieved with the GPU implementation of Splotch.
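    The offload step described above can be sketched, under assumptions, with the Intel compiler's legacy offload extensions for the Knights Corner Xeon Phi: particle data are copied to the coprocessor, a threaded kernel runs there, and the accumulated image is copied back. The array names and the toy render_particles() kernel are illustrative, not the actual Splotch code.

    #include <stdlib.h>

    #define N_PART (1 << 20)

    /* compile this function for both host and coprocessor */
    __attribute__((target(mic)))
    static void render_particles(const float *x, const float *y,
                                 float *image, int n, int width)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            int px = (int)(x[i] * width) % width;
            int py = (int)(y[i] * width) % width;
            #pragma omp atomic
            image[py * width + px] += 1.0f;    /* toy splatting step */
        }
    }

    int main(void)
    {
        int width = 512;
        float *x     = malloc(N_PART * sizeof *x);
        float *y     = malloc(N_PART * sizeof *y);
        float *image = calloc((size_t)width * width, sizeof *image);
        for (int i = 0; i < N_PART; ++i) {        /* toy particle positions */
            x[i] = (float)rand() / RAND_MAX;
            y[i] = (float)rand() / RAND_MAX;
        }

        /* copy particle data in, bring the accumulated image back out */
        #pragma offload target(mic:0) in(x, y : length(N_PART)) \
                                      inout(image : length(width * width))
        render_particles(x, y, image, N_PART, width);

        free(x); free(y); free(image);
        return 0;
    }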
    Porting
    Xeon Phi
    Coprocessor
    Xeon
    Multi-core processor
    Citations (0)
    In this paper, we report our experience of porting and optimizing a legacy seismic acoustic modelling application on the multi-core and many-core hybrid architecture of PARAM Yuva II. The application was developed using MPI and used domain decomposition as the parallelization approach across parallel processors. The same application has been modified for domain decomposition at the node level, and the parallel performance was improved using OpenMP within the node. The resulting application was optimized using different optimization techniques for the multi-core architecture of Intel's Xeon, which further improved the performance and efficiency of the application. The optimized application was then ported to the many-core architecture of Intel's Xeon Phi in native and symmetric modes. The details of the porting, optimizations and execution on Intel's Xeon and on the Xeon Phi in native and symmetric modes are given in the paper. The performance, scalability and efficiency of the application have been studied on multi-core and many-core processors, and experimental results are presented.
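    A minimal sketch of the hybrid scheme described above, assuming a toy 1-D stencil rather than the actual seismic kernel: MPI ranks own subdomains (domain decomposition across nodes), while OpenMP threads share the work inside each subdomain.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define LOCAL_N 1000000     /* grid points per MPI rank (assumed) */

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double *u = calloc(LOCAL_N + 2, sizeof *u);   /* +2 halo points */
        double *v = calloc(LOCAL_N + 2, sizeof *v);

        /* exchange halo points with the neighbouring subdomains */
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;
        MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[LOCAL_N],     1, MPI_DOUBLE, right, 1,
                     &u[0],           1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* OpenMP threads update the interior of the local subdomain */
        #pragma omp parallel for
        for (int i = 1; i <= LOCAL_N; ++i)
            v[i] = 0.5 * (u[i - 1] + u[i + 1]);       /* toy stencil */

        if (rank == 0)
            printf("ranks=%d  threads/rank=%d\n", nranks, omp_get_max_threads());

        free(u); free(v);
        MPI_Finalize();
        return 0;
    }

    In symmetric mode the same binary would simply be launched with some MPI ranks placed on the host CPUs and others on the Xeon Phi cards, matching the execution modes the abstract describes.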
    Porting
    Xeon Phi
    Xeon
    Multi-core processor
    We report on our investigations into the viability of the ARM processor and the Intel Xeon Phi co-processor for scientific computing. We describe our experience porting software to these processors and running benchmarks using real physics applications, in order to explore their potential for production physics processing.
    Porting
    Xeon Phi
    Xeon
    In this paper we report our experiences in porting the FEASTFLOW software infrastructure to the Intel Xeon Phi coprocessor. Our efforts involved both the evaluation of programming models, including OpenCL, POSIX threads and OpenMP, and typical optimization strategies such as parallelization and vectorization. Since the straightforward porting of the already existing OpenCL version of the code encountered performance problems that require further analysis, we focused our efforts on the implementation and optimization of two core building-block kernels for FEASTFLOW: an axpy vector operation and a sparse matrix-vector multiplication (spmv). Our experimental results on these building blocks indicate that the Xeon Phi can serve as a promising accelerator for our software infrastructure.
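    The two building-block kernels named above are standard enough that a plain C/OpenMP sketch can illustrate them; this is an assumption-level rendering, not an excerpt from FEASTFLOW: an axpy update y = a*x + y, which is streaming and trivially vectorizable, and a CSR sparse matrix-vector product, whose gather-like column accesses are what typically limits spmv performance on the Xeon Phi.

    #include <stddef.h>

    /* y <- a*x + y : streaming, memory-bandwidth bound, trivially vectorizable */
    void axpy(size_t n, double a, const double *restrict x, double *restrict y)
    {
        #pragma omp parallel for simd
        for (size_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    /* y <- A*x with A in compressed sparse row (CSR) format:
     * row_ptr[i] .. row_ptr[i+1] indexes the nonzeros of row i */
    void spmv_csr(size_t nrows,
                  const size_t *restrict row_ptr,
                  const size_t *restrict col_idx,
                  const double *restrict val,
                  const double *restrict x,
                  double       *restrict y)
    {
        #pragma omp parallel for schedule(dynamic, 64)
        for (size_t i = 0; i < nrows; ++i) {
            double sum = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += val[k] * x[col_idx[k]];   /* gather-like access to x */
            y[i] = sum;
        }
    }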
    Porting
    Xeon Phi
    Coprocessor
    Vectorization
    Citations (12)
    This work describes the challenges presented by porting code to the Intel Xeon Phi coprocessor, as well as opportunities for optimization and tuning. We use micro-benchmarks, code segments, assembly listings and application-level results to illustrate the key issues in porting to the Xeon Phi coprocessor, always keeping in mind both portability and performance. While executing code on the Xeon Phi in native mode is fairly straightforward, it can be a challenge to achieve good performance. The complexity of optimization increases as one introduces offload, distributed offload, or symmetric execution modes. We will initially focus on the fundamental issues that can prevent acceptable performance in native execution, and then address the key issues in data transfers due to either offloaded regions or MPI exchanges with the host CPU. Some of these issues are generic and affect any code using heterogeneous execution (such as the PCIe bandwidth bottleneck), while others are specific to the Xeon Phi and its software environment (such as host/MIC MPI exchanges). We will also make an effort to indicate which issues are specific to this platform and which are of general applicability. In particular we will draw comparisons between the data management models in the Intel Xeon Phi and in the NVIDIA CUDA environment.
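    One of the data-transfer issues mentioned above can be made concrete with the (legacy) Intel offload pragmas: by default a buffer is allocated on the coprocessor, transferred, and freed at every offload, so repeated offloads pay the PCIe cost each time, whereas alloc_if/free_if keep it resident, playing a role similar to an explicit cudaMalloc/cudaMemcpy pair in CUDA. The buffer name, size, and scale() kernel below are illustrative assumptions.

    #include <stdlib.h>

    #define N (1 << 24)

    __attribute__((target(mic)))
    static void scale(double *buf, int n, double s)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            buf[i] *= s;
    }

    int main(void)
    {
        double *buf = malloc(N * sizeof *buf);
        for (int i = 0; i < N; ++i)
            buf[i] = 1.0;

        /* first offload: allocate on the MIC and copy in, but do NOT free */
        #pragma offload target(mic:0) in(buf : length(N) alloc_if(1) free_if(0))
        scale(buf, N, 2.0);

        /* later offloads: reuse the resident buffer, no PCIe traffic for it */
        #pragma offload target(mic:0) nocopy(buf : length(N) alloc_if(0) free_if(0))
        scale(buf, N, 3.0);

        /* final offload: copy the result back and release the MIC buffer */
        #pragma offload target(mic:0) out(buf : length(N) alloc_if(0) free_if(1))
        scale(buf, N, 1.0);

        free(buf);
        return 0;
    }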
    Porting
    Xeon Phi
    Software portability
    Coprocessor
    Code (computing)
    x86
    Xeon
    Citations (35)