    XOR Hashing Algorithms to Measured Flows at the High-Speed Link
5 Citations · 9 References · 10 Related Papers
    Abstract:
Flow sampling algorithms and data streaming algorithms have widely adopted hashing algorithms to detect flow information at high-speed links, for applications such as estimating the flow distribution, counting flows, and detecting heavy-tailed flows. A hashing algorithm transforms a key into a hash value. The hashing algorithms used in high-speed networks require some specific properties: uniformity of distribution and computation speed. In this paper, we analyze the XOR operator in the hash-based flow selection function at high-speed links, and prove that the bitwise XOR operator can increase the uniformity of the distribution. The XOR-based hashing algorithms are evaluated by mathematical analysis and trace-driven experiments on CERNET.
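As a concrete illustration of the idea, here is a minimal C sketch of hash-based flow selection using bitwise XOR; the 5-tuple layout, names, and the 1/256 sampling mask are assumptions for illustration, not the paper's code. XOR-folding the header fields mixes every input bit into the result, which is how the XOR operator helps uniformity:

```c
#include <stdint.h>

/* Hypothetical flow key: the usual 5-tuple (an assumption, not the
 * paper's exact definition). */
typedef struct {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
} flow_key_t;

static uint32_t xor_fold_hash(const flow_key_t *k) {
    uint32_t h = k->src_ip ^ k->dst_ip;                /* fold addresses */
    h ^= ((uint32_t)k->src_port << 16) | k->dst_port;  /* fold ports */
    h ^= k->protocol;                                  /* fold protocol */
    h ^= h >> 16;                                      /* mix high bits down */
    return h;
}

/* Select a flow when its hash falls in the sampled range, e.g.
 * sample roughly 1/256 of flows by masking the low 8 bits. */
static int select_flow(const flow_key_t *k) {
    return (xor_fold_hash(k) & 0xFF) == 0;
}
```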
    Keywords:
    Bitwise operation
    Speedup
Trace-driven simulation
The Preconditioned Conjugate Gradient (PCG) method is widely used for solving linear systems of equations with sparse matrices. A recent version of PCG, Pipelined PCG, eliminates the dependencies in the computations of the PCG algorithm so that the non-dependent computations can be overlapped with communication. In this paper, we propose three methods for efficient execution of the Pipelined PCG algorithm on GPU-accelerated heterogeneous architectures. The first two methods achieve task parallelism using asynchronous executions of different tasks on CPU cores and the GPU. The third method achieves data parallelism by decomposing the workload between the CPU and GPU based on a performance model. The performance model takes into account the relative performance of the CPU cores and the GPU using some initial executions and performs a 2D data decomposition. We also implement optimization strategies such as kernel fusion for the GPU and merging vector operations for the CPU. Our methods give up to 8x speedup and on average 3x speedup over the PCG CPU implementations of the Paralution and PETSc libraries. They also give up to 5x speedup and on average 1.45x speedup over the PCG GPU implementations of the Paralution and PETSc libraries. The third method also provides an efficient solution for solving problems that cannot fit into GPU memory, giving up to 2.5x speedup for such problems.
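The load-split step of the third method can be pictured with a small sketch. The ratio formula and names below are assumptions for illustration, not the paper's exact performance model: time a few initial iterations on each device, then assign rows in proportion to measured throughput.

```c
#include <stdio.h>

/* rows_total: matrix rows to distribute; t_cpu, t_gpu: measured
 * seconds per iteration from the initial calibration runs. */
static long gpu_rows(long rows_total, double t_cpu, double t_gpu) {
    double gpu_share = t_cpu / (t_cpu + t_gpu); /* faster device gets more */
    return (long)(rows_total * gpu_share);
}

int main(void) {
    long n = 1000000;
    long g = gpu_rows(n, 0.020, 0.004); /* GPU 5x faster -> ~83% of rows */
    printf("GPU rows: %ld, CPU rows: %ld\n", g, n - g);
    return 0;
}
```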
    Speedup
Kernel (computing)
    Citations (0)
In a previous report (EGG-SPAG-7682, April 1987) we described the first results of our computational experiments on the Intel iPSC-286 (a 32-node hypercube with nodes based on the 80286/80287 chip) with certain algorithms for concurrent computational continuum dynamics. The first results were disappointing: the best speedup factors were less than 4 out of a possible 32. Modifications of our earlier algorithms to better fit the iPSC architecture led to significant improvement: the best speedup factors on these improved algorithms are now in excess of 24 out of 32. The earlier parallel algorithms had speedup efficiencies of 90% or better on several other parallel processors, namely the HEP, ELXSI/10, and CRAY X-MP/4. These machines handled the global communications required by the earlier parallel algorithms more efficiently than the iPSC. The algorithmic modifications yielding the greatest speedup factor improvements reduced the amount of communication relative to the amount of computation. 11 refs., 1 tab.
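The closing observation, that reducing communication relative to computation drove the gains, can be illustrated with a toy latency model (an assumption for illustration, not the report's analysis):

```c
#include <stdio.h>

/* Per-step work w is split over n nodes, but each step also pays a
 * fixed communication cost c; speedup saturates once c dominates. */
static double model_speedup(double w, double c, int n) {
    return w / (w / (double)n + c);
}

int main(void) {
    /* Same 32 nodes; only the communication share changes. */
    printf("heavy comm: %.1f\n", model_speedup(32.0, 8.0, 32)); /* ~3.6  */
    printf("light comm: %.1f\n", model_speedup(32.0, 0.3, 32)); /* ~24.6 */
    return 0;
}
```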
    Speedup
    Intel iPSC
    Citations (1)
Memory optimization is an important strategy for achieving high performance in sequence alignment implemented with CUDA on GPGPU. The Smith-Waterman (SW) algorithm is the most sensitive algorithm widely used for local sequence alignment, but it is very time consuming. Although several parallel methods have been used in previous studies and have shown good performance, the advantages of the GPGPU memory hierarchy are still not fully exploited. This paper presents a new parallel method on GPGPU that uses on-chip memory more efficiently to optimize the parallel Smith-Waterman sequence alignment presented by Gregory M. Striemer. To minimize the cost of data transfers, on-chip shared memory is used to store intermediate results. Constant memory is also used effectively in our implementation of the parallel Smith-Waterman algorithm. Using these two kinds of on-chip memory decreases long-latency memory access operations and reduces the demand for global memory when aligning longer sequences. The experimental results show a 1.66x to 3.16x speedup over Gregory's parallel SW on GPGPU in terms of execution time, and a 19.70x average speedup and a 22.43x peak speedup over serial SW in terms of clock cycles on our computer platform.
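For context, the recurrence that dominates the SW running time is shown below in a minimal serial C sketch with a linear gap penalty; the scoring constants are illustrative, and the paper's GPU kernel keeps the intermediate results in shared memory and the scoring parameters in constant memory rather than computing serially like this:

```c
#include <string.h>

#define MATCH     2   /* illustrative scores, not the paper's */
#define MISMATCH -1
#define GAP       1

static int max4(int a, int b, int c, int d) {
    int m = a > b ? a : b;
    m = m > c ? m : c;
    return m > d ? m : d;
}

/* Best local alignment score of q (length m) vs s (length n).
 * H is a caller-provided (m+1) x (n+1) row-major buffer. */
int sw_score(const char *q, int m, const char *s, int n, int *H) {
    int best = 0;
    memset(H, 0, sizeof(int) * (size_t)(m + 1) * (size_t)(n + 1));
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            int sub = (q[i-1] == s[j-1]) ? MATCH : MISMATCH;
            int h = max4(0,
                         H[(i-1)*(n+1) + (j-1)] + sub, /* diagonal       */
                         H[(i-1)*(n+1) + j] - GAP,     /* gap in s       */
                         H[i*(n+1) + (j-1)] - GAP);    /* gap in q       */
            H[i*(n+1) + j] = h;
            if (h > best) best = h;                    /* local maximum  */
        }
    }
    return best;
}
```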
    Speedup
    Smith–Waterman algorithm
    Memory hierarchy
    Performance Improvement
• Multi-core speedup
— Speedup highly depends on the parallelizability of the algorithms (serial dependencies vs. independent computation; see the sketch after this list)
— Results indicate the performance of the benchmark
• SIMD speedup
— The achieved speedup of 1.3 is considerably low
— Speedup highly depends on efficient utilization of the AltiVec unit, e.g. constant availability of input data
• Compiler optimization
— Speedup between 1.5 and 1.7
— The higher the pre-compiler optimization, the lower the effect of compiler optimization (cf. hogC, which has the highest degree of pre-compiler optimization)
— Compiler optimization can partly compensate for pre-compiler optimizations of the code
• Overall conclusions
— Modern multi-core processors with SIMD features provide a huge performance boost compared to single-core variants (for average cases)
— The overall performance of CPUs is still outperformed by GPUs
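The distinction between serial dependencies and independent computation is easy to see in code. An illustrative sketch, not the benchmark's code: the first loop has fully independent iterations, so SIMD lanes and extra cores all help; the second carries a loop-carried dependency, so each iteration must wait for the previous one and parallel hardware sits idle.

```c
void scale(float *out, const float *in, float a, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a * in[i];            /* independent: vectorizes well */
}

float recurrence(const float *in, int n) {
    float p = 1.0f;
    for (int i = 0; i < n; i++)
        p = p * in[i] + 1.0f;          /* needs the previous p: serial */
    return p;
}
```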
    Speedup
    SIMD
Benchmark (computing)
    Interprocedural optimization
    Multi-core processor
    Citations (0)
The design trend towards CMPs has made the simulation of multiprocessor systems a necessity and has also made multiprocessor systems widely available. While a serial multiprocessor simulation necessarily imposes a linear slowdown, running such a simulation in parallel may help mitigate this effect. In this paper we document our experiences with two different methods of parallelizing Sparc Sulima, a simulator of UltraSPARC IIICu-based multiprocessor systems. In the first approach, a simple interconnect model within the simulator is parallelized non-deterministically using careful locking. In the second, a detailed interconnect model is parallelized while preserving determinism using parallel discrete event simulation (PDES) techniques. While both approaches demonstrate a threefold speedup using 4 threads on workloads from the NAS parallel benchmarks, speedup proved constrained by load balancing between simulated processors. A theoretical model is developed to help understand why the observed speedup is less than ideal. An analysis of the related speed-accuracy tradeoff in the first approach with respect to the simulation time quantum is also given; the results show that, for both serial and parallel simulation, a quantum on the order of a few hundred cycles represents a `sweet spot', but parallel simulation is significantly more accurate for a given quantum size. As with the speedup analysis, these effects are workload dependent.
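A hedged sketch of the quantum-based synchronization the speed-accuracy analysis refers to; the names and structure are assumptions, since Sparc Sulima's internals are not shown in the abstract. Each thread advances one simulated processor for a fixed quantum of cycles and then synchronizes, so simulated clocks never drift apart by more than one quantum; larger quanta mean less synchronization overhead but lower timing accuracy.

```c
#include <omp.h>

#define NCPUS 4   /* assumed number of simulated processors */

static void simulate_cycles(int cpu, long cycles) {
    (void)cpu; (void)cycles;   /* advance the per-CPU model; stubbed here */
}

static void run(long total_cycles, long quantum) {
    #pragma omp parallel num_threads(NCPUS)
    {
        int cpu = omp_get_thread_num();
        for (long t = 0; t < total_cycles; t += quantum) {
            simulate_cycles(cpu, quantum);  /* independent within a quantum */
            #pragma omp barrier             /* re-synchronize simulated time */
        }
    }
}

int main(void) {
    run(100000, 200);   /* a few hundred cycles: the reported sweet spot */
    return 0;
}
```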
    Speedup
    Discrete-Event Simulation
    Symmetric multiprocessor system
    Citations (11)
How much speedup can we expect for large scientific parallel programs running on supercomputers? For insight into this problem we extend the parallel processing environment currently existing on the Cray X-MP (a shared-memory multiprocessor with at most four processors) to a simulated N-processor environment, where N may exceed the machine's four physical processors. Several large scientific parallel programs from Los Alamos National Laboratory were run in this simulated environment, and speedups were predicted. A speedup of 14.4 on 16 processors was measured for one of the three most used codes at the Laboratory.
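For a rough sense of what a 14.4x speedup on 16 processors implies, one can read the number through Amdahl's law; this is an illustrative assumption, not the paper's own model. Solving S = 1/(s + (1 - s)/N) for the serial fraction s gives s = (N/S - 1)/(N - 1):

```c
#include <stdio.h>

/* Serial fraction implied by Amdahl's law for speedup S on N processors. */
static double serial_fraction(double speedup, double n) {
    return (n / speedup - 1.0) / (n - 1.0);
}

int main(void) {
    /* 14.4x on 16 processors implies roughly 0.7% serial work. */
    printf("s = %.4f\n", serial_fraction(14.4, 16.0));
    return 0;
}
```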
    Speedup
    Parallel processing
    Citations (0)
This paper presents the speedup achieved through parallelization of code for computing π. The code is implemented in C# with the .NET framework and in C with OpenMP, on a machine with an Intel i7 processor. Parallelizing this code, an embarrassingly parallel problem, should show linear speedup, but as shown in this paper, that is not what we observed. The differences in speedup between OpenMP and the Task Parallel Library, hereinafter referenced as TPL, are demonstrated by calculating speedup in different scenarios. The problem remains the same across the scenarios, but the number of iterations and the number of cores activated are changed. Finally, results are presented comparing the time needed for execution of the serial and parallel computations. Ultimately, the results show that OpenMP is the parallelization tool we would advise for solving problems similar to the one considered in this paper.
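For reference, a minimal C/OpenMP version of the kind of π computation described above; this is a sketch of the standard midpoint-rule integration of 4/(1+x²) on [0,1], not the paper's exact code:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const long n = 100000000;           /* number of slices */
    const double h = 1.0 / (double)n;
    double sum = 0.0;

    /* Each iteration is independent; the reduction combines partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = ((double)i + 0.5) * h;   /* midpoint of slice i */
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi ~= %.12f\n", sum * h);
    return 0;
}
```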
    Speedup
Code (computer programming)
    Automatic parallelization
Efficient Implementation of 3D Finite Difference Schemes on Recent Processors
Abstract:
In this paper a solver is introduced that solves a problem set modelled by the Burgers equation using the finite difference method: forward in time and central in space. The solver is parallelized and optimized for the Intel Xeon Phi 7120P as well as the Intel Xeon E5-2699v3 processor to investigate the differences in performance between the two architectures. Optimized data access and layout have been implemented to ensure good cache utilization. Loop tiling strategies are used to adjust data access with respect to the L2 cache size. Compiler hints describing aligned memory access are used to support vectorization on both processors. Additionally, prefetching strategies and streaming stores have been evaluated for the Intel Xeon Phi. Parallelization was done using OpenMP and MPI. The OpenMP-based parallelization for native execution on the Xeon Phi yielded a raw performance of nearly 100 GFLOP/s, reaching a speedup of almost 50 at 83% parallel efficiency. An OpenMP implementation on the E5-2699v3 (Haswell) processor produced up to 292 GFLOP/s, reaching a speedup of almost 31 at 85% parallel efficiency. For comparison, a mixed implementation interleaving communication with computation reached 267 GFLOP/s at a speedup of 28 with 87% parallel efficiency. Running a pure MPI implementation on PDC's Beskow supercomputer with 16 nodes yielded a total performance of 1450 GFLOP/s, and 2325 GFLOP/s for a larger problem set, reaching speedups of 170 and 290 at parallel efficiencies of 33.3% and 56%, respectively. An analysis based on the roofline performance model shows that the computations were bound by the L2 cache bandwidth, suggesting good L2 cache utilization on both the Haswell and Xeon Phi architectures. Xeon Phi performance could probably be improved by also using MPI. Keeping in mind the technological progress of the computational cores in the Haswell processor, both processors perform well. Restructuring the stencil computations into a more compiler-friendly form might improve performance further, as the compiler could then optimize more for the target platform. The experiments on the Cray system Beskow showed an increase in efficiency from 33.3% to 56% for the larger problem, illustrating good weak scaling. This suggests that problem sizes should increase accordingly for larger numbers of nodes in order to achieve high efficiency.
Frederick Ceder
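A compact illustration of the loop-tiling strategy mentioned above, shown for a 2D stand-in of the 3D scheme; the tile size, names, and the simplified 5-point update are assumptions for illustration, not the thesis code:

```c
#define TILE 64   /* assumed tile width, tuned so a tile's data fits in L2 */

/* Forward-in-time, central-in-space style update on an nx x ny grid.
 * The j loop is blocked so each tile's working set stays in L2, and
 * the innermost loop walks contiguous memory to help vectorization. */
void step(double *restrict unew, const double *restrict u,
          int nx, int ny, double c) {
    #pragma omp parallel for
    for (int jj = 1; jj < ny - 1; jj += TILE) {
        int jend = (jj + TILE < ny - 1) ? jj + TILE : ny - 1;
        for (int i = 1; i < nx - 1; i++)
            for (int j = jj; j < jend; j++)   /* contiguous in memory */
                unew[i*ny + j] = u[i*ny + j]
                    + c * (u[(i-1)*ny + j] + u[(i+1)*ny + j]
                         + u[i*ny + j-1] + u[i*ny + j+1]
                         - 4.0 * u[i*ny + j]);
    }
}
```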
    Speedup
    Xeon Phi
    Solver
Vectorization (parallel computing)
    Xeon
    Stencil
    Citations (0)