Low-Reliable Low-Latency Networks Optimized for HPC Parallel Applications

2018 
High-end network standards, such as 400GbE, have introduced Forward Error Correction (FEC) to maintain the same bit error rate (BER) as traditional low-bandwidth interconnection networks. However, the latency overhead of the FEC operation turns out to be higher than the sum of all the other switch operation overheads, e.g., routing computation and switch allocation, and it significantly degrades the performance of parallel applications in HPC systems. In this study, we instead explore a low-latency network design using a Hamming code, which does not guarantee strictly error-free communication. Because this design is consistent with the existing frame format based on the standard Reed-Solomon RS(544,514) code with the DC(64b/66b) direct line code and the TC(256b/257b) transcode, its influence on the design of the other network layers is limited. Interestingly, a large number of parallel applications can tolerate the BER obtained with such a Hamming code. Since accepting this BER reduces switch operation latency, the proposed network using the Hamming code improves the execution time of the NAS Parallel Benchmarks by 56% on average compared to the counterpart RS-FEC networks.
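
To illustrate why a Hamming code is cheap to decode yet not strictly error-free, the following is a minimal sketch of the textbook Hamming(7,4) code. It is only an illustration: the abstract does not state the code parameters used in the proposed network, and the function names here are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (textbook Hamming(7,4)).

    Illustrative only: the code length used in the paper is not given
    in the abstract, so (7,4) is an assumption made for brevity.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return (data_bits, syndrome)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # A non-zero syndrome is the 1-based position of the suspected error.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


def flip(codeword, *positions):
    """Return a copy of codeword with the given 0-based bit positions flipped."""
    c = list(codeword)
    for pos in positions:
        c[pos] ^= 1
    return c
```

Decoding needs only a few XOR operations, which is consistent with the low-latency motivation above. A single flipped bit is corrected, whereas two flipped bits in one codeword are silently miscorrected, which is the sense in which such a code does not guarantee error-free delivery.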