Designing Scalable and High-Performance MPI Libraries on Amazon Elastic Fabric Adapter

2019 
Amazon has recently announced a new network interface named Elastic Fabric Adapter (EFA), targeted at tightly coupled HPC workloads. In this paper, we characterize the features, capabilities, and performance of the adapter. We also explore how its transport models, such as Unreliable Datagram (UD) and Scalable Reliable Datagram (SRD), impact the design of high-performance MPI libraries. Our evaluations show that the hardware-level reliability provided by SRD can significantly improve the performance of MPI communication. We also propose a new zero-copy transfer mechanism over unreliable and unordered channels that can reduce the communication latency of large messages. The proposed design also shows significant improvement in collective and application performance over the vendor-provided MPI library.