Improving Dependability of Onboard Deep Learning with Resilient TensorFlow

2021 
As a new age in spaceflight approaches, the drive to equip future spacecraft with high-performance computing capabilities is increasing. Many within the industry are looking to leverage solutions enabled by machine learning (ML) and artificial intelligence to enhance mission efficiency. Tasks such as image processing and object tracking are desired for long-duration spaceflight and extravehicular activities. Realizing these applications in practice requires enhancements to onboard processing: ML applications need state-of-the-art processors and hardware accelerators, such as GPUs. However, GPUs are highly susceptible to radiation-induced single-event effects (SEEs). Additionally, missions require a level of safety-criticality that existing commercial-off-the-shelf (COTS) GPUs cannot meet. In an effort to create an end-to-end solution, this work aims to bridge ML-application development with device-architectural awareness to deliver a fault-aware implementation of the TensorFlow framework called Resilient TensorFlow (RTF). By building customized operations into the TensorFlow framework and employing them within the graphs of various models, RTF demonstrates an ability to mask faults that occur during processing while minimizing overhead. Reducing faults during the processing of deep-learning applications brings the space computing industry closer to realizing onboard high-performance computing.
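The abstract does not state which mechanism RTF's customized operations use to mask faults. As an illustration only, the sketch below shows one generic masking technique, redundant execution with an element-wise majority vote, in plain Python rather than TensorFlow. The names `majority_vote`, `resilient_op`, `relu`, and `make_faulty` are hypothetical and do not come from RTF.

```python
from collections import Counter
from typing import Callable, List, Sequence


def majority_vote(replicas: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise majority vote across the outputs of redundant runs."""
    voted = []
    for values in zip(*replicas):
        value, count = Counter(values).most_common(1)[0]
        # A strict majority is required; otherwise the fault is detected
        # but cannot be masked.
        if count <= len(values) // 2:
            raise RuntimeError("no majority: fault could not be masked")
        voted.append(value)
    return voted


def resilient_op(op: Callable[[Sequence[float]], List[float]],
                 x: Sequence[float], copies: int = 3) -> List[float]:
    """Run `op` several times on the same input and vote on the results.

    Assumes `op` is deterministic, so fault-free runs agree exactly.
    """
    return majority_vote([op(x) for _ in range(copies)])


def relu(x: Sequence[float]) -> List[float]:
    """Stand-in for a model operation."""
    return [max(0.0, v) for v in x]


def make_faulty(op, faulty_call: int = 1):
    """Wrap `op` so that one chosen call returns a corrupted first element,
    simulating a single-event upset (assumption for this demo)."""
    calls = {"n": 0}

    def wrapped(x):
        out = op(x)
        if calls["n"] == faulty_call:
            out[0] = out[0] + 1024.0  # simulated corruption
        calls["n"] += 1
        return out

    return wrapped


if __name__ == "__main__":
    faulty_relu = make_faulty(relu)
    # The corrupted second run is outvoted by the two fault-free runs.
    print(resilient_op(faulty_relu, [-1.0, 2.0, 3.0]))
```

Executing every operation three times triples the compute cost, so a sketch like this would likely be applied only to selected operations if overhead is to stay small, which is the trade-off the abstract highlights.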