Job scheduling for distributed machine learning in optical WAN

2020 
Abstract: Large companies operate tens of data centers (DCs) across the globe to serve their customers and store data. Many machine learning applications, in turn, need a global view of this geo-distributed data to achieve high model accuracy. However, for such geo-distributed machine learning (Geo-DML), it is infeasible to move all data to one location over wide-area networks (WANs) because of scarce WAN bandwidth, privacy concerns and data sovereignty laws. Most Geo-DML systems therefore train models in a geo-distributed manner, which requires global model synchronization between DCs over the WAN. As training data and model sizes grow rapidly, it is challenging to use scarce and heterogeneous WAN bandwidth efficiently for model synchronization. Advances in optical technology make the network topology reconfigurable in optical WANs, which brings a new opportunity for Geo-DML training over WAN. We propose to optimize Geo-DML training with centralized joint control of the network layer and the reconfigurable optical layer. We prove that the intra-job and inter-job scheduling problems are NP-hard and strongly NP-hard, respectively. For intra-job scheduling, we present RoWAN, which is based on a deterministic rounding algorithm; it dynamically changes the topology by reconfiguring the optical devices and allocates a path and a rate to each flow. For inter-job scheduling, we provide delayed SWRT, which schedules multiple jobs according to their priorities. Simulations on real topologies show that RoWAN reduces the communication time of global model synchronization in a single iteration by 15.54% to 48.2% on average compared with traditional solutions. Compared with three other inter-job scheduling approaches, delayed SWRT reduces the weighted job completion time (WJCT) by about 60%, 44.8% and 28.76%, respectively.
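To make the WJCT metric concrete, the following minimal sketch computes a weighted job completion time under the common definition (the sum over jobs of weight times completion time). It is an illustration only and is not the delayed SWRT algorithm: it assumes jobs run one at a time without preemption, and the `Job` fields, the sample values and the `smith_order` baseline (the classical ratio rule of sorting by duration divided by weight) are all hypothetical choices that do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    weight: float    # hypothetical priority weight (higher = more important)
    duration: float  # hypothetical time the job needs when it runs alone


def weighted_completion_time(order):
    """Return sum(weight * completion_time) for jobs run back to back in `order`."""
    clock = 0.0
    total = 0.0
    for job in order:
        clock += job.duration          # this job finishes at the current clock
        total += job.weight * clock
    return total


def smith_order(jobs):
    """Classical baseline: sort by duration / weight (not the paper's method)."""
    return sorted(jobs, key=lambda j: j.duration / j.weight)


# Made-up sample jobs, used only to compare two orderings.
jobs = [Job("a", 1.0, 6.0), Job("b", 3.0, 3.0), Job("c", 2.0, 4.0)]
fifo_wjct = weighted_completion_time(jobs)
smith_wjct = weighted_completion_time(smith_order(jobs))
print(fifo_wjct, smith_wjct)
```

Under these assumptions, ordering the sample jobs by the ratio rule gives a lower weighted completion time than running them in arrival order, which is the kind of gap an inter-job scheduler aims to exploit.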