A GPU Scheduling Framework to Accelerate Hyper-Parameter Optimization in Deep Learning Clusters

2021 
This paper proposes Hermes, a container-based preemptive GPU scheduling framework for accelerating hyper-parameter optimization in deep learning (DL) clusters. Hermes accelerates hyper-parameter optimization by time-sharing GPUs among DL jobs and prioritizing jobs with more promising hyper-parameter combinations. Hermes's scheduling policy is grounded in the observation that good hyper-parameter combinations converge quickly in the early phases of training. By giving higher priority to fast-converging containers, Hermes's GPU preemption mechanism can accelerate training, enabling users to find optimal hyper-parameters faster without losing the progress of a preempted container. We have implemented Hermes over Kubernetes and compared its performance against existing scheduling frameworks. Experiments show that Hermes reduces hyper-parameter optimization time by up to 4.04 times compared with previously proposed scheduling policies such as FIFO, round-robin (RR), and SLAQ, with minimal time-sharing overhead.
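The core policy described in the abstract, ranking jobs by how quickly their loss falls early in training and preempting GPUs from slower jobs, can be sketched briefly. The following is a minimal Python illustration under assumed names, not Hermes's actual implementation: the Job class and the convergence_rate and schedule functions are hypothetical, and the real system operates as a container-preemption mechanism on Kubernetes.

```python
# Hypothetical sketch of convergence-based prioritization, assuming each
# job periodically reports its recent training-loss values. All names
# here are illustrative, not Hermes's actual API.

from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    losses: list = field(default_factory=list)  # recent losses, oldest first

def convergence_rate(job: Job, window: int = 5) -> float:
    """Normalized loss decrease over the last `window` measurements.
    Faster-converging jobs (steeper early drop) get larger values."""
    recent = job.losses[-window:]
    if len(recent) < 2 or recent[0] == 0:
        return 0.0
    return (recent[0] - recent[-1]) / abs(recent[0])

def schedule(jobs, num_gpus):
    """Grant GPUs to the most promising jobs; the rest are preempted
    (paused, not killed, so training progress is preserved)."""
    ranked = sorted(jobs, key=convergence_rate, reverse=True)
    return ranked[:num_gpus], ranked[num_gpus:]

# Example: two hyper-parameter trials compete for one GPU.
fast = Job("lr=0.01", losses=[2.3, 1.4, 0.9, 0.6, 0.5])
slow = Job("lr=0.0001", losses=[2.3, 2.2, 2.2, 2.1, 2.1])
running, preempted = schedule([fast, slow], num_gpus=1)
print([j.name for j in running])    # ['lr=0.01']
print([j.name for j in preempted])  # ['lr=0.0001']
```

Because preemption in this sketch pauses a job rather than terminating it, a preempted trial keeps its training state, which mirrors the abstract's claim that containers lose no progress under time-sharing.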