Orchestrating deep learning workloads on distributed infrastructure

2017 
Containers simplify the packaging, deployment, and orchestration of diverse workloads on distributed infrastructure. Today they are used primarily for web applications, databases, application servers, and similar services, running on infrastructure that consists of CPUs, memory, network, and storage. Accelerator hardware such as GPUs is required by an emerging class of deep learning workloads whose requirements are not addressed by current container orchestration systems such as Mesos, Kubernetes, and Docker Swarm. In this extended abstract, we discuss the requirements for supporting GPUs in container management systems and describe our solutions in Kubernetes. We conclude with a set of open issues that remain to be addressed to fully support deep learning workloads on distributed infrastructure.

The operating system (OS) allows flexible sharing and load balancing of resources such as CPU, memory, and network among multiple processes and containers. GPUs are different: they are discrete, identifiable devices (GPU 0, GPU 1, ...) and must be allocated as such (e.g., GPU 0 is assigned to container 1). GPU topology heavily affects the bandwidth of GPU-to-GPU communication and must be taken into account. Topology even affects GPU capabilities; on some systems, for example, GPUs attached to different CPU sockets cannot use peer-to-peer communication.

To address these issues, we first enabled GPU support in Kubernetes. We implemented a GPU allocator module that records the GPU number-to-device mapping. Kubernetes users request only the number of GPUs their workload needs; the allocator maps that number to actual GPU devices according to the required scheduling policy and exposes the allocated GPUs to the application inside the container. Second, we developed two advanced GPU schedulers, a bin-packing scheduler and a topology-aware scheduler, to improve GPU utilization and performance. The bin-packing scheduler bundles GPU jobs onto as few servers as possible so that the remaining idle servers are kept free for potentially large jobs. The topology-aware scheduler automatically collects the GPU topology of each worker node and assigns nodes that deliver the highest possible bandwidth to the application.

Access to CPU, memory, network, and storage is abstracted by operating system (OS) application programming interface (API) calls; the OS translates application calls into device-specific calls internally. GPUs, in contrast, expose device access calls that are not yet abstracted behind OS APIs. Applications that use GPUs therefore need the GPU devices mounted inside the container, access to auxiliary device interfaces (such as nvidia-uvm), and the GPU drivers inside the container, and the driver inside the container must exactly match the driver on the host for correct operation. To solve this, we enhanced Kubernetes to gather the device drivers on kubelet startup and mount them into the container automatically, which keeps workloads portable across systems with potentially different drivers. A similar approach is taken by Mesos, NVIDIA Docker, and other systems.

In addition, unlike CPU and memory, a GPU is an external PCIe device, and in our experience it suffers software and hardware failures far more frequently than the rest of the system. Failures include a bad connection to the PCIe slot, GPU kernel crashes, a failing power supply, and so on.
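To make the allocator idea concrete, the following is a minimal Go sketch of a module that maps a requested GPU count to concrete device IDs while preferring devices on the same CPU socket, so that peer-to-peer communication remains possible. The Device, Allocator, and Allocate names are illustrative assumptions for this sketch and are not taken from the Kubernetes code base; the actual module described above may differ.

    package gpu

    import (
        "fmt"
        "sync"
    )

    // Device describes one GPU on a worker node, including the CPU socket it is
    // attached to, which serves as the topology hint for allocation.
    type Device struct {
        ID     int // e.g. 0 corresponds to /dev/nvidia0
        Socket int // CPU socket the GPU is attached to
    }

    // Allocator records the GPU number-to-device mapping for a node and tracks
    // which devices are currently assigned to containers.
    type Allocator struct {
        mu      sync.Mutex
        devices []Device
        inUse   map[int]bool // device ID -> currently allocated
    }

    func NewAllocator(devices []Device) *Allocator {
        return &Allocator{devices: devices, inUse: make(map[int]bool)}
    }

    // Allocate maps a requested GPU count to concrete devices, preferring devices
    // on the same CPU socket so that peer-to-peer communication stays available.
    func (a *Allocator) Allocate(n int) ([]Device, error) {
        a.mu.Lock()
        defer a.mu.Unlock()

        // Group free devices by socket.
        bySocket := map[int][]Device{}
        var all []Device
        for _, d := range a.devices {
            if !a.inUse[d.ID] {
                bySocket[d.Socket] = append(bySocket[d.Socket], d)
                all = append(all, d)
            }
        }

        // Prefer a single socket that can satisfy the whole request.
        for _, free := range bySocket {
            if len(free) >= n {
                return a.take(free[:n]), nil
            }
        }
        // Otherwise span sockets (peer-to-peer may then be unavailable).
        if len(all) >= n {
            return a.take(all[:n]), nil
        }
        return nil, fmt.Errorf("requested %d GPUs, only %d free", n, len(all))
    }

    // take marks devices as allocated; the caller holds the lock.
    func (a *Allocator) take(devs []Device) []Device {
        for _, d := range devs {
            a.inUse[d.ID] = true
        }
        return devs
    }

    // Release returns devices to the pool, e.g. when the container exits.
    func (a *Allocator) Release(devs []Device) {
        a.mu.Lock()
        defer a.mu.Unlock()
        for _, d := range devs {
            a.inUse[d.ID] = false
        }
    }

A container runtime integration would then mount exactly the returned device files into the container and expose their IDs to the application, keeping the user-facing request a plain GPU count.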
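The bin-packing policy can be illustrated with a small scoring function over candidate nodes. The nodeGPUs type and the scoring formula below are assumptions made for the sketch, not the scheduler's real data structures; the point is simply that nodes which are already partially occupied score higher, so fully idle servers stay free for large jobs.

    package scheduler

    // nodeGPUs summarizes the GPU inventory of one candidate worker node. These
    // fields are illustrative; a real scheduler would derive them from node
    // capacity and the requests of pods already running there.
    type nodeGPUs struct {
        Name      string
        Capacity  int // total GPUs on the node
        Allocated int // GPUs already claimed by running containers
    }

    // binPackingScore favours nodes that are already partially occupied, so GPU
    // jobs get bundled onto fewer servers and fully idle servers stay free for
    // potentially large jobs. Higher is better; nodes that cannot fit score -1.
    func binPackingScore(n nodeGPUs, requested int) int {
        free := n.Capacity - n.Allocated
        if free < requested {
            return -1 // node cannot host the job at all
        }
        // Nodes with more GPUs already in use score higher.
        return n.Allocated + requested
    }

    // pickNode returns the best-scoring node for a request, or "" if none fits.
    func pickNode(nodes []nodeGPUs, requested int) string {
        best, bestScore := "", -1
        for _, n := range nodes {
            if s := binPackingScore(n, requested); s > bestScore {
                best, bestScore = n.Name, s
            }
        }
        return best
    }

The topology-aware scheduler can be viewed as replacing this score with one derived from the measured GPU interconnect bandwidth on each node, so that the node offering the best GPU-to-GPU bandwidth for the requested group of devices wins.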
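The driver and device handling can likewise be sketched as a helper that assembles the mounts a GPU container needs. The Mount type, the gpuMounts helper, the driverDir parameter, and the /usr/local/nvidia target path are hypothetical names chosen for the sketch; only the device files themselves (/dev/nvidiaN, /dev/nvidiactl, /dev/nvidia-uvm) are standard NVIDIA interfaces.

    package mounts

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // Mount is a simplified host-path bind mount for a container spec.
    type Mount struct {
        HostPath      string
        ContainerPath string
        ReadOnly      bool
    }

    // gpuMounts assembles the mounts a GPU container needs: the allocated GPU
    // device files, the auxiliary NVIDIA interfaces, and the host driver files,
    // so that the driver seen inside the container always matches the host.
    // driverDir is a hypothetical host directory that the kubelet populates at
    // startup with the gathered driver libraries and binaries.
    func gpuMounts(gpuIDs []int, driverDir string) ([]Mount, error) {
        var mounts []Mount

        // Allocated GPU device nodes, e.g. /dev/nvidia0 for GPU 0.
        for _, id := range gpuIDs {
            dev := fmt.Sprintf("/dev/nvidia%d", id)
            if _, err := os.Stat(dev); err != nil {
                return nil, fmt.Errorf("GPU device %s not present: %w", dev, err)
            }
            mounts = append(mounts, Mount{HostPath: dev, ContainerPath: dev})
        }

        // Auxiliary device interfaces required by the CUDA stack.
        for _, dev := range []string{"/dev/nvidiactl", "/dev/nvidia-uvm"} {
            mounts = append(mounts, Mount{HostPath: dev, ContainerPath: dev})
        }

        // Host driver files, mounted read-only at an agreed-upon path inside
        // the container (the path here is illustrative).
        mounts = append(mounts, Mount{
            HostPath:      filepath.Clean(driverDir),
            ContainerPath: "/usr/local/nvidia",
            ReadOnly:      true,
        })
        return mounts, nil
    }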
To deal with such failures, we enabled a GPU liveness check in Kubernetes, analogous to the liveness checks used for any cloud service. The kubelet periodically checks the health of the GPU devices; once a GPU failure is detected, that GPU is automatically removed from the resource pool. Finally, to support multiple users, we added GPU quota support to Kubernetes so that GPU resources can be limited per namespace, and we automatically label worker nodes with their GPU device model so that jobs can use this information to filter GPUs. All of these features originated from real requirements and aim to improve the usability of GPUs in a cloud context. GPU-related scheduling policies and algorithms that improve both GPU performance and utilization remain open issues. We plan to extend CPU topology and affinity support in Kubernetes so that scheduling can be jointly optimized across CPU and GPU topology. Exploiting CPU-GPU bandwidth opens further opportunities for performance improvement, especially on servers with NVLink between CPU and GPU (such as the IBM Minsky).
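The liveness check described above can be sketched as a periodic probe that reports the currently healthy devices so the node's advertised GPU capacity can be adjusted. Shelling out to nvidia-smi is a stand-in used here for simplicity; the actual check may query the driver differently, and checkGPUs, watchGPUs, and report are illustrative names for this sketch.

    package health

    import (
        "context"
        "log"
        "os/exec"
        "strings"
        "time"
    )

    // checkGPUs probes GPU health by asking the driver to enumerate devices.
    // Shelling out to nvidia-smi is a stand-in for the sketch; a production
    // check could talk to the management library directly. It returns the
    // indices of the devices the driver reports.
    func checkGPUs(ctx context.Context) ([]string, error) {
        out, err := exec.CommandContext(ctx,
            "nvidia-smi", "--query-gpu=index", "--format=csv,noheader").Output()
        if err != nil {
            // Driver not responding: treat all GPUs on the node as unhealthy.
            return nil, err
        }
        return strings.Fields(strings.TrimSpace(string(out))), nil
    }

    // watchGPUs runs the check periodically, the way a kubelet-side liveness
    // check would, and passes the healthy device indices to report so that the
    // node's advertised GPU capacity can be adjusted.
    func watchGPUs(ctx context.Context, interval time.Duration, report func([]string)) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                healthy, err := checkGPUs(ctx)
                if err != nil {
                    log.Printf("GPU liveness check failed: %v", err)
                    healthy = nil
                }
                report(healthy)
            }
        }
    }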