Job Scheduling with Minimizing Data Communication Costs

2015 
The research presented in this paper analyzes different algorithms for scheduling a set of potentially interdependent jobs in order to minimize the total runtime, or makespan, when data communication costs are considered. On distributed file systems, such as the HDFS, files are spread across a cluster of machines. Once a request, such as a query, having as input the data in these files is translated into a set of jobs, these jobs must be scheduled across machines in the cluster. Jobs consume input files stored in the distributed file system or in cluster nodes, and produce output which is potentially consumed by future jobs. If a job needs a particular file as input, the job must either be executed on the same machine, or it must incur a time penalty to copy the file, increasing latency for the job. At the same time, independent jobs are ideally scheduled at the same time on different machines, in order to take advantage of parallelism. Both minimizing communication costs and maximizing parallelism serve to minimize the total runtime of a set of jobs. Furthermore, the problem gets more complex when certain jobs must wait for previous jobs to provide input, as is frequently the case when a set of jobs represents the steps of a query on a distributed database.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    5
    References
    2
    Citations
    NaN
    KQI
    []