DCache: A Distributed Cache Mechanism for HDFS based on RDMA

2020 
Hadoop Distributed File System (HDFS) is the component of Hadoop that provides data storage, and the performance of its IO subsystem strongly influences data processing efficiency. In a Hadoop system, computing jobs are scheduled to the nodes where their data is located in order to reduce IO time, so both the job scheduler and the data distribution affect processing efficiency. HDFS provides a mechanism that lets users specify files or directories to be cached, but the cached data is accessible only to jobs running on the same data node. In this paper, we present a distributed cache mechanism for HDFS on RDMA-capable networks. We separate the caching function from the original data nodes, and we design and implement a new kind of component for the caching function. The data cached in these nodes can be accessed directly by any node in the cluster through Remote Direct Memory Access (RDMA). Compared with the existing caching mechanism, our approach improves the performance and stability of the IO process. We also use our caching mechanism to accelerate the data writing process. Experiments show that, compared with the existing caching mechanism of HDFS, our scheme improves read latency by up to 3 times, improves throughput for large files by 30% to 95%, and improves write performance by 64%. When tasks run on different nodes, the IO fluctuation caused by differing data distribution drops from 20.71 to 0.50.