Small files’ problem in Hadoop: A systematic literature review
2021
Abstract Apache Hadoop is an open-source software library that integrates a wide variety of software tools and utilities to facilitate the distributed batch processing of big data sets. The Hadoop ecosystem comprises two major components: the Hadoop Distributed File System (HDFS), which is primarily used for storage, and MapReduce, which is primarily used for processing. The performance of Hadoop degrades when it comes to the storage and processing of small files. Small files are files that are significantly smaller than the default block size of HDFS. Because each small file occupies a block of its own, large numbers of them lead to excessive memory requirements, access latencies and processing times, and scaling the memory or tolerating latencies and processing delays beyond a limit is not a viable option. Hence, in this paper, a Systematic Literature Review has been performed to provide a comprehensive and exhaustive overview of the small files problem in Hadoop. The paper defines a comprehensive taxonomy of the Hadoop ecosystem and its small files problem. Further, the study also critically analyzes the solutions that have been proposed to overcome this problem. These solutions have been analytically studied to identify the set of parameters that should be considered in future work proposing solutions to this problem.
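To make the storage overhead concrete: HDFS keeps metadata for every file, block and directory in NameNode memory (a commonly cited rule of thumb is roughly 150 bytes per object), so tens of millions of small files can consume gigabytes of heap regardless of how little data they actually hold. The sketch below is a minimal illustration of one widely used mitigation, packing many small files into a single Hadoop SequenceFile so the NameNode tracks one large file instead of one block per small file. It is not a specific solution surveyed in this review; the class name SmallFilePacker and its command-line arguments are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

/**
 * Illustrative sketch: packs a local directory of small files into one HDFS
 * SequenceFile, storing each file name as the key and its raw bytes as the value.
 * Class name and arguments are hypothetical, for illustration only.
 */
public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        File localDir = new File(args[0]);   // e.g. a directory of small files
        Path target = new Path(args[1]);     // e.g. an HDFS path for the packed SequenceFile

        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            File[] files = localDir.listFiles();
            if (files == null) return;       // not a directory or unreadable
            for (File f : files) {
                if (!f.isFile()) continue;
                byte[] bytes = Files.readAllBytes(f.toPath());
                // One key/value record per small file; the SequenceFile itself
                // occupies full-sized HDFS blocks, keeping NameNode metadata small.
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        }
    }
}
```

Downstream MapReduce jobs can then read the packed records with SequenceFileInputFormat, avoiding one map task per tiny input file.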