VIFI: Virtual information fabric infrastructure for data-driven discoveries from distributed earth science data

2017 
Traditional data analytics involves manually identifying and downloading relevant distributed datasets to a common server or cluster where the analytics processes are executed. For very large distributed datasets, this slows down the analytics process, and for extremely large datasets, downloading such massive volumes is often impractical due to bandwidth limitations. In such cases, data scientists must be given explicit access to the remote servers hosting the datasets, and must possess detailed knowledge of the server infrastructure and environments, in order to send their analytics packages to the data owner. This alternative poses considerable challenges and has not been adequately addressed to date. In this paper, we describe a novel approach to this challenge, called Virtual Information Fabric Infrastructure (VIFI), which allows users to conduct analytics in place by sending the analytics to the distributed repositories, without moving the underlying datasets to a common location. By automating both the dispatch of analytics scripts to the data and the orchestration of the distributed infrastructure, VIFI allows users to execute and coordinate complex analytics activities in place across multiple data repositories. VIFI uses Docker containerization technology along with the open-source workflow tool NiFi to achieve automated orchestration and distributed analytics without requiring users to possess detailed knowledge of the distributed repositories and their underlying infrastructure. We demonstrate and evaluate VIFI on an Earth science use case, the evaluation of precipitation over the Great Plains, which involves analytics on massive distributed data repositories.