Optimizing data query performance of Bi-cluster for large-scale scientific data in supercomputers

2021 
Scientific exploration and discovery rely heavily on ever-growing datasets and powerful supercomputers. This surge in data poses substantial management challenges for existing data query frameworks. Although many data management techniques have been developed to quickly locate selected data records, the time and space required to build and store the indexes they depend on are often prohibitive. To address the problem of locating data in a parallel file system that manages large-scale scientific data, we propose an improved high-performance data query framework called “Bi-cluster+.” For index generation, we design a hierarchical index data structure that balances index granularity against index construction overhead. We design a write load-balancing strategy based on the characteristics of the index offsets, so that the hierarchical index is written independently and in parallel, and we optimize in situ index generation through resource constraint analysis. For data retrieval, we propose optimization techniques that improve query performance, such as a strategy for merging and reading logical data blocks. In experiments with multiple scientific datasets on a supercomputer, our optimizations improve data query performance by up to a factor of 1.9 compared with the original Bi-cluster implementation. Bi-cluster+ also scales well, maintaining good performance in an evaluation on 17,496 cores.
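The abstract does not spell out how logical data blocks are merged before reading, so the following is only a minimal sketch of one common approach, not the paper's implementation. It assumes each block selected by a query is described by a hypothetical `(offset, length)` pair in bytes, and it coalesces blocks that touch, overlap, or lie within an assumed `max_gap` of each other so that they can be fetched with fewer, larger read requests.

```python
from typing import List, Tuple

# Hypothetical representation: (byte offset in file, length in bytes).
Block = Tuple[int, int]


def merge_blocks(blocks: List[Block], max_gap: int = 0) -> List[Block]:
    """Coalesce blocks whose byte ranges overlap, touch, or are separated
    by at most max_gap bytes. Returns the merged ranges sorted by offset."""
    merged: List[Block] = []
    for offset, length in sorted(blocks):
        if merged:
            last_offset, last_length = merged[-1]
            last_end = last_offset + last_length
            if offset <= last_end + max_gap:
                # Extend the previous range to cover this block as well.
                new_end = max(last_end, offset + length)
                merged[-1] = (last_offset, new_end - last_offset)
                continue
        merged.append((offset, length))
    return merged


if __name__ == "__main__":
    selected = [(16, 4), (0, 4), (4, 4)]
    print(merge_blocks(selected))             # adjacent blocks are joined
    print(merge_blocks(selected, max_gap=8))  # small gaps are read through
```

A larger `max_gap` trades some wasted bytes for fewer I/O requests; where the right balance lies on a given parallel file system is an open tuning question that this sketch does not answer.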