A Novel Optimization Method to Improve De-duplication Storage System Performance

2009 
Data de-duplication has become a commodity component in data-intensive storage systems. Compared with traditional storage paradigms, however, a de-duplication system eliminates duplicate or redundant data at the cost of introducing several additional layers or functional components into the I/O path, each of which is either CPU-intensive or I/O-intensive and can significantly degrade overall system performance. Targeting these potential bottlenecks, this paper quantitatively analyzes the overhead of each main component introduced by de-duplication and then proposes two performance optimization methods. The first is parallel computation of content-aware chunk identifiers, which exploits both inter-chunk and intra-chunk parallelism through a task-partitioning and chunk-content-distribution algorithm. Experiments demonstrate that it improves system throughput by up to 150% while making much better use of multiprocessor resources. The second is storage pipelining, which overlaps CPU-bound, I/O-bound, and network communication tasks. With a dedicated five-stage storage pipeline for file archival operations, experimental results show that system throughput increases by up to 25% on our workloads.
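To make the first method concrete, below is a minimal Go sketch of the inter-chunk half of the idea: a worker pool that computes chunk identifiers concurrently, one worker per processor. SHA-1 as the content-aware identifier, and the chunk/fingerprintChunks names, are assumptions for illustration; the paper's actual task-partitioning and intra-chunk content-distribution algorithm is not specified in the abstract.

package main

import (
	"crypto/sha1"
	"fmt"
	"runtime"
	"sync"
)

// chunk is one unit of de-duplication; its fingerprint identifies duplicates.
type chunk struct {
	data []byte
}

// fingerprintChunks computes a SHA-1 identifier per chunk with a worker
// pool sized to the number of CPUs. This shows inter-chunk parallelism
// only; the paper's intra-chunk scheme is not described in the abstract.
func fingerprintChunks(chunks []chunk) [][sha1.Size]byte {
	out := make([][sha1.Size]byte, len(chunks))
	work := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range work {
				out[i] = sha1.Sum(chunks[i].data) // CPU-bound hashing, done concurrently
			}
		}()
	}
	for i := range chunks {
		work <- i // distribute chunk indices to idle workers
	}
	close(work)
	wg.Wait()
	return out
}

func main() {
	chunks := []chunk{{[]byte("alpha")}, {[]byte("beta")}, {[]byte("alpha")}}
	for i, fp := range fingerprintChunks(chunks) {
		fmt.Printf("chunk %d -> %x\n", i, fp)
	}
}

Identical chunks (here the two "alpha" chunks) yield identical fingerprints, which is what lets the store eliminate the duplicate copy.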
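The storage-pipelining method can likewise be sketched with channel-connected stages, so that CPU-bound fingerprinting overlaps with I/O-bound reading and storing. The five stages assumed here (read, chunk, fingerprint, index lookup, store) are illustrative; the abstract does not name the paper's actual stages, and the fixed-size chunker stands in for whatever content-aware chunking the system uses.

package main

import (
	"crypto/sha1"
	"fmt"
)

type chunk struct {
	data []byte
	fp   [sha1.Size]byte
	dup  bool
}

// Stage 1: read files (I/O-bound in a real system; in-memory here).
func read(files [][]byte) <-chan []byte {
	out := make(chan []byte)
	go func() {
		defer close(out)
		for _, f := range files {
			out <- f
		}
	}()
	return out
}

// Stage 2: split each file into fixed-size chunks (a stand-in for
// content-aware chunking).
func chunker(in <-chan []byte, size int) <-chan chunk {
	out := make(chan chunk)
	go func() {
		defer close(out)
		for f := range in {
			for len(f) > 0 {
				n := size
				if n > len(f) {
					n = len(f)
				}
				out <- chunk{data: f[:n]}
				f = f[n:]
			}
		}
	}()
	return out
}

// Stage 3: compute the chunk identifier (CPU-bound).
func fingerprint(in <-chan chunk) <-chan chunk {
	out := make(chan chunk)
	go func() {
		defer close(out)
		for c := range in {
			c.fp = sha1.Sum(c.data)
			out <- c
		}
	}()
	return out
}

// Stage 4: look up the identifier in the chunk index to detect duplicates.
func lookup(in <-chan chunk) <-chan chunk {
	out := make(chan chunk)
	go func() {
		defer close(out)
		seen := map[[sha1.Size]byte]bool{}
		for c := range in {
			c.dup = seen[c.fp]
			seen[c.fp] = true
			out <- c
		}
	}()
	return out
}

// Stage 5: persist unique chunks (I/O- or network-bound in a real system).
func store(in <-chan chunk) {
	for c := range in {
		if !c.dup {
			fmt.Printf("store chunk %x (%d bytes)\n", c.fp[:4], len(c.data))
		}
	}
}

func main() {
	files := [][]byte{[]byte("hello world"), []byte("hello again")}
	store(lookup(fingerprint(chunker(read(files), 4))))
}

Because every stage runs concurrently, the pipeline's throughput is bounded by its slowest stage rather than by the sum of all stage latencies, which is the overlap effect the paper's 25% throughput gain relies on.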