A Novel Optimization Method to Improve De-duplication Storage System Performance
2009
Data de-duplication has become a commodity component in data-intensive storage systems. However, compared with traditional storage paradigms, a de-duplication system eliminates redundant data at the cost of adding several layers or functional components to the I/O path; these components are either CPU-intensive or I/O-intensive and can significantly hinder overall system performance. Targeting these potential bottlenecks, this paper quantitatively analyzes the overhead of each main component introduced by de-duplication and then proposes two performance optimization methods. The first is parallel computation of content-aware chunk identifiers, which exploits both inter-chunk and intra-chunk parallelism through a task-partition and chunk-content distribution algorithm. Experiments demonstrate that it improves system throughput by up to 150% while making much better use of multiprocessor resources. The second is storage pipelining, which overlaps CPU-bound, I/O-bound, and network-communication tasks. With a dedicated five-stage storage pipeline design for file-archival operations, experimental results show that system throughput increases by up to 25% on our workloads.
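The first optimization, parallel fingerprinting of content-defined chunks, can be sketched as follows. This is a minimal illustration of inter-chunk parallelism only, not the paper's actual task-partition and chunk-content distribution algorithm; the function name, the choice of SHA-1 as the chunk identifier, and the thread-pool strategy are all assumptions for the sake of the example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def chunk_fingerprints(chunks, workers=4):
    """Compute SHA-1 identifiers for a list of content-defined chunks
    in parallel across a pool of worker threads.

    Note: CPython's hashlib releases the GIL while hashing buffers
    larger than ~2 KiB, so threads yield real inter-chunk parallelism
    for typical de-duplication chunk sizes (4-8 KiB and up).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One hash task per chunk; results come back in input order.
        return list(pool.map(lambda c: hashlib.sha1(c).hexdigest(), chunks))

# Example: fingerprint two 4 KiB chunks concurrently.
ids = chunk_fingerprints([b"a" * 4096, b"b" * 4096])
```

The paper additionally exploits intra-chunk parallelism (splitting the hash computation of a single chunk across processors), which requires a parallelizable fingerprint construction and is beyond this sketch.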