language-icon Old Web
English
Sign In

Data differencing

In computer science and information theory, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as input source data and target data, and produces difference data such that given the source data and the difference data, one can reconstruct the target data ('patching' the source with the difference to produce the target). In computer science and information theory, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as input source data and target data, and produces difference data such that given the source data and the difference data, one can reconstruct the target data ('patching' the source with the difference to produce the target). One of the best-known examples of data differencing is the diff utility, which produces line-by-line differences of text files (and in some implementations, binary files, thus being a general-purpose differencing tool). Differencing of general binary files goes under the rubric of delta encoding, with a widely used example being the algorithm used in rsync. A standardized generic differencing format is VCDIFF, implemented in such utilities as Xdelta version 3. A high-efficiency (small patch files) differencing program is bsdiff, which is based on bzip2 compression, demonstrating the close connection between differencing and compression. Main concerns for data differencing are usability and space efficiency (patch size). If one simply wishes to reconstruct the target given the source and patch, one may simply include the entire target in the patch and 'apply' the patch by discarding the source and outputting the target that has been included in the patch; similarly, if the source and target have the same size one may create a simple patch by XORing source and target. In both these cases, the patch will be as large as the target. As these examples show, if the only concern is reconstruction of target, this is easily done, at the expense of a large patch, and the main concern for general-purpose binary differencing is reducing the patch size. For structured data especially, one has other concerns, which largely fall under 'usability' – for example, if one is comparing two documents, one generally wishes to know which sections have changed, or if some sections have been moved around – one wishes to understand how the documents differ. For instance 'here 'cat' was changed to 'dog', and paragraph 13 was moved to paragraph 14'. One may also wish to have robust differences – for example, if two documents A and B differ in paragraph 13, one may wish to be able to apply this patch even if one has changed paragraph 7 of A. An example of this is in diff, which shows which lines changed, and where the context format allows robustness and improves human readability.

[ "Algorithm", "Database", "Theoretical computer science", "Statistics", "Data mining" ]
Parent Topic
Child Topic
    No Parent Topic