Normalized compression distance

Normalized compression distance (NCD) is a way of measuring the similarity between two objects, be it two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, two genomes, to name a few. Such a measurement should not be application dependent or arbitrary. A reasonable definition for the similarity between two objects is how difficult it is to transform them into each other. Normalized compression distance (NCD) is a way of measuring the similarity between two objects, be it two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, two genomes, to name a few. Such a measurement should not be application dependent or arbitrary. A reasonable definition for the similarity between two objects is how difficult it is to transform them into each other. It can be used in information retrieval and data mining for cluster analysis. We assume that the objects one talks about are finite strings of 0s and 1s. Thus we mean string similarity. Every computer file is of this form, that is, if an object is a file in a computer it is of this form. One can define the information distance between strings x {displaystyle x} and y {displaystyle y} as the length of the shortest program p {displaystyle p} that computes x {displaystyle x} from y {displaystyle y} and vice versa. This shortest program is in a fixed programming language. For technical reasons one uses the theoretical notion of Turing machines. Moreover, to express the length of p {displaystyle p} one uses the notion of Kolmogorov complexity. Then, it has been shown up to logarithmic additive terms which can be ignored. This information distance is shown to be a metric(it satisfies the metric inequalities up to a logarithmic additive term), is universal (it minorizes every computable distance as computed for example from features up to a constant additive term).

Parent Topic

Child Topic

No Parent Topic