language-icon Old Web
English
Sign In

Damerau–Levenshtein distance

In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other. In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other. The Damerau–Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions). In his seminal paper, Damerau stated that more than 80% of all human misspellings can be expressed by a single error of one of the four types. Damerau's paper considered only misspellings that could be corrected with at most one edit operation. While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between protein sequences. To express the Damerau–Levenshtein distance between two strings a {displaystyle a} and b {displaystyle b} a function d a , b ( i , j ) {displaystyle d_{a,b}(i,j)} is defined, whose value is a distance between an i {displaystyle i} –symbol prefix (initial substring) of string a {displaystyle a} and a j {displaystyle j} –symbol prefix of b {displaystyle b} . The restricted distance function is defined recursively as:,:A:11 d a , b ( i , j ) = min { 0 if  i = j = 0 d a , b ( i − 1 , j ) + 1 if  i > 0 d a , b ( i , j − 1 ) + 1 if  j > 0 d a , b ( i − 1 , j − 1 ) + 1 ( a i ≠ b j ) if  i , j > 0 d a , b ( i − 2 , j − 2 ) + 1 if  i , j > 1  and  a [ i ] = b [ j − 1 ]  and  a [ i − 1 ] = b [ j ] {displaystyle qquad d_{a,b}(i,j)=min {egin{cases}0&{ ext{if }}i=j=0\d_{a,b}(i-1,j)+1&{ ext{if }}i>0\d_{a,b}(i,j-1)+1&{ ext{if }}j>0\d_{a,b}(i-1,j-1)+1_{(a_{i} eq b_{j})}&{ ext{if }}i,j>0\d_{a,b}(i-2,j-2)+1&{ ext{if }}i,j>1{ ext{ and }}a=b{ ext{ and }}a=b\end{cases}}} where 1 ( a i ≠ b j ) {displaystyle 1_{(a_{i} eq b_{j})}} is the indicator function equal to 0 when a i = b j {displaystyle a_{i}=b_{j}} and equal to 1 otherwise.

[ "String metric", "Approximate string matching", "Wagner–Fischer algorithm", "String-to-string correction problem" ]
Parent Topic
Child Topic
    No Parent Topic