Record linkage

Record linkage (RL) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), which may be due to differences in record shape, storage location, or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. Record linkage is called data linkage in many jurisdictions, but is the same process. Record linkage (RL) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), which may be due to differences in record shape, storage location, or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. Record linkage is called data linkage in many jurisdictions, but is the same process. The initial idea of record linkage goes back to Halbert L. Dunn in his 1946 article titled 'Record Linkage' published in the American Journal of Public Health. Howard Borden Newcombe laid the probabilistic foundations of modern record linkage theory in a 1959 article in Science, which were then formalized in 1969 by Ivan Fellegi and Alan Sunter who proved that the probabilistic decision rule they described was optimal when the comparison attributes were conditionally independent. Their pioneering work 'A Theory For Record Linkage' remains the mathematical foundation for many record linkage applications even today. Since the late 1990s, various machine learning techniques have been developed that can, under favorable conditions, be used to estimate the conditional probabilities required by the Fellegi-Sunter (FS) theory. Several researchers have reported that the conditional independence assumption of the FS algorithm is often violated in practice; however, published efforts to explicitly model the conditional dependencies among the comparison attributes have not resulted in an improvement in record linkage quality. On the other hand, machine learning or neural network algorithms that do not rely on these assumptions often provide far higher accuracy, when sufficient labeled training data is available. Record linkage can be done entirely without the aid of a computer, but the primary reasons computers are often used for record linkage are to reduce or eliminate manual review and to make results more easily reproducible. Computer matching has the advantages of allowing central supervision of processing, better quality control, speed, consistency, and better reproducibility of results. 'Record linkage' is the term used by statisticians, epidemiologists, and historians, among others, to describe the process of joining records from one data source with another that describe the same entity. Commercial mail and database applications refer to it as 'merge/purge processing' or 'list washing'. Computer scientists often refer to it as 'data matching' or as the 'object identity problem'. Other names used to describe the same concept include: 'coreference/entity/identity/name/record resolution', 'entity disambiguation/linking', 'fuzzy matching', 'duplicate detection', 'deduplication', 'record matching', '(reference) reconciliation', 'object identification', 'data/information integration' and 'conflation'. This profusion of terminology has led to few cross-references between these research communities. While they share similar names, record linkage and Linked Data are two separate approaches to processing and structuring data. Although both involve identifying matching entities across different data sets, record linkage standardly equates 'entities' with human individuals; by contrast, Linked Data is based on the possibility of interlinking any web resource across data sets, using a correspondingly broader concept of identifier, namely a URI. Record linkage is highly sensitive to the quality of the data being linked, so all data sets under consideration (particularly their key identifier fields) should ideally undergo a data quality assessment prior to record linkage. Many key identifiers for the same entity can be presented quite differently between (and even within) data sets, which can greatly complicate record linkage unless understood ahead of time. For example, key identifiers for a man named William J. Smith might appear in three different data sets as so: In this example, the different formatting styles lead to records that look different but in fact all refer to the same entity with the same logical identifier values. Most, if not all, record linkage strategies would result in more accurate linkage if these values were first normalized or standardized into a consistent format (e.g., all names are 'Surname, Given name', and all dates are 'YYYY/MM/DD'). Standardization can be accomplished through simple rule-based data transformations or more complex procedures such as lexicon-based tokenization and probabilistic hidden Markov models. Several of the packages listed in the Software Implementations section provide some of these features to simplify the process of data standardization. Entity resolution is an operational intelligence process, typically powered by an entity resolution engine or middleware, whereby organizations can connect disparate data sources with a view to understanding possible entity matches and non-obvious relationships across multiple data silos. It analyzes all of the information relating to individuals and/or entities from multiple sources of data, and then applies likelihood and probability scoring to determine which identities are a match and what, if any, non-obvious relationships exist between those identities.

Parent Topic

Child Topic

No Parent Topic