Fast and simple comparison of semi-structured data, with emphasis on electronic health records

Max Robinson, Jennifer Hadlock, Jiyang Yu, Alireza Khatamian, Aleksandr Y. Aravkin, Eric W. Deutsch, Nathan D. Price, Sui Huang, Gustavo Glusman

2018

We present a locality-sensitive hashing strategy for summarizing semi-structured data (e.g., in JSON or XML formats) into "data fingerprints": highly compressed representations which cannot recreate details in the data, yet simplify and greatly accelerate the comparison and clustering of semi-structured data by preserving similarity relationships. Computation on data fingerprints is fast: in one example involving complex simulated medical records, the average time to encode one record was 0.53 seconds, and the average pairwise comparison time was 3.75 microseconds. Both processes are trivially parallelizable. Applications include detection of duplicates, clustering, and classification of semi-structured data, which support larger goals including summarizing large and complex data sets, quality assessment, and data mining. We illustrate use cases with three analyses of electronic health records (EHRs): (1) pairwise comparison of patient records, (2) analysis of cohort structure, and (3) evaluation of methods for generating simulated patient data.
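
To make the general idea concrete, the following is a minimal sketch of one way a JSON record could be reduced to a fixed-length vector and compared. It is not the algorithm of this paper: it substitutes plain signed feature hashing of (path, value) pairs, and the names `flatten`, `fingerprint`, and `similarity`, the vector length of 64, and the choice of cosine similarity are all assumptions made for illustration.

```python
import hashlib
import json
import math


def flatten(obj, path=""):
    """Yield (path, serialized leaf value) pairs from nested dicts and lists."""
    if isinstance(obj, dict):
        for key in sorted(obj):
            yield from flatten(obj[key], f"{path}/{key}")
    elif isinstance(obj, list):
        for item in obj:
            # List positions are dropped, so reordering a list does not
            # change the resulting fingerprint (an illustrative choice).
            yield from flatten(item, f"{path}[]")
    else:
        yield path, json.dumps(obj)


def fingerprint(record, length=64):
    """Hash every (path, value) feature into a signed slot of a fixed-length vector."""
    vec = [0.0] * length
    for path, value in flatten(record):
        digest = hashlib.sha256(f"{path}={value}".encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % length
        sign = 1.0 if digest[4] & 1 else -1.0
        vec[index] += sign
    return vec


def similarity(a, b):
    """Cosine similarity of two fingerprints; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical records, invented for this example only.
record_a = {"patient": {"age": 54, "sex": "F"}, "codes": ["E11.9", "I10", "E78.5"]}
record_b = {"codes": ["I10", "E78.5", "E11.9"], "patient": {"sex": "F", "age": 54}}
print(similarity(fingerprint(record_a), fingerprint(record_b)))
```

Because the vector has a fixed length, comparing two fingerprints costs the same regardless of how large the original records were, and the original values cannot be read back from the hashed slots. Records that share most (path, value) features tend to score close to 1, though hash collisions make this approximate.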
