Generating Concise Entity Matching Rules

2017 
Entity matching (EM) is a critical part of data integration and cleaning. In many applications, the users need to understand why two entities are considered a match, which reveals the need for interpretable and concise EM rules. We model EM rules in the form of General Boolean Formulas ( GBF s) that allows arbitrary attribute matching combined by conjunctions (∨), disjunctions (∧), and negations. (¬) GBF s can generate more concise rules than traditional EM rules represented in disjunctive normal forms ( DNF s). We use program synthesis , a powerful tool to automatically generate rules (or programs) that provably satisfy a high-level specification, to automatically synthesize EM rules in GBF format, given only positive and negative matching examples. In this demo, attendees will experience the following features: (1) Interpretability -- they can see and measure the conciseness of EM rules defined using GBF s; (2) Easy customization -- they can provide custom experiment parameters for various datasets, and, easily modify a rich predefined (default) synthesis grammar, using a Web interface; and (3) High performance -- they will be able to compare the generated concise rules, in terms of accuracy, with probabilistic models (e.g., machine learning methods), and hand-written EM rules provided by experts. Moreover, this system will serve as a general platform for evaluating different methods that discover EM rules, which will be released as an open-source tool on GitHub.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    6
    References
    42
    Citations
    NaN
    KQI
    []