A Model and System for Querying Provenance from Data Cleaning Workflows

2020 
Data cleaning is an essential component of data preparation in machine learning and other data science workflows, and is widely recognized as the most time-consuming and error-prone part when working with real-world data. How data was prepared and cleaned has a significant impact on the reliability and trustworthiness of results of any subsequent analysis. Transparent data cleaning not only requires that provenance (i.e., operation history and value changes) be captured, but also that those changes are easy to explore and evaluate: The data scientists who prepare the data, as well as others who want to reuse the cleaned data for their studies, need to be able to easily explore and query its data cleaning history. We have developed a domain-specific provenance model for data cleaning that supports the kind of provenance questions that data scientists need to answer when inspecting and debugging data preparation histories. The design of the model was driven by the need (i) to answer relevant, user-oriented provenance questions, and (ii) to do so in an effective and efficient manner. The model is a refinement of an earlier provenance model and has been implemented as a companion tool to OpenRefine, a popular, open source tool for data cleaning.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    13
    References
    1
    Citations
    NaN
    KQI
    []