CLX: Towards a scalable and comprehensible design of PBE data transformations

2018 
Effective data analytics on data collected from the real world usually begins with a notoriously expensive pre-processing step of data transformation and wrangling. Human-in-the-loop tools have been proposed to speed up the process of data transformation, using the Programming By Example (PBE) approach. However, two important usability issues limit the effective use of such PBE data transformation systems: (1) the cost of user effort grows quickly as the volume or heterogeneity of the raw data increases (prohibitive user effort), and (2) the underlying process of transformation is opaque to the user and hence difficult to validate, correct, and debug (incomprehensibility). In this project, we propose a new PBE data transformation paradigm, CLX (pronounced "clicks"), for data normalization to address these two issues. For the issue of prohibitive user effort, we present a pattern profiling algorithm that hierarchically clusters the raw input data by format structure, helping the user quickly identify both well-formatted and ill-formatted data and specify the desired format. After the desired transformation logic is inferred, CLX explains it as a set of simple regular expression replacement operations to improve comprehensibility. We experimentally compared the CLX prototype with FlashFill, a state-of-the-art data transformation tool. The results show improvements over the state of the art in saving user effort and enhancing comprehensibility, without loss of efficiency or expressive power. In a user effort study on data sets of various sizes, when the data size grew by a factor of 30, the user effort required by the CLX prototype grew 1.2x, whereas that required by FlashFill grew 9.1x. In another test assessing the users' understanding of the transformation logic, the CLX users achieved a success rate about twice that of the FlashFill users.
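The two ideas in the abstract, grouping raw values by format structure and expressing a transformation as a regular expression replacement, can be illustrated with a minimal Python sketch. This is not the CLX algorithm itself: the token classes (`<D>` for digit runs, `<A>` for letter runs), the function names `pattern_of` and `profile`, and the sample phone numbers are all assumptions made for illustration, and the grouping here is a single flat level rather than the hierarchical clustering the abstract describes.

```python
import re
from collections import defaultdict

# Hypothetical token classes; the token set used by CLX may differ.
TOKEN_CLASSES = [
    (re.compile(r"[0-9]+"), "<D>"),     # a run of digits
    (re.compile(r"[A-Za-z]+"), "<A>"),  # a run of letters
]


def pattern_of(value: str) -> str:
    """Map a string to a coarse format pattern, keeping punctuation literally."""
    parts = []
    i = 0
    while i < len(value):
        for regex, label in TOKEN_CLASSES:
            match = regex.match(value, i)
            if match:
                parts.append(label)
                i = match.end()
                break
        else:
            # No token class matched: keep the character as a literal.
            parts.append(value[i])
            i += 1
    return "".join(parts)


def profile(values):
    """Group values that share the same format pattern (one flat level)."""
    clusters = defaultdict(list)
    for value in values:
        clusters[pattern_of(value)].append(value)
    return dict(clusters)


# Made-up sample data in three different formats.
phones = ["(734) 555-0101", "734-555-0102", "734.555.0103", "(734) 555-0104"]
clusters = profile(phones)
for pattern, members in clusters.items():
    print(pattern, members)

# One transformation expressed as a readable regex replacement:
# rewrite "(ddd) ddd-dddd" into the target format "ddd-ddd-dddd".
normalized = re.sub(r"^\((\d{3})\) (\d{3})-(\d{4})$", r"\1-\2-\3", phones[0])
print(normalized)
```

In this sketch the user would inspect three pattern groups instead of every individual row, which is the intuition behind user effort growing slowly with data size. A replacement rule written per pattern can also be read and checked directly, unlike an opaque inferred program.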