Discovering Similar Workflows via Provenance Clustering: A Case Study

2018 
Several workflow management systems and scripting languages have adopted provenance tracking, yet many researchers still capture provenance manually or instrument their processing scripts to write provenance information to files. The Next Generation Sequencing (NGS) project we are associated with tracks provenance in this manner. The NGS project is a collaboration between multiple groups at different sites, where each group collects and processes samples using an agreed-upon workflow. The workflow contains many stages of varying complexity. Over time, workflow stages are modified, but data samples are comparable only when processed with identical versions of the workflow. However, for various reasons (including the distributed nature of the collaboration), it is not always clear which samples have been processed with which version of the workflow. In this paper, we introduce new techniques for clustering provenance datasets and attempt to discover those that are likely to have been generated by the same workflow. Based on the clustering result, users can identify similar provenance and categorize it into different clusters for debugging and zoom-in/zoom-out viewing.
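The abstract does not specify the clustering techniques, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes each provenance record can be reduced to a set of hypothetical (stage, version) pairs, compares records by Jaccard similarity, and groups them by single-link clustering. The names `jaccard`, `cluster_provenance`, the `threshold` parameter, and the sample data are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_provenance(records, threshold=0.8):
    """Single-link clustering of provenance records (hypothetical helper).

    `records` maps a sample name to a set of (stage, version) pairs.
    Two samples are linked when their Jaccard similarity is at least
    `threshold`; clusters are the connected components of those links.
    """
    names = list(records)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for x, y in combinations(names, 2):
        if jaccard(records[x], records[y]) >= threshold:
            parent[find(x)] = find(y)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Made-up example data: sample_c was processed with a newer "align" stage.
records = {
    "sample_a": {("trim", "v1"), ("align", "v1"), ("call", "v1")},
    "sample_b": {("trim", "v1"), ("align", "v1"), ("call", "v1")},
    "sample_c": {("trim", "v1"), ("align", "v2"), ("call", "v1")},
}
clusters = cluster_provenance(records, threshold=0.8)
print(clusters)
```

With a threshold of 0.8, the two samples that share every stage version fall into one cluster and the sample with the modified stage is separated. Lowering the threshold merges them, which loosely mirrors the zoom-out view mentioned above.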