Learning Concise Relax NG Schemas Supporting Interleaving from XML Documents

2018 
Relax NG is a popular and powerful schema language for XML that constrains the relative order of elements. Since many XML documents in practice either have no schema or have an invalid one, we focus on inferring a concise Relax NG schema from a set of XML documents. The fundamental task of Relax NG inference is learning regular expressions. Previous work in this direction does not support all operators allowed in Relax NG, in particular interleaving. In this paper, based on an analysis of large-scale real-world Relax NG schemas, we propose a restricted subclass of regular expressions called chain regular expressions with interleaving (ICREs). We also develop a learning algorithm, based on single occurrence automata and the maximum clique, that infers a descriptive generalized ICRE from XML samples. We conduct experiments on a real benchmark from DBLP. The experimental results show that ICREs are expressive enough to cover the vast majority of practical Relax NG schemas. Our algorithm learns effectively from both small and large datasets, and its results are concise and more precise than those of other popular methods.
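
The abstract names single occurrence automata (SOAs) as a building block but does not show how one is obtained. As a minimal sketch, assuming the common 2-gram construction (every pair of adjacent child elements seen in a sample becomes an edge), an SOA could be built as follows. The names `build_soa`, `accepts`, `SRC` and `SINK` are hypothetical and are not taken from the paper; this is not the authors' ICRE learning algorithm, which also relies on a maximum clique step not shown here.

```python
from typing import Iterable, Sequence, Set, Tuple

# Hypothetical markers for the automaton's unique start and end states.
SRC, SINK = "<src>", "<sink>"

Edge = Tuple[str, str]


def build_soa(samples: Iterable[Sequence[str]]) -> Set[Edge]:
    """Collect the edges of a single occurrence automaton.

    Each element name is a single state, so the automaton is fully
    described by its edges: one edge per adjacent pair seen in a sample.
    """
    edges: Set[Edge] = set()
    for word in samples:
        path = [SRC, *word, SINK]
        edges.update(zip(path, path[1:]))
    return edges


def accepts(edges: Set[Edge], word: Sequence[str]) -> bool:
    """Return True if every adjacent pair in `word` is an edge of the SOA."""
    path = [SRC, *word, SINK]
    return all(pair in edges for pair in zip(path, path[1:]))


# Child sequences in which `a` and `b` appear in either order.
samples = [["a", "b"], ["b", "a"]]
edges = build_soa(samples)
```

With these two samples the automaton also accepts `["a", "b", "a"]`, a sequence that an interleaving of `a` and `b` would reject. This kind of over-generalization on unordered content is one reason an explicit interleaving operator is useful.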