Inferring Regular Expressions with Interleaving from XML Data

2018 
Document Type Definition (DTD) and XML Schema Definition (XSD) are two popular schema languages for XML. However, many XML documents in practice are not accompanied by a schema, or at least not by a valid one. It is therefore essential to devise efficient algorithms for schema learning. Schema learning can be reduced to the inference of restricted regular expressions. In this paper, we first propose a new subclass of restricted regular expressions called Various CHAin Regular Expression with Interleaving (VCHARE). Then, based on the single occurrence automaton (SOA) and maximum independent set (MIS), we introduce an inference algorithm, GenVCHARE. We prove that the algorithm infers a descriptive generalized VCHARE from a given set of samples. Finally, we conduct a series of experiments on our own data set crawled from the Web. The experimental results show that VCHARE covers more content models than other existing subclasses of regular expressions. Moreover, on the DBLP data sets, the regular expressions inferred by GenVCHARE are more accurate and more concise than those inferred by other existing methods.
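As a rough illustration of the SOA that the abstract names as a building block, the sketch below builds a single occurrence automaton from sample words using the standard construction: one state per element name, plus a source and a sink, and one edge for every pair of adjacent symbols seen in a sample. This is not GenVCHARE itself; the MIS step and the VCHARE generation are not shown, and the names `build_soa`, `accepts`, `SRC`, and `SNK` are hypothetical, chosen only for this example.

```python
from typing import Iterable, Sequence, Set, Tuple

# Hypothetical labels for the unique source and sink states of the SOA.
SRC, SNK = "<src>", "<snk>"

Edge = Tuple[str, str]


def build_soa(samples: Iterable[Sequence[str]]) -> Tuple[Set[str], Set[Edge]]:
    """Build an SOA: one state per symbol, one edge per adjacent symbol pair."""
    states: Set[str] = {SRC, SNK}
    edges: Set[Edge] = set()
    for word in samples:
        prev = SRC
        for sym in word:
            states.add(sym)
            edges.add((prev, sym))
            prev = sym
        # The last symbol of the word (or SRC for an empty word) leads to the sink.
        edges.add((prev, SNK))
    return states, edges


def accepts(edges: Set[Edge], word: Sequence[str]) -> bool:
    """Check whether the SOA given by `edges` accepts `word`."""
    prev = SRC
    for sym in word:
        if (prev, sym) not in edges:
            return False
        prev = sym
    return (prev, SNK) in edges


if __name__ == "__main__":
    states, edges = build_soa([["a", "b"], ["b", "a"]])
    print(sorted(edges))
    print(accepts(edges, ["a", "b", "a"]))
    print(accepts(edges, ["a", "a"]))
```

With the two samples `a b` and `b a`, the resulting automaton also accepts `a b a`, a word that appears in neither sample, while it rejects `a a`.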