Learning Disjunctive Multiplicity Expressions and Disjunctive Generalize Multiplicity Expressions From Both Positive and Negative Examples

Abstract
The presence of a schema for eXtensible Markup Language (XML) documents has numerous advantages. Unfortunately, many XML documents in practice are not accompanied by a (valid) schema. Therefore, it is essential to devise algorithms to infer schemas from XML documents, where the fundamental task is learning regular expressions. In this paper, we focus on the learning of disjunctive multiplicity expressions (DMEs), a subclass of regular expressions that are particularly suitable to specify unordered models and have been used as the foundation of the schemas for unordered XML. Previous work for learning DME lacks inference algorithms that support positive and negative examples. Further, presently there has been no algorithm can learn DMEs extended with numeric occurrences. We address these challenges in the present paper and first propose a novel algorithm to learn DMEs from positive and negative examples by using genetic algorithms and parallel techniques. Then we extend DMEs to disjunctive generalized multiplicity expressions (DGMEs), which allow numeric occurrences and develop an algorithm to learn DGMEs from positive and negative examples. Finally, experimental results show that with only positive examples, our algorithm can generate a DME with an acceptable learning time, which can accept all positive examples, and when given both positive and negative examples, we can learn DMEs or DGMEs with high accuracy.
Funding Information
  • National Natural Science Foundation of China (61872339, 61472405)