Learning Disjunctive Multiplicity Expressions and Disjunctive Generalize Multiplicity Expressions From Both Positive and Negative Examples

18 April 2022

journal article
research article
Published by Oxford University Press (OUP) in The Computer Journal

Vol. 66 (7), 1733-1748
https://doi.org/10.1093/comjnl/bxac037

Abstract

The presence of a schema for eXtensible Markup Language (XML) documents has numerous advantages. Unfortunately, many XML documents in practice are not accompanied by a (valid) schema. Therefore, it is essential to devise algorithms to infer schemas from XML documents, where the fundamental task is learning regular expressions. In this paper, we focus on learning disjunctive multiplicity expressions (DMEs), a subclass of regular expressions that is particularly suitable for specifying unordered models and has been used as the foundation of schemas for unordered XML. Previous work on learning DMEs lacks inference algorithms that support both positive and negative examples. Further, there is currently no algorithm that can learn DMEs extended with numeric occurrences. We address these challenges in the present paper. We first propose a novel algorithm that learns DMEs from positive and negative examples using genetic algorithms and parallel techniques. We then extend DMEs to disjunctive generalized multiplicity expressions (DGMEs), which allow numeric occurrences, and develop an algorithm to learn DGMEs from positive and negative examples. Finally, experimental results show that, given only positive examples, our algorithm generates a DME that accepts all positive examples within an acceptable learning time, and that, given both positive and negative examples, it learns DMEs or DGMEs with high accuracy.

Funding Information

National Natural Science Foundation of China (61872339, 61472405)
