Learning to Rerank Schema Matches

27 December 2019

journal article
research article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Knowledge and Data Engineering

Vol. 33 (8), 3104-3116
https://doi.org/10.1109/tkde.2019.2962124

Abstract

Schema matching is at the heart of integrating structured and semi-structured data, with applications in data warehousing, data analysis recommendations, Web table matching, and more. Schema matching is known to be an uncertain process, and a common way to cope with this uncertainty is to present a human expert with a ranked list of possible schema matches to choose from, known as top-K matching. In this work we propose a learning algorithm that uses a novel set of features to rerank a list of schema matches and improve the ranking of the best match. We provide a bound on the size of the initial match list, tying the number of matches to a desired level of confidence in finding the best match. We also propose the use of matching predictors as features in a learning task, and tailor nine new matching predictors for this purpose. The proposed algorithm assists the matching process by presenting a human expert with a high-quality set of alternative matches. It also serves as a step towards eliminating the involvement of human experts as decision makers in the matching process altogether. A large-scale empirical evaluation with a real-world benchmark shows the effectiveness of the proposed algorithmic solution.

Funding Information

JPMorgan Chase and Company
