Interactive and Deterministic Data Cleaning
- 14 June 2016
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 893-907
- https://doi.org/10.1145/2882903.2915242
Abstract
We present Falcon, an interactive, deterministic, and declarative data cleaning system, which uses SQL update queries as the language to repair data. Falcon does not rely on the existence of a set of pre-defined data quality rules. On the contrary, it encourages users to explore the data, identify possible problems, and make updates to fix them. Bootstrapped by one user update, Falcon guesses a set of possible SQL update queries that can be used to repair the data. The main technical challenge addressed in this paper consists in finding a set of SQL update queries that is minimal in size and at the same time fixes the largest number of errors in the data. We formalize this problem as a search in a lattice-shaped space. To guarantee that the chosen updates are semantically correct, Falcon navigates the lattice by interacting with users to gradually validate the set of SQL update queries. Besides using traditional one-hop based traverse algorithms (e.g., BFS or DFS), we describe novel multi-hop search algorithms such that Falcon can dive over the lattice and conduct the search efficiently. Our novel search strategy is coupled with a number of optimization techniques to further prune the search space and efficiently maintain the lattice. We have conducted extensive experiments using both real-world and synthetic datasets to show that Falcon can effectively communicate with users in data repairing.
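The abstract's idea of generalizing one user update into a lattice of candidate SQL update queries can be illustrated with a small sketch. This is not Falcon's implementation: it assumes, for illustration only, that each candidate's WHERE clause is a conjunction of equality predicates over a subset of the repaired row's attributes, so the candidates form a subset lattice. The function names (`candidate_wheres`, `to_sql`, `matching_rows`), the table name `addr`, and the sample rows are all hypothetical.

```python
from itertools import combinations


def candidate_wheres(row):
    """Return one WHERE-clause candidate per subset of the row's attributes.

    Assumption: candidates are conjunctions of equality predicates, so a row
    with n attributes yields 2**n lattice nodes (the empty subset is the most
    general node, the full subset the most specific).
    """
    attrs = sorted(row)
    nodes = []
    for k in range(len(attrs) + 1):
        for subset in combinations(attrs, k):
            nodes.append({a: row[a] for a in subset})
    return nodes


def to_sql(table, target, new_value, where):
    """Render a candidate as a SQL UPDATE string (string literals only)."""
    stmt = f"UPDATE {table} SET {target} = {new_value!r}"
    if not where:
        return stmt
    conds = " AND ".join(f"{a} = {v!r}" for a, v in where.items())
    return f"{stmt} WHERE {conds}"


def matching_rows(rows, where):
    """Rows the candidate would touch: those satisfying every predicate."""
    return [r for r in rows if all(r[a] == v for a, v in where.items())]


# Hypothetical data: the user repairs `state` from 'NJ' to 'NY' in the first row.
rows = [
    {"city": "NYC", "state": "NJ", "zip": "10001"},
    {"city": "NYC", "state": "NJ", "zip": "10002"},
    {"city": "Newark", "state": "NJ", "zip": "07102"},
]
repaired_row = rows[0]
candidates = candidate_wheres(repaired_row)

for where in candidates:
    touched = len(matching_rows(rows, where))
    print(f"{touched} row(s): {to_sql('addr', 'state', 'NY', where)}")
```

More general nodes touch more rows but risk wrong repairs (the empty WHERE clause would also rewrite the Newark row), which is why the abstract describes validating candidates with the user while navigating the lattice rather than applying the most general query outright.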