Mitigating linked data quality issues in knowledge-intense information extraction methods

19 June 2017

conference paper
Published by Association for Computing Machinery (ACM) in Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics

Abstract

Advances in research areas such as named entity linking and sentiment analysis have triggered the emergence of knowledge-intensive information extraction methods that combine classical information extraction with background knowledge from the Web. Despite data quality concerns, linked data sources such as DBpedia, GeoNames and Wikidata, which encode facts in a standardized structured format, are particularly attractive for such applications. This paper addresses the problem of data quality by introducing a framework that elaborates on linked data quality issues relevant to different stages of the background knowledge acquisition process, their impact on information extraction performance and applicable mitigation strategies. Applying this framework to named entity linking and data enrichment demonstrates the potential of the introduced mitigation strategies to lessen the impact of different kinds of data quality problems. An industrial use case that aims at the automatic generation of image metadata from image descriptions illustrates the successful deployment of knowledge-intensive information extraction in real-world applications and the constraints introduced by data quality concerns.


Cited by 3 articles