Journal of Data and Information Quality

Journal Information
ISSN / EISSN : 1936-1955 / 1936-1963
Total articles ≅ 238

Latest articles in this journal

Donatello Santoro, Saravanan Thirumuruganathan, Paolo Papotti
Journal of Data and Information Quality;

This editorial summarizes the content of the Special Issue on Deep Learning for Data Quality of the Journal of Data and Information Quality (JDIQ).
, , , Stefan Grundmann, Anja Lehmann, Herbert Zech
Journal of Data and Information Quality;

This vision paper outlines the main building blocks of what we term AI Compliance, an effort to bridge two complementary research areas: computer science and the law. Such research aims to model, measure, and affect the quality of AI artifacts, such as data, models, and applications, and thereby to facilitate adherence to legal standards.
Dennis Gram, Pantelis Karapanagiotis, Marius Liebald, Uwe Walz
Journal of Data and Information Quality;

Broad, long-term financial, and economic datasets are scarce resources, particularly in the European context. In this paper, we present an approach for an extensible data model that is adaptable to future changes in technologies and sources. This model may constitute a basis for digitized and structured long-term historical datasets for different jurisdictions and periods. The data model covers the specific peculiarities of historical financial and economic data and is flexible enough to reach out for data of different types (quantitative as well as qualitative) from different historical sources, hence, achieving extensibility. Furthermore, we outline a relational implementation of this approach based on historical German firm and stock market data from 1920 to 1932.
Justin M Johnson, Taghi M Khoshgoftaar
Journal of Data and Information Quality;

Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an extensive literature review on treating label noise within big data. We begin with an introduction to the class label noise problem and traditional methods for treating label noise. Next, we present 30 methods for treating class label noise in a range of big data contexts, i.e. high volume, high variety, and high velocity problems. The surveyed works include distributed solutions capable of operating on data sets of arbitrary sizes, deep learning techniques for large-scale data sets with limited clean labels, and streaming techniques for detecting class noise in the presence of concept drift. Common trends and best practices are identified in each of these areas, implementation details are reviewed, empirical results are compared across studies when applicable, and references to 17 open-source projects and programming packages are provided. An emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes this work as a unique contribution that will inspire future research and guide machine learning practitioners.
Saravanan Thirumuruganathan, Mayuresh Kunjir, Mourad Ouzzani, Sanjay Chawla
Journal of Data and Information Quality, Volume 14, pp 1-9;

The data and Artificial Intelligence revolution has had a massive impact on enterprises, governments, and society alike. It is fueled by two key factors. First, data have become increasingly abundant and are often available openly. Enterprises have more data than they can process. Governments are spearheading open data initiatives by setting up data portals such as and releasing large amounts of data to the public. Second, AI engineering development is becoming increasingly democratized. Open source frameworks have enabled even an individual developer to engineer sophisticated AI systems. But with such ease of use comes the potential for irresponsible use of data. Ensuring that AI systems adhere to a set of ethical principles is one of the major problems of our age. We believe that data and model transparency has a key role to play in mitigating the deleterious effects of AI systems. In this article, we describe a framework to synthesize ideas from various domains such as data transparency, data quality, data governance among others to tackle this problem. Specifically, we advocate an approach based on automated annotations (of both data and the AI model), which has a number of appealing properties. The annotations could be used by enterprises to get visibility of potential issues, prepare data transparency reports, create and ensure policy compliance, and evaluate the readiness of data for diverse downstream AI applications. We propose a model architecture and enumerate its key components that could achieve these requirements. Finally, we describe a number of interesting challenges and opportunities.
, Maria-Esther Vidal, Cinzia Cappiello, Bernadette Farias Lóscio, Avigdor Gal, Matthias Jarke, Maurizio Lenzerini, Paolo Missier, Boris Otto, Elda Paja, et al.
Journal of Data and Information Quality, Volume 14, pp 1-12;

A data ecosystem (DE) offers a keystone-player or alliance-driven infrastructure that enables the interaction of different stakeholders and the resolution of interoperability issues among shared data. However, despite years of research in data governance and management, trustability is still affected by the absence of transparent and traceable data-driven pipelines. In this work, we focus on requirements and challenges that DEs face when ensuring data transparency. Requirements are derived from the data and organizational management, as well as from broader legal and ethical considerations. We propose a novel knowledge-driven DE architecture, providing the pillars for satisfying the analyzed requirements. We illustrate the potential of our proposal in a real-world scenario. Last, we discuss and rate the potential of the proposed architecture in the fulfillment of these requirements.
Journal of Data and Information Quality, Volume 14, pp 1-27;

Workflows have been adopted in several scientific fields as a tool for the specification and execution of scientific experiments. In addition to automating the execution of experiments, workflow systems often include capabilities to record provenance information, which contains, among other things, data records used and generated by the workflow as a whole but also by its component modules. It is widely recognized that provenance information can be useful for the interpretation, verification, and re-use of workflow results, justifying its sharing and publication among scientists. However, workflow execution in some branches of science can manipulate sensitive datasets that contain information about individuals. To address this problem, we investigate, in this article, the problem of anonymizing the provenance of workflows. In doing so, we consider a popular class of workflows in which component modules use and generate collections of data records as a result of their invocation, as opposed to a single data record. The solution we propose offers guarantees of confidentiality without compromising lineage information, which provides transparency as to the relationships between the data records used and generated by the workflow modules. We provide algorithmic solutions that show how the provenance of a single module and an entire workflow can be anonymized and present the results of experiments that we conducted for their evaluation.
Tooska Dargahi, Hossein Ahmadvand, Mansour Naser Alraja, Chia-Mu Yu
Journal of Data and Information Quality, Volume 14, pp 1-10;

Connected and Autonomous Vehicles (CAVs) are introduced to improve individuals’ quality of life by offering a wide range of services. They collect a huge amount of data and exchange them with each other and the infrastructure. The collected data usually includes sensitive information about the users and the surrounding environment. Therefore, data security and privacy are among the main challenges in this industry. Blockchain, an emerging distributed ledger, has been considered by the research community as a potential solution for enhancing data security, integrity, and transparency in Intelligent Transportation Systems (ITS). However, despite the emphasis of governments on the transparency of personal data protection practices, CAV stakeholders have not been successful in communicating appropriate information with the end users regarding the procedure of collecting, storing, and processing their personal data, as well as the data ownership. This article provides a vision of the opportunities and challenges of adopting blockchain in ITS from the “data transparency” and “privacy” perspective. The main aim is to answer the following questions: (1) Considering the amount of personal data collected by the CAVs, such as location, how would the integration of blockchain technology affect transparency, fairness, and lawfulness of personal data processing concerning the data subjects (as this is one of the main principles in the existing data protection regulations)? (2) How can the trade-off between transparency and privacy be addressed in blockchain-based ITS use cases?