ACM Transactions on Knowledge Discovery from Data

Journal Information
ISSN / EISSN : 1556-4681 / 1556-472X
Total articles ≅ 650

Latest articles in this journal

Cong Tran, , Andreas Spitz
ACM Transactions on Knowledge Discovery from Data, Volume 16, pp 1-24;

The discovery of community structures in social networks has gained significant attention since it is a fundamental problem in understanding the networks’ topology and functions. However, most social network data are collected from partially observable networks with both missing nodes and edges. In this article, we address a new problem of detecting overlapping community structures in the context of such an incomplete network, where communities in the network are allowed to overlap since nodes may belong to multiple communities at once. To solve this problem, we introduce KroMFac, a new framework that conducts community detection via regularized nonnegative matrix factorization (NMF) based on the Kronecker graph model. Specifically, from an inferred Kronecker generative parameter matrix, we first estimate the missing part of the network. As our major contribution to the proposed framework, to improve community detection accuracy, we then characterize and select influential nodes (which tend to have high degrees) by ranking, and add them to the existing graph. Finally, we uncover the community structures by solving the regularized NMF-aided optimization problem in terms of maximizing the likelihood of the underlying graph. Furthermore, adopting normalized mutual information (NMI), we empirically show the superiority of our KroMFac approach over two baseline schemes by using both synthetic and real-world networks.
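The abstract above evaluates detected communities with normalized mutual information (NMI). As a rough illustration of that metric only, the sketch below computes NMI between two disjoint partitions given as label lists. It is not the KroMFac evaluation code: the function name is hypothetical, the geometric-mean normalization is one common convention among several, and overlapping communities (the setting of the article) would need an overlapping-NMI variant that this sketch does not implement.

```python
from collections import Counter
from math import log, sqrt


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two disjoint partitions given as equal-length label lists.

    Assumption: normalization by the geometric mean of the two entropies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    # Mutual information I(A; B) from joint and marginal frequencies.
    mutual_info = sum(
        (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in count_ab.items()
    )
    # Shannon entropies H(A) and H(B).
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if entropy_a == 0.0 or entropy_b == 0.0:
        # Convention chosen here: two single-cluster partitions agree fully.
        return 1.0 if entropy_a == entropy_b else 0.0
    return mutual_info / sqrt(entropy_a * entropy_b)


# Same grouping under different label names, and an unrelated grouping.
same_grouping = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI is invariant to how clusters are named, which is why it suits comparing detected communities against ground truth whose labels are arbitrary.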
ACM Transactions on Knowledge Discovery from Data, Volume 16, pp 1-35;

The spread of online reviews and opinions and their growing influence on people’s behavior and decisions have boosted interest in extracting meaningful information from this data deluge. Hence, crowdsourced ratings of products and services have gained a critical role in business and government. Current state-of-the-art solutions rank items by the average of the ratings expressed for each item, which leads to a lack of personalization for users and to exposure to attacks and spamming/spurious users. Using these ratings to group users with similar preferences might be useful to present users with items that reflect their preferences and to overcome those vulnerabilities. In this article, we propose a new reputation-based ranking system, utilizing multipartite rating subnetworks, which clusters users by their similarities using three measures, two of them based on Kolmogorov complexity. We also study its resistance to bribery and how to design optimal bribing strategies. Our system is novel in that it reflects the diversity of preferences by (possibly) assigning distinct rankings to the same item for different groups of users. We prove the convergence and efficiency of the system. By testing it on synthetic and real data, we see that it copes better with spamming/spurious users, being more robust to attacks than state-of-the-art approaches. Also, by clustering users, the effect of bribery in the proposed multipartite ranking system is dimmed compared to the bipartite case.
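To make the contrast between a single average-based ranking and per-group rankings concrete, here is a minimal sketch. It assumes the user groups are already known and simply averages ratings within each group; the article's similarity measures, clustering procedure, and reputation weighting are not reproduced, and all names and data below are hypothetical.

```python
from collections import defaultdict


def rank_items(ratings):
    """Rank items by mean rating, best first (ties broken by item name).

    `ratings` is an iterable of (user, item, score) triples.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for _user, item, score in ratings:
        totals[item][0] += score
        totals[item][1] += 1
    means = {item: total / count for item, (total, count) in totals.items()}
    return sorted(means, key=lambda item: (-means[item], item))


def rank_items_per_group(ratings, group_of):
    """Produce one ranking per user group; `group_of` maps user -> group."""
    by_group = defaultdict(list)
    for user, item, score in ratings:
        by_group[group_of[user]].append((user, item, score))
    return {group: rank_items(rs) for group, rs in by_group.items()}


# Toy data: two groups with opposite tastes (illustrative values only).
ratings = [
    ("u1", "X", 5), ("u1", "Y", 1),
    ("u2", "X", 5), ("u2", "Y", 2),
    ("u3", "X", 1), ("u3", "Y", 5),
    ("u4", "X", 2), ("u4", "Y", 4),
]
group_of = {"u1": "A", "u2": "A", "u3": "B", "u4": "B"}

global_ranking = rank_items(ratings)
group_rankings = rank_items_per_group(ratings, group_of)
```

In this toy data the global average places X first for everyone, while the per-group view ranks Y first for group B, which is the kind of divergence the abstract describes.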
Fandel Lin,
ACM Transactions on Knowledge Discovery from Data, Volume 16, pp 1-36;

In this work, a novel decision assistant system for urban transportation, called Route Scheme Assistant (RSA), is proposed to address two crucial issues that few prior studies have focused on: route-based passenger flow (PF) inference and multivariant high-PF route recommendation. First, RSA can effectively estimate the PF of arbitrary user-designated routes by using a deep neural network (DNN) for regression based on geographical information and spatial-temporal urban informatics. Second, our proposed Bidirectional Prioritized Spanning Tree (BDPST) combines the parallel computing concept with a Gaussian mixture model (GMM) to recommend routes under users’ constraints in a timely manner. We conducted experiments on bus-ticket data from Tainan and Chicago, and the experimental results show that the PF inference model outperforms baseline and comparative methods by 41% to 57%. Moreover, the proposed BDPST algorithm's performance is close to the optimal PF and outperforms other comparative methods by 39% to 71% in large-scale route recommendations.
Lichen Wang, Zhengming Ding, Yun Fu
ACM Transactions on Knowledge Discovery from Data, Volume 16, pp 1-20;

Multi-label learning recovers multiple labels from a single instance. It is a more challenging task than single-label learning. Most multi-label learning approaches need large-scale, well-labeled samples to achieve highly accurate performance. However, it is expensive to build such a dataset. In this work, we propose a generic multi-label learning framework based on Adaptive Graph and Marginalized Augmentation (AGMA) in a semi-supervised scenario. Generally speaking, AGMA makes use of a small amount of labeled data together with a large amount of unlabeled data to boost the learning performance. First, an adaptive similarity graph is learned to effectively capture the intrinsic structure within the data. Second, a marginalized augmentation strategy is explored to enhance the model's generalization and robustness. Third, a feature-label autoencoder is further deployed to improve inference efficiency. All the modules are jointly trained to benefit each other. State-of-the-art benchmarks in both traditional and zero-shot multi-label learning scenarios are evaluated. Experiments and ablation studies illustrate the accuracy and efficiency of our AGMA method.
Huafeng Liu, Liping Jing, Jingxuan Wen, Pengyu Xu, Jian Yu, Michael K. Ng
ACM Transactions on Knowledge Discovery from Data, Volume 16, pp 1-34;

Social relations between users have been proven to be a good type of auxiliary information for improving recommendation performance. However, it is challenging to sufficiently exploit the social relations and correctly determine user preferences from both social and rating information. In this article, we propose a unified Bayesian Additive Matrix Approximation model (BAMA), which takes advantage of rating preferences and the social network to provide high-quality recommendations. The basic idea of BAMA is to extract social influence from social networks, integrate it into Bayesian additive co-clustering to effectively determine the user clusters and item clusters, and provide an accurate rating prediction. In addition, an efficient algorithm based on collapsed Gibbs sampling is designed to perform inference for the proposed model. A series of experiments were conducted on six real-world social datasets. The results demonstrate the superiority of the proposed BAMA over state-of-the-art methods from three views: all users, cold-start users, and users with few social relations. Furthermore, with the aid of social information, BAMA is able to provide explainable recommendations.
Jinjin Guo, Longbing Cao, Zhiguo Gong
ACM Transactions on Knowledge Discovery from Data, Volume 16, pp 1-32;

Abundant sequential documents such as online archives, social media, and news feeds are updated in a streaming fashion, where each chunk of documents carries smoothly evolving yet dependent topics. Such digital texts have attracted extensive research on dynamic topic modeling to infer hidden evolving topics and their temporal dependencies. However, most of the existing approaches focus on single-topic-thread evolution and ignore the fact that a current topic may be coupled with multiple relevant prior topics. In addition, these approaches also incur the intractable inference problem when inferring latent parameters, resulting in a high computational cost and performance degradation. In this work, we assume that a current topic evolves from all prior topics with corresponding coupling weights, forming the multi-topic-thread evolution. Our method models the dependencies between evolving topics and thoroughly encodes their complex multi-couplings across time steps. To conquer the intractable inference challenge, a new solution with a set of novel data augmentation techniques is proposed, which successfully decomposes the multi-couplings between evolving topics. A fully conjugate model is thus obtained to guarantee the effectiveness and efficiency of the inference technique. A novel Gibbs sampler with a backward–forward filter algorithm efficiently learns latent time-evolving parameters in closed form. In addition, the latent Indian Buffet Process compound distribution is exploited to automatically infer the overall topic number and customize the sparse topic proportions for each sequential document without bias. The proposed method is evaluated on both synthetic and real-world datasets against competitive baselines, demonstrating its superiority over the baselines in terms of low per-word perplexity, highly coherent topics, and better document time prediction.
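Per-word perplexity, one of the metrics named in the abstract above, is conventionally the exponential of the negative mean log-likelihood per token. The short sketch below shows that standard definition only; the function name is hypothetical, and how the article obtains the per-token probabilities from its topic model is not covered here.

```python
from math import exp, log


def per_word_perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of held-out tokens.

    Lower is better: a model that spreads probability uniformly over
    k choices for every token has perplexity k.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return exp(-mean_log_prob)


# Ten tokens, each assigned probability 1/4 by a hypothetical model.
uniform_case = per_word_perplexity([log(0.25)] * 10)
```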
Haobing Liu, Yanmin Zhu, Tianzi Zang, Yanan Xu, Jiadi Yu, Feilong Tang
ACM Transactions on Knowledge Discovery from Data, Volume 16, pp 1-24;

Prediction tasks about students have practical significance for both students and colleges. Making multiple predictions about students is an important part of a smart campus. For instance, predicting whether a student will fail to graduate can alert the student affairs office to take predictive measures to help the student improve his/her academic performance. With the development of information technology in colleges, we can continuously collect digital footprints that encode heterogeneous behaviors. In this article, we focus on modeling heterogeneous behaviors and making multiple predictions together, since some prediction tasks are related and learning a model for a single task may suffer from data sparsity. To this end, we propose a variant of Long Short-Term Memory (LSTM) and a soft-attention mechanism. The proposed LSTM is able to learn a student profile-aware representation from heterogeneous behavior sequences. The proposed soft-attention mechanism can dynamically learn the different degrees of importance of different days for every student. In this way, heterogeneous behaviors can be well modeled. To model interactions among multiple prediction tasks, we propose a co-attention-based unit. With the help of the stacked units, we can explicitly control the knowledge transfer among multiple tasks. We design three motivating behavior prediction tasks based on a real-world dataset collected from a college. Qualitative and quantitative experiments on the three prediction tasks have demonstrated the effectiveness of our model.
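The soft-attention idea mentioned above, weighting different days differently for each student, can be illustrated with a generic dot-product attention over per-day vectors. This is a textbook-style sketch, not the article's architecture: the per-day vectors would come from the proposed LSTM and the query vector would be learned, whereas here both are hypothetical hand-written inputs.

```python
from math import exp


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(day_vectors, query):
    """Return (weights, summary) for dot-product soft attention.

    day_vectors: one feature vector per day (all the same length as query).
    query: assumed per-student vector that scores each day's relevance.
    """
    scores = [sum(h * q for h, q in zip(vec, query)) for vec in day_vectors]
    weights = softmax(scores)
    summary = [
        sum(w * vec[i] for w, vec in zip(weights, day_vectors))
        for i in range(len(query))
    ]
    return weights, summary


# Three days of two-dimensional features; the query favors the first feature.
weights, summary = attend([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 0.0])
```

The weights sum to one, so the summary is a convex combination of the daily vectors in which days that match the query contribute more.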
, Jun-Peng Fang, Yi-Bo Wang
ACM Transactions on Knowledge Discovery from Data, Volume 16, pp 1-23;

In multi-label classification, the task is to induce predictive models that can assign a set of relevant labels to an unseen instance. The strategy of label-specific features has been widely employed in learning from multi-label examples, where the classification model for predicting the relevancy of each class label is induced based on its tailored features rather than the original features. Existing approaches work by generating a group of tailored features for each class label independently, so label correlations are not fully considered in the label-specific feature generation process. In this article, we extend the existing strategy by proposing a simple yet effective approach based on BiLabel-specific features. Specifically, a group of tailored features is generated for a pair of class labels with heuristic prototype selection and embedding. Thereafter, predictions of classifiers induced by BiLabel-specific features are ensembled to determine the relevancy of each class label for an unseen instance. To thoroughly evaluate the BiLabel-specific features strategy, extensive experiments are conducted over a total of 35 benchmark datasets. Comparative studies against state-of-the-art label-specific features techniques clearly validate the superiority of utilizing BiLabel-specific features to yield stronger generalization performance for multi-label classification.
Juhee Han,
ACM Transactions on Knowledge Discovery from Data, Volume 16, pp 1-11;

Competitor analysis is an essential component of corporate strategy, providing both offensive and defensive strategic contexts to identify opportunities and threats. The rapid development of social media has recently led to several methodologies and frameworks facilitating competitor analysis through online reviews. Existing studies have focused only on detecting comparative sentences in review comments or have utilized low-performance models. In contrast, this study proposes a novel approach to identifying competitive factors at the comprehensive product feature level using a recent explainable artificial intelligence approach. We establish a model to classify the review comments for each corresponding product and evaluate the relevance of each keyword in such comments during the classification process. We then extract and prioritize the keywords and determine their competitiveness based on relevance. Our experimental results show that the proposed method can effectively extract the competitive factors both qualitatively and quantitatively.
Tong Xia, Junjie Lin, Yong Li, Jie Feng, Pan Hui, Funing Sun, Diansheng Guo, Depeng Jin
ACM Transactions on Knowledge Discovery from Data, Volume 15, pp 1-21;

Crowd flow prediction is an essential task benefiting a wide range of applications for the transportation system and public safety. However, it is a challenging problem due to the complex spatio-temporal dependence and the complicated impact of urban structure on the crowd flow patterns. In this article, we propose a novel framework, 3-Dimensional Graph Convolution Network (3DGCN), to predict citywide crowd flow. We first model it as a dynamic spatio-temporal graph prediction problem, where each node represents a region with time-varying flows, and each edge represents the origin–destination (OD) flow between its corresponding regions. As such, OD flows among regions are treated as a proxy for the spatial interactions among regions. To tackle the complex spatio-temporal dependence, our proposed 3DGCN can model the correlation among graph spatial and temporal neighbors simultaneously. To learn and incorporate urban structures in crowd flow prediction, we design the GCN aggregator to be learned from both crowd flow prediction and region function inference at the same time. Extensive experiments with real-world datasets in two cities demonstrate that our model outperforms state-of-the-art baselines by 9.6%∼19.5% for the next-time-interval prediction.