Text mining and probabilistic language modeling for online review spam detection

1 December 2011

journal article
research article
Published by Association for Computing Machinery (ACM) in ACM Transactions on Management Information Systems

Vol. 2 (4), 1-30
https://doi.org/10.1145/2070710.2070716

Abstract

In the era of Web 2.0, huge volumes of consumer reviews are posted to the Internet every day. Manual approaches to detecting and analyzing fake reviews (i.e., spam) are not practical due to the problem of information overload. However, the design and development of automated methods of detecting fake reviews is a challenging research problem. The main reason is that fake reviews are specifically composed to mislead readers, so they may appear the same as legitimate reviews (i.e., ham). As a result, discriminatory features that would enable individual reviews to be classified as spam or ham may not be available. Guided by the design science research methodology, the main contribution of this study is the design and instantiation of novel computational models for detecting fake reviews. In particular, a novel text mining model is developed and integrated into a semantic language model for the detection of untruthful reviews. The models are then evaluated based on a real-world dataset collected from amazon.com. The results of our experiments confirm that the proposed models outperform other well-known baseline models in detecting fake reviews. To the best of our knowledge, the work discussed in this article represents the first successful attempt to apply text mining methods and semantic language models to the detection of fake consumer reviews. A managerial implication of our research is that firms can apply our design artifacts to monitor online consumer reviews to develop effective marketing or product design strategies based on genuine consumer feedback posted to the Internet.

Funding Information

Research Grants Council, University Grants Committee, Hong Kong (9041569)
Hong Kong's SRG (7002426)

This publication has 45 references indexed in Scilit.

Cited by 127 articles