Content-based analysis to detect Arabic web spam

19 April 2012

journal article
research article
Published by SAGE Publications in Journal of Information Science

Vol. 38 (3), 284-296
https://doi.org/10.1177/0165551512439173

Abstract

Search engines are important outlets for information query and retrieval. They have to deal with the continual increase of information available on the web and provide users with convenient access to it. With this much information, an increasingly difficult challenge is eliminating spam in web pages. For various reasons, web spammers try to intrude into search results and inject artificially biased results in favour of their websites or pages. Spam pages are added to the internet on a daily basis, making it difficult for search engines to keep up with the fast-growing and dynamic nature of the web, especially since spammers tend to add more keywords to their websites to deceive the search engines and increase the rank of their pages. In this research, we investigated four classification algorithms (naïve Bayes, decision tree, SVM and K-NN) for detecting Arabic web spam pages based on content. The three groups of datasets used, with 1%, 15% and 50% spam content, were collected using a crawler customized for this study. Spam pages were labelled manually. Different tests and comparisons revealed that the decision tree was the best classifier for this purpose.
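
To make the idea of content-based detection concrete, the following is a minimal sketch of one of the four classifier families named in the abstract (K-NN), applied to two keyword-stuffing signals. It is not the authors' feature set, dataset, or implementation: the two features (unique-word ratio and share of the most frequent word), the function names `content_features`, `knn_predict` and `classify`, the choice of k = 3, and the tiny training set are all assumptions invented for illustration.

```python
import math
from collections import Counter


def content_features(text):
    """Return (unique-word ratio, share of the most frequent word).

    Both features are hypothetical stand-ins for content-based signals;
    heavy keyword repetition lowers the first and raises the second.
    """
    words = text.split()
    if not words:
        return (0.0, 0.0)
    counts = Counter(words)
    unique_ratio = len(counts) / len(words)
    top_share = counts.most_common(1)[0][1] / len(words)
    return (unique_ratio, top_share)


def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to query (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented toy pages, labelled by hand purely for this illustration.
TOY_PAGES = [
    ("أخبار الطقس اليوم في المدينة", "non-spam"),
    ("وصفة سهلة لتحضير الخبز في المنزل", "non-spam"),
    ("جدول مباريات الدوري لهذا الأسبوع", "non-spam"),
    ("رخيص رخيص رخيص تسوق رخيص الآن", "spam"),
    ("ربح ربح ربح مال ربح ربح", "spam"),
    ("عروض عروض فنادق عروض عروض فنادق", "spam"),
]

# Precompute feature vectors for the training pages.
TRAIN = [(content_features(text), label) for text, label in TOY_PAGES]


def classify(text, k=3):
    """Label a page's text as 'spam' or 'non-spam' using the toy training set."""
    return knn_predict(TRAIN, content_features(text), k)


if __name__ == "__main__":
    print(classify("تخفيضات تخفيضات تخفيضات هواتف تخفيضات"))
    print(classify("مقال عن تاريخ الخط العربي"))
```

A realistic study would need proper Arabic tokenization and far more labelled pages than this sketch uses; the example only shows how repeated keywords can separate the two classes in a simple feature space.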

Cited by 13 articles