Feature engineering for detecting spammers on Twitter: Modelling and analysis

9 January 2017

journal article
research article
Published by SAGE Publications in Journal of Information Science

Vol. 44 (2), 230-247
https://doi.org/10.1177/0165551516684296

Abstract

Twitter is a social networking website that has become widely popular around the world over the last decade. This popularity has made Twitter a common target for spammers and malicious users, who use it to spread unwanted advertisements, viruses and phishing attacks. In this article, we review recent research to determine the most effective features investigated for spam detection in the literature. These features are collected to build a comprehensive data set that can be used to develop more robust and accurate spammer detection models. The new data set is tested using popular classifiers (Naive Bayes, support vector machines, multilayer perceptron neural networks, decision trees, random forests and k-nearest neighbour). The prediction performance of these classifiers is evaluated and compared using several evaluation metrics. A further analysis is carried out to identify the features that have the greatest impact on the accuracy of spam detection. Three techniques are used and compared for this analysis: change of mean square error (CoM), information gain (IG) and the Relief-F method. The top five features identified by each technique are then used to rebuild the detection models. Experimental results show that most of the developed classifiers obtained high evaluation results on the comprehensive data set constructed in this work. The experiments also reveal the important role of features such as the reputation of the account, the average length of the tweet, the average number of mentions per tweet, the age of the account and the average time between posts in identifying spammers in the social network.
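
As a rough illustration of the information gain (IG) ranking step mentioned in the abstract, the following Python sketch scores each numeric feature by the best entropy reduction achievable with a single threshold split and keeps the top-ranked features. This is not the authors' implementation: the binary-threshold discretisation, the function names (`entropy`, `information_gain`, `best_gain`, `rank_features`), the feature names and the toy values are all assumptions made for illustration, and the numbers are invented rather than taken from the article's data set.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a numeric feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def best_gain(values, labels):
    """Best gain over midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    cuts = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max((information_gain(values, labels, c) for c in cuts),
               default=0.0)


def rank_features(table, labels, top_k=5):
    """Return the `top_k` (feature name, gain) pairs, highest gain first."""
    scores = {name: best_gain(col, labels) for name, col in table.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical toy data: invented values, NOT from the article's data set.
toy = {
    "account_age_days": [12, 30, 900, 1500, 20, 1100],
    "avg_mentions_per_tweet": [2.5, 0.4, 0.3, 2.8, 3.0, 0.5],
    "avg_tweet_length": [80, 95, 70, 85, 90, 75],
}
is_spammer = [1, 1, 0, 0, 1, 0]  # 1 = spammer, 0 = legitimate account

ranking = rank_features(toy, is_spammer, top_k=2)
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

In a workflow like the one the abstract describes, the columns selected this way would then be passed to each classifier in place of the full feature set, so that performance with and without feature selection can be compared.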

Cited by 35 articles