Feature engineering for detecting spammers on Twitter: Modelling and analysis
- 9 January 2017
- journal article
- research article
- Published by SAGE Publications in Journal of Information Science
- Vol. 44 (2), 230-247
- https://doi.org/10.1177/0165551516684296
Abstract
Twitter is a social networking website that has gained wide popularity around the world in the last decade. This popularity has made Twitter a common target for spammers and malicious users who spread unwanted advertisements, viruses and phishing attacks. In this article, we review recent research to determine the most effective features investigated for spam detection in the literature. These features are collected to build a comprehensive data set that can be used to develop more robust and accurate spammer detection models. The new data set is tested using popular classifiers (Naive Bayes, support vector machines, multilayer perceptron neural networks, decision trees, random forests and k-nearest neighbour). The prediction performance of these classifiers is evaluated and compared using several evaluation metrics. A further analysis identifies the features that have the greatest impact on the accuracy of spam detection. Three techniques are used and compared for this analysis: change of mean square error (CoM), information gain (IG) and the Relief-F method. The top five features identified by each technique are then used to rebuild the detection models. Experimental results show that most of the developed classifiers achieved high evaluation scores on the comprehensive data set constructed in this work. The experiments also reveal the important role of features such as the reputation of the account, the average length of tweets, the average number of mentions per tweet, the age of the account and the average time between posts in identifying spammers in the social network.
This publication has 14 references indexed in Scilit:
- Detecting Non‐personal and Spam Users on Geo‐tagged Twitter Network. Transactions in GIS, 2014
- A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 2014
- Multi-Class Tweet Categorization Using Map Reduce Paradigm. International Journal of Computer Trends and Technology, 2014
- SMS spam filtering: Methods and data. Expert Systems with Applications, 2012
- Content-based analysis to detect Arabic web spam. Journal of Information Science, 2012
- Epidemic Outbreak and Spread Detection System Based on Twitter Data. Lecture Notes in Computer Science, 2012
- Toward optimal feature selection using ranking methods and classification algorithms. Yugoslav Journal of Operations Research, 2011
- The WEKA data mining software. ACM SIGKDD Explorations Newsletter, 2009
- Information Gain, Correlation and Support Vector Machines. Studies in Fuzziness and Soft Computing, 2008
- Ranking importance of input parameters of neural networks. Expert Systems with Applications, 1998
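The abstract names information gain (IG) as one of the three feature-ranking techniques. As a rough illustration of the general idea only, the sketch below ranks two features by the entropy reduction obtained from a single threshold split. The toy values, the thresholds, the label names and the function names are all assumptions made for this example; none of them come from the article, and this is not the authors' implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a numeric feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Hypothetical toy data (assumption): six accounts, two features.
labels = ["spammer", "spammer", "spammer",
          "legitimate", "legitimate", "legitimate"]
features = {
    "account_age_days": [3, 10, 7, 400, 900, 650],
    "avg_tweet_length": [80, 120, 60, 75, 110, 95],
}
# Hypothetical split points (assumption), one per feature.
thresholds = {"account_age_days": 100, "avg_tweet_length": 90}

gains = {name: information_gain(values, labels, thresholds[name])
         for name, values in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)

for name in ranking:
    print(f"{name}: {gains[name]:.4f}")
```

In this toy data the account-age split separates the two classes perfectly, so it ranks first. Practical IG implementations usually search over candidate thresholds or discretise features first, rather than relying on one fixed split point per feature.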