Link analysis for Web spam detection
- 1 February 2008
- journal article
- research article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on the Web
- Vol. 2 (1), 1-42
- https://doi.org/10.1145/1326561.1326563
Abstract
We propose link-based techniques for automatic detection of Web spam, a term referring to pages that use deceptive techniques to obtain undeservedly high scores in search engines. Web spam is widespread and difficult to combat, mostly because the large size of the Web makes many algorithms infeasible in practice. We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page, applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. These statistical features are used to build Web spam classifiers that consider only the link structure of the Web, regardless of page contents. We then present a study of the performance of each of the classifiers alone, as well as their combined performance, by testing them over a large collection of Web link spam. Under tenfold cross-validation, our best classifiers perform comparably to state-of-the-art spam classifiers that use content attributes, while remaining orthogonal to content-based methods.
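The idea of probabilistic counting over a link graph can be illustrated with a minimal sketch. It assumes a Flajolet-Martin-style bitmask estimator and is not the paper's own implementation; the function names, the `NUM_SKETCHES` setting, and the toy graph in the usage note are hypothetical. Each page starts with a small set of hashed bitmasks, and the masks are OR-ed along links for a fixed number of rounds, so every page ends up with a compact summary of the pages that can reach it (its "supporters") within that distance.

```python
import hashlib

NUM_SKETCHES = 64   # assumption: more independent sketches give lower variance
PHI = 0.77351       # correction constant of the Flajolet-Martin estimator


def _lowest_set_bit_mask(node, salt):
    """Hash (salt, node) and keep only the lowest set bit of the hash."""
    digest = hashlib.sha256(f"{salt}:{node}".encode()).digest()
    h = int.from_bytes(digest[:8], "big") | (1 << 63)  # force a nonzero value
    return h & -h


def estimate_supporters(out_links, distance):
    """Estimate, for every page, how many pages reach it in <= distance hops.

    out_links maps a page to the list of pages it links to. The count
    includes the page itself. Estimates are rough, especially for small counts.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    # One bitmask per (page, salt); each mask initially describes only the page.
    sketches = {
        v: [_lowest_set_bit_mask(v, s) for s in range(NUM_SKETCHES)]
        for v in nodes
    }

    # Each round pushes every page's masks one hop along its out-links.
    for _ in range(distance):
        updated = {v: list(masks) for v, masks in sketches.items()}
        for src, targets in out_links.items():
            for dst in targets:
                updated[dst] = [a | b for a, b in zip(updated[dst], sketches[src])]
        sketches = updated

    estimates = {}
    for v, masks in sketches.items():
        total = 0
        for mask in masks:
            r = 0
            while (mask >> r) & 1:  # position of the lowest zero bit
                r += 1
            total += r
        estimates[v] = 2 ** (total / NUM_SKETCHES) / PHI
    return estimates
```

As a usage example, `estimate_supporters({"a": ["c"], "b": ["c"]}, 1)` returns one estimate per page, with the estimate for `"c"` at least as large as those for `"a"` and `"b"`. The appeal of this design is that memory per page stays fixed regardless of how many supporters it has, which is what makes a pass over a very large graph feasible.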
Funding Information
- Seventh Framework Programme (IST-015964 AEOLUS, IST-001907 DELIS)
- Ministero dell'Istruzione, dell'Università e della Ricerca (RBIN047MH9)