IRLbot

1 June 2009

journal article
Published by Association for Computing Machinery (ACM) in ACM Transactions on the Web

Vol. 3 (3), 1-34
https://doi.org/10.1145/1541822.1541823

Abstract

This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.
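
The throughput figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, the inputs are the rounded values from the abstract, the download rate is assumed to be in megabits per second, and all bandwidth is attributed to the valid HTML pages, which overstates the per-page size somewhat.

```python
# Rounded figures as quoted in the abstract.
PAGES = 6.3e9           # valid HTML pages crawled
REQUESTS = 7.6e9        # connection requests issued
DAYS = 41               # crawl duration in days
RATE_MBPS = 319         # average download rate; ASSUMED to be megabits/s
REPORTED_PPS = 1789     # reported average pages per second

SECONDS = DAYS * 24 * 60 * 60  # crawl duration in seconds


def pages_per_second(pages: float = PAGES, seconds: int = SECONDS) -> float:
    """Average page rate implied by the rounded totals."""
    return pages / seconds


def avg_bytes_per_page(rate_mbps: float = RATE_MBPS,
                       pps: float = REPORTED_PPS) -> float:
    """Bytes per page if all bandwidth went to valid pages (an assumption)."""
    return rate_mbps * 1e6 / 8 / pps


def valid_page_fraction(pages: float = PAGES,
                        requests: float = REQUESTS) -> float:
    """Share of connection requests that produced a valid HTML page."""
    return pages / requests


if __name__ == "__main__":
    print(f"implied rate: {pages_per_second():.0f} pages/s")
    print(f"average size: {avg_bytes_per_page() / 1000:.1f} kB/page")
    print(f"valid pages:  {valid_page_fraction():.1%} of requests")
```

The rounded totals imply roughly 1,778 pages/s, slightly below the reported 1,789 pages/s; the gap is consistent with rounding of the page count and crawl duration in the abstract.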


Cited by 122 articles