IRLbot
- 1 June 2009
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on the Web
- Vol. 3 (3), 1-34
- https://doi.org/10.1145/1541822.1541823
Abstract
This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.
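The figures in the abstract can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope check, not anything taken from the article: the function and variable names are invented for illustration, the reported bandwidth is assumed to be in megabits per second (10^6 bits), and all bandwidth is attributed to the valid HTML pages, which will somewhat overstate the average page size.

```python
SECONDS_PER_DAY = 86_400

def average_rate(total: float, days: float) -> float:
    """Average events per second over a crawl lasting `days` days."""
    return total / (days * SECONDS_PER_DAY)

# Figures quoted in the abstract (all rounded by the authors).
valid_pages = 6.3e9        # valid HTML pages crawled
crawl_days = 41            # duration of the experiment
reported_rate = 1_789      # pages/s, as reported
bandwidth_bits = 319e6     # assumption: 319 megabits per second
links_parsed = 394e9       # links parsed out of the pages

# Pages per second implied by the rounded totals.
implied_rate = average_rate(valid_pages, crawl_days)

# Rough average bytes per page, assuming all bandwidth went to valid pages.
bytes_per_page = bandwidth_bits / 8 / reported_rate

# Average number of links parsed per valid page.
links_per_page = links_parsed / valid_pages

print(f"implied rate:   {implied_rate:,.0f} pages/s")
print(f"bytes per page: {bytes_per_page:,.0f}")
print(f"links per page: {links_per_page:.1f}")
```

The implied rate lands within about one percent of the reported 1,789 pages/s, which is the agreement one would expect given that the page count and duration are rounded in the abstract.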