Geographically focused collaborative crawling
- 23 May 2006
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 287-296
- https://doi.org/10.1145/1135777.1135822
Abstract
A collaborative crawler is a group of crawling nodes, in which each crawling node is responsible for a specific portion of the web. We study the problem of collecting geographically-aware pages using collaborative crawling strategies. We first propose several collaborative crawling strategies for the geographically focused crawling, whose goal is to collect web pages about specified geographic locations, by considering features like URL address of page, content of page, extended anchor text of link, and others. Later, we propose various evaluation criteria to qualify the performance of such crawling strategies. Finally, we experimentally study our crawling strategies by crawling the real web data showing that some of our crawling strategies greatly outperform the simple URL-hash based partition collaborative crawling, in which the crawling assignments are determined according to the hash-value computation over URLs. More precisely, features like URL address of page and extended anchor text of link are shown to yield the best overall performance for the geographically focused crawling.
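The contrast the abstract draws between URL-hash partitioning and feature-based assignment can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `NODE_FOR_CITY` table, the choice of SHA-256, and the plain substring match on the URL are all assumptions made for illustration only.

```python
import hashlib
from urllib.parse import urlsplit


def hash_partition(url: str, num_nodes: int) -> int:
    """Baseline: pick a crawling node from a hash of the URL string."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes


# Hypothetical table: which node is responsible for which location.
NODE_FOR_CITY = {"boston": 0, "chicago": 1, "houston": 2}


def url_feature_partition(url: str, num_nodes: int) -> int:
    """Assumed URL-feature strategy: route a URL to the node that owns
    a city name found in its host or path; otherwise fall back to the
    hash-based baseline."""
    parts = urlsplit(url.lower())
    text = parts.netloc + parts.path
    for city, node in NODE_FOR_CITY.items():
        if city in text:
            return node
    return hash_partition(url, num_nodes)


if __name__ == "__main__":
    for u in ("http://www.boston.com/news",
              "https://example.org/events/Chicago/2006",
              "http://example.org/"):
        print(u, hash_partition(u, 3), url_feature_partition(u, 3))
```

Under the hash baseline, a page about a given city can land on any node, while the feature-based variant keeps pages that mention a location in their URL on the node responsible for that location. The substring match is deliberately crude and would need a real gazetteer and disambiguation in practice.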