Building a specialized high-performance web crawler

2013 
In this paper, we describe the design of a specialized, high-performance web crawler that runs in a decentralized fashion. It is specialized for scraping data from New Media web sites such as blogs, Twitter, and Facebook, whose content has grown exponentially in recent years. The crawler is designed to scale easily from a single node to hundreds or more, to be resilient against crashes and other failures, to have low latency, to be polite, and to be adaptable to a variety of crawling scenarios. We discuss the architecture, the performance bottlenecks, and proper crawling etiquette.
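As an illustration of the crawling etiquette the abstract mentions, the sketch below shows two politeness mechanisms commonly used by crawlers: honoring robots.txt and enforcing a minimum per-host delay between requests. This is a minimal, hypothetical example, not the paper's implementation; the names (fetch_politely, CRAWL_DELAY) and the 2-second delay are assumptions.

```python
# Illustrative sketch only; not the crawler described in the paper.
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import urlopen

CRAWL_DELAY = 2.0   # assumed minimum seconds between requests to one host
_last_hit = {}      # host -> timestamp of the most recent request
_robots = {}        # host -> cached RobotFileParser (or None if unreachable)

def _robots_for(host, scheme="http"):
    """Fetch and cache the robots.txt parser for a host."""
    if host not in _robots:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{scheme}://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            rp = None  # robots.txt unreachable: this sketch treats it as allow-all
        _robots[host] = rp
    return _robots[host]

def fetch_politely(url, user_agent="ExampleCrawler/0.1"):
    """Fetch one URL, honoring robots.txt and a per-host crawl delay."""
    parts = urlparse(url)
    rp = _robots_for(parts.netloc, parts.scheme or "http")
    if rp is not None and not rp.can_fetch(user_agent, url):
        return None  # disallowed by robots.txt
    wait = _last_hit.get(parts.netloc, 0.0) + CRAWL_DELAY - time.time()
    if wait > 0:
        time.sleep(wait)  # never hit the same host faster than CRAWL_DELAY
    _last_hit[parts.netloc] = time.time()
    with urlopen(url, timeout=10) as resp:
        return resp.read()
```

In a decentralized deployment of the kind the paper describes, the per-host delay table would typically live wherever a host's URLs are routed (e.g., each node owning a partition of hosts), so politeness can be enforced without global coordination.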