Building a specialized high-performance web crawler

2013 
In this paper, we describe the design of a specialized, high-performance web crawler that runs in a decentralized fashion. It is specialized for scraping data from New Media web sites such as blogs, Twitter, and Facebook, whose content has grown exponentially in recent years. The crawler is designed to scale easily from a single node to hundreds or more, to be resilient against crashes and other failures, to have low latency, to be polite, and to be adaptable to a variety of crawling scenarios. We discuss the architecture, the performance bottlenecks, and proper crawling etiquette.
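As an illustration of the crawling etiquette the abstract mentions, the sketch below shows two politeness mechanisms commonly used by crawlers: honoring robots.txt and enforcing a minimum per-host delay between requests. This is a minimal, hypothetical example, not the paper's implementation; the names (fetch_politely, CRAWL_DELAY) and the 2-second delay are assumptions.

```python
# Illustrative sketch only; not the crawler described in the paper.
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import urlopen

CRAWL_DELAY = 2.0   # assumed minimum seconds between requests to one host
_last_hit = {}      # host -> timestamp of the most recent request
_robots = {}        # host -> cached RobotFileParser (or None if unreachable)

def _robots_for(host, scheme="http"):
    """Fetch and cache the robots.txt parser for a host."""
    if host not in _robots:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{scheme}://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            rp = None  # robots.txt unreachable: this sketch treats it as allow-all
        _robots[host] = rp
    return _robots[host]

def fetch_politely(url, user_agent="ExampleCrawler/0.1"):
    """Fetch one URL, honoring robots.txt and a per-host crawl delay."""
    parts = urlparse(url)
    rp = _robots_for(parts.netloc, parts.scheme or "http")
    if rp is not None and not rp.can_fetch(user_agent, url):
        return None  # disallowed by robots.txt
    wait = _last_hit.get(parts.netloc, 0.0) + CRAWL_DELAY - time.time()
    if wait > 0:
        time.sleep(wait)  # never hit the same host faster than CRAWL_DELAY
    _last_hit[parts.netloc] = time.time()
    with urlopen(url, timeout=10) as resp:
        return resp.read()
```

In a decentralized deployment of the kind the paper describes, the per-host delay table would typically live wherever a host's URLs are routed (e.g., each node owning a partition of hosts), so politeness can be enforced without global coordination.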