An open source archival quality web crawler

Gordon Mohr, Michael Stack, Igor Ranitovic, Dan Avery, Michele Kimpton

2004

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. The Internet Archive started Heritrix development in early 2003, intending to build a crawler specifically for archiving websites that supports multiple use cases, including focused and broad crawling. The software is open source to encourage collaboration and joint development across institutions with similar needs. A pluggable, extensible architecture facilitates customization and outside contribution. Now, after more than a year of development, the Internet Archive and other institutions are using Heritrix to perform focused and increasingly broad crawls.

Keywords:

World Wide Web
Use case
Database
Software
Architecture
Extensibility
Personalization
Web crawler
Computer science
The Internet
extensible architecture
open source
Multimedia
