User-configurable OCR enhancement for online natural history archives

2007 
The creation of structured digital libraries from paper-based archives is an area of growing demand in many scientific and cultural fields, and one that is met neither by off-the-shelf OCR nor by commercial form-processing systems. This paper describes and evaluates a configurable archive construction system, which integrates document image pre-processing and analysis with text post-processing tools and a standard OCR package to meet digital archiving requirements. The prototype system is currently being used in conjunction with the UK Natural History Museum to help convert more than 500,000 cards of Lepidoptera (butterflies and moths) and Coleoptera (beetles) into searchable digital archives. Evaluation results covering different aspects of the system, from card scanning to overall word recognition rates for different database fields, are summarised for two datasets comprising over 5,000 cards selected from different parts of these archives. First-pass end-to-end word recognition rates of 70–90% are reported for key data fields, subject to the availability of suitable electronic dictionaries. Further validation and correction are supported through web-editing of the online digital archive.
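The dependence of recognition rates on electronic dictionaries can be illustrated with a minimal sketch of dictionary-based text post-processing. This is not the system's actual method, which the abstract does not specify: the function name `correct_field`, the similarity cutoff, and the three-entry genus dictionary are all hypothetical, and Python's standard `difflib` fuzzy matching stands in for whatever matching the real system uses.

```python
import difflib


def correct_field(text, dictionary, cutoff=0.8):
    """Replace each OCR token with its closest dictionary entry, if one is close enough.

    `dictionary` is a list of valid words for one database field (assumed input).
    Tokens with no sufficiently similar entry are left unchanged.
    """
    # Map lower-cased entries back to their canonical spelling.
    lookup = {word.lower(): word for word in dictionary}
    candidates = list(lookup)
    corrected = []
    for token in text.split():
        matches = difflib.get_close_matches(token.lower(), candidates, n=1, cutoff=cutoff)
        corrected.append(lookup[matches[0]] if matches else token)
    return " ".join(corrected)


# Hypothetical field dictionary; a real archive would load a full taxonomic word list.
genus_dictionary = ["Papilio", "Pieris", "Vanessa"]

# "Papi1io" simulates a typical OCR confusion between the digit 1 and the letter l.
result = correct_field("Papi1io machaon", genus_dictionary)
print(result)
```

In this sketch a token is corrected only when a dictionary entry is similar enough, so words outside the dictionary (here the species name) pass through untouched and remain available for later manual correction.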