PDFX: fully-automated PDF-to-XML conversion of scientific literature

Alexandru Constantin, Steve Pettifer, Andrei Voronkov

2013

PDFX is a rule-based system designed to reconstruct the logical structure of scholarly articles in PDF form, regardless of their formatting style. The system's output is an XML document that describes the input article's logical structure in terms of title, sections, tables, references, etc., and links it to geometrical typesetting markers in the original PDF, such as paragraph and column breaks. The key aspect of the presented approach is that the rule set relies on relative parameters derived from the font and layout specifics of each article, rather than on a template-matching paradigm. The system thus obviates the need for domain- or layout-specific tuning or prior training, exploiting only typographical conventions inherent in scientific literature. Evaluated against a significantly varied corpus of articles from nearly 2000 different journals, PDFX achieves an F1 measure of 77.45 for top-level heading identification and 74.03 for extracting individual bibliographic items. The service is freely available for use at http://pdfx.cs.man.ac.uk/.
