Enabling information extraction by inference of regular expressions from sample entities

Falk Brauer, Robert Rieger, Adrian Mocan, Wojciech M. Barczynski

2011

Regular expressions are the dominant technique for extracting business-relevant entities (e.g., invoice numbers or product names) from text data (e.g., invoices), since these entity types often follow a strict underlying syntactic pattern. However, manually constructing regular expressions that achieve high recall and precision is tedious and requires expert knowledge. In this paper, we propose an approach that automatically infers regular expressions from a set of (positive) sample entities, which in turn can be derived either from enterprise databases (e.g., a product catalog) or from annotated documents (e.g., historical invoices). The main innovation of our approach is that it learns effective regular expressions that can be easily interpreted and modified by a user. This effectiveness is obtained by a novel method that weights dependent entity features of different granularity (i.e., at the character and token level) against each other and selects the most suitable ones to form a regular expression.
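
To make the idea of inferring a pattern from positive samples concrete, the following is a minimal sketch and not the method of the paper. It assumes a much simpler heuristic: each sample is split into runs of one character class, a run is kept as a literal when every sample agrees on it, and otherwise it is generalized to the class with a length range. The function names (`char_class`, `signature`, `infer_regex`) and the sample invoice numbers are hypothetical, and no feature weighting is performed.

```python
import re
import string
from itertools import groupby


def char_class(ch):
    """Map one character to a coarse regex class (ASCII only, by assumption)."""
    if ch in string.digits:
        return r"\d"
    if ch in string.ascii_uppercase:
        return "[A-Z]"
    if ch in string.ascii_lowercase:
        return "[a-z]"
    return re.escape(ch)  # punctuation and anything else stays literal


def signature(sample):
    """Split a sample into (class, text) runs of consecutive same-class characters."""
    return [(cls, "".join(run)) for cls, run in groupby(sample, key=char_class)]


def quantifier(lo, hi):
    """Render a repetition count, omitting it for exactly one occurrence."""
    if lo == hi:
        return "" if lo == 1 else "{%d}" % lo
    return "{%d,%d}" % (lo, hi)


def infer_regex(samples):
    """Infer one readable regex that matches every positive sample."""
    if not samples:
        raise ValueError("at least one sample entity is required")

    # Group samples by their sequence of character classes (their "shape").
    groups = {}
    for sample in samples:
        sig = signature(sample)
        shape = tuple(cls for cls, _ in sig)
        groups.setdefault(shape, []).append([text for _, text in sig])

    alternatives = []
    for shape, rows in groups.items():
        parts = []
        for i, cls in enumerate(shape):
            texts = {row[i] for row in rows}
            if len(texts) == 1:
                # All samples agree on this run: keep it as a literal.
                parts.append(re.escape(texts.pop()))
            else:
                # Otherwise generalize to the class plus a length range.
                lengths = [len(t) for t in texts]
                parts.append(cls + quantifier(min(lengths), max(lengths)))
        alternatives.append("".join(parts))

    if len(alternatives) == 1:
        return alternatives[0]
    return "(?:" + "|".join(alternatives) + ")"


# Hypothetical invoice numbers used only for illustration.
samples = ["INV-2011-0042", "INV-2010-117", "INV-2009-00031"]
pattern = infer_regex(samples)
print(pattern)
```

On these samples the shared prefix stays literal while the varying digit runs become length-bounded digit classes, so the result remains easy to read and edit by hand. Choosing between the literal and the generalized form by a fixed rule is exactly the kind of decision that the approach described above makes by weighting features against each other.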


References

Citations