Exploiting knowledge in unsupervised open information extraction

2012 
The extraction of structured information from text is a long-standing challenge in Natural Language Processing (NLP) that has been reinvigorated by the ever-increasing availability of user-generated textual content online. The ability to extract interesting and important pieces of information from text documents is crucial for large-scale language understanding, which powers modern Web search engines. The field of Open Information Extraction (Open IE) offers a way to automatically discover relations in large and heterogeneous text collections. Since it is difficult to obtain adequate training data for Open IE, unsupervised approaches that rely on rules and clustering are popular. However, the major trend in unsupervised Open IE has been to borrow algorithms and low-level features from other applications such as search, relying on previous work that has proved successful in other domains. This thesis argues that it is essential to use domain and external knowledge in Open IE, and proposes several ways of doing so to achieve substantial performance improvements over state-of-the-art systems. We use three main knowledge sources: (1) a large corpus of unstructured text, used to learn a language model over relations that can be incorporated into a weighting scheme that outperforms the common TF-IDF weighting scheme; (2) an external knowledge base such as Wikipedia, used to extract fine-grained entity types that yield a better understanding of how relations are expressed in English; and (3) domain knowledge extracted from the blogosphere (e.g., the degree of a node in the network), used to improve performance at scale.