Word Embeddings vs Word Types for Sequence Labeling: the Curious Case of CV Parsing

2015 
We explore new methods of improving Curriculum Vitae (CV) parsing for German documents by applying recent research on the application of word embeddings in Natural Language Processing (NLP). Our approach integrates the word embeddings as input features for a probabilistic sequence labeling model that relies on the Conditional Random Field (CRF) framework. Best-performing word embeddings are generated from a large sample of German CVs. The best results on the extraction task are obtained by the model which integrates the word embeddings together with a number of hand-crafted features. The improvements are consistent throughout different sections of the target documents. The effect of the word embeddings is strongest on semi-structured, out-of-sample data.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    12
    References
    5
    Citations
    NaN
    KQI
    []