Word Embeddings vs Word Types for Sequence Labeling: the Curious Case of CV Parsing

Melanie Tosik, Carsten Lygteskov Hansen, Gerard Goossen, Mihai Rotaru

2015

We explore new methods of improving Curriculum Vitae (CV) parsing for German documents by applying recent research on word embeddings in Natural Language Processing (NLP). Our approach integrates word embeddings as input features for a probabilistic sequence labeling model based on the Conditional Random Field (CRF) framework. The best-performing word embeddings are generated from a large sample of German CVs. The best results on the extraction task are obtained by the model that combines the word embeddings with a number of hand-crafted features. The improvements are consistent across different sections of the target documents. The effect of the word embeddings is strongest on semi-structured, out-of-sample data.
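As a rough illustration of what "word embeddings as input features" alongside hand-crafted features can look like, the sketch below builds one feature dictionary per token, in the style commonly fed to CRF toolkits. Everything in it is an assumption made for illustration: the toy 3-dimensional vectors, the feature names, the zero-vector fallback for unknown words, and the helper functions `token_features` and `sentence_features` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical toy embeddings (3 dimensions). Real vectors would be
# trained on a large corpus, for example a sample of German CVs.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "berufserfahrung": [0.12, -0.40, 0.33],
    "universität": [0.05, 0.27, -0.19],
}
EMBEDDING_DIM = 3


def token_features(tokens: List[str], i: int) -> Dict[str, float]:
    """Build a feature dict for tokens[i]: hand-crafted cues plus embedding dimensions."""
    token = tokens[i]
    # Illustrative hand-crafted features (not the paper's feature set).
    features: Dict[str, float] = {
        "bias": 1.0,
        "is_capitalized": float(token[:1].isupper()),
        "is_digit": float(token.isdigit()),
        "is_first": float(i == 0),
        "is_last": float(i == len(tokens) - 1),
    }
    # Assumption: out-of-vocabulary tokens fall back to a zero vector.
    vector = TOY_EMBEDDINGS.get(token.lower(), [0.0] * EMBEDDING_DIM)
    # Each embedding dimension becomes one real-valued feature.
    for d, value in enumerate(vector):
        features[f"emb_{d}"] = value
    return features


def sentence_features(tokens: List[str]) -> List[Dict[str, float]]:
    """Feature dicts for every token in one line of a CV."""
    return [token_features(tokens, i) for i in range(len(tokens))]


line = ["Berufserfahrung", "2010", "Universität", "Potsdam"]
feats = sentence_features(line)
```

A sequence of such dictionaries, paired with one label per token, is the kind of input a linear-chain CRF trainer typically expects. Dropping the `emb_*` entries would give a comparable baseline that uses hand-crafted features only.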

Keywords:

Natural language processing
Parsing
Sequence labeling
Artificial intelligence
Conditional random field
Computer science
Probabilistic logic
German
Large sample
Speech recognition
