Improving the Accuracy of Automated Occupation Coding at Any Production Rate

2016 
Occupation coding, an important task in official statistics, refers to coding a respondent's text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose two new methods for automatic coding: a hybrid method that combines a rule-based approach based on duplicates with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that both methods improve on the coding accuracy of the underlying statistical learning algorithm and, where duplicates exist, on the coding accuracy of duplicates. We also find that statistical learning is improved by combining separate models for the detailed occupation codes and for aggregate occupation codes. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches.
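The hybrid idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the character trigrams, the whitespace and case normalization, the majority vote among duplicates, and the 1-nearest-neighbor fallback on Jaccard similarity (a stand-in for the statistical learning algorithm) are all assumptions made for illustration. The function names and the codes `A1`, `B2`, `C3` are hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a normalized answer (assumed n=3)."""
    s = " ".join(text.lower().split())  # assumed normalization: lowercase, collapse spaces
    if len(s) < n:
        return frozenset([s])
    return frozenset(s[i:i + n] for i in range(len(s) - n + 1))


def build_duplicate_table(train):
    """Map each n-gram set to the counts of codes observed for it in training data."""
    table = defaultdict(Counter)
    for answer, code in train:
        table[char_ngrams(answer)][code] += 1
    return table


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fallback_code(answer, train):
    """Stand-in learner (assumption): code of the most similar training answer."""
    grams = char_ngrams(answer)
    best = max(train, key=lambda pair: jaccard(grams, char_ngrams(pair[0])))
    return best[1]


def hybrid_code(answer, table, train):
    """Use the duplicate rule when an n-gram duplicate exists, else the fallback."""
    key = char_ngrams(answer)
    if key in table:
        return table[key].most_common(1)[0][0], "duplicate"
    return fallback_code(answer, train), "model"


# Hypothetical training answers with made-up codes.
train = [
    ("Nurse", "A1"),
    ("nurse ", "A1"),
    ("Truck driver", "B2"),
    ("School teacher", "C3"),
]
table = build_duplicate_table(train)

print(hybrid_code("  NURSE", table, train))
print(hybrid_code("truck drivers", table, train))
```

In this sketch an answer counts as a duplicate when its n-gram set matches that of a training answer, so differences in case or spacing do not prevent a match, whereas an exact string comparison would treat them as different answers.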