A Comparative Study of Sequence Tagging Methods for Domain Knowledge Entity Recognition in Biomedical Papers

2020 
Named entity recognition has been extensively studied in the past decade. State-of-the-art models, trained on general text such as Wikipedia articles and newsletters, have achieved F_1 > 0.90. These models focus on entity types such as person, location, and organization. However, entity recognition in domain-specific text, in particular research papers, remains challenging. In this paper, we perform a comparative study of sequence tagging (ST) methods on this task, using a manually curated corpus of biomedical papers on Lyme disease. Each model we compare consists of an ST component and a non-ST classification component. In this pilot study, we freeze the non-ST classifier and study how the ST component performs with variants of the conditional random field (CRF) and bidirectional long short-term memory (BiLSTM). The results highlight the importance of pre-trained word embeddings such as ELMo and of the residual unit. The attention mechanism and enriched features do not seem to improve performance in recognizing entity mentions and their positions, which is likely due to the relatively small training sample. We plan to improve the model by increasing the size of the training corpus and trying different combinations of features.
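To make "recognizing entity mentions and their positions" concrete, the following is a minimal sketch of entity-level F_1 scoring over BIO-tagged sequences. It is an illustration only: the abstract does not specify the tagging scheme or the evaluation protocol, so the BIO scheme, the exact-match criterion, the function names `bio_to_spans` and `span_f1`, and the labels `Disease` and `Organism` are all assumptions.

```python
from typing import List, Set, Tuple

# (start, end_exclusive, entity_type); hypothetical span representation
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of typed spans."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes a span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An "I-" tag with no open span of the same type is ignored.
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Exact-match F_1: a predicted span counts only if its
    boundaries and type both agree with a gold span."""
    g, p = set(), set()
    for k, (gold_tags, pred_tags) in enumerate(zip(gold, pred)):
        g |= {(k,) + s for s in bio_to_spans(gold_tags)}
        p |= {(k,) + s for s in bio_to_spans(pred_tags)}
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels; the tagger finds one of the two gold entities.
gold = [["B-Disease", "I-Disease", "O", "B-Organism"]]
pred = [["B-Disease", "I-Disease", "O", "O"]]
score = span_f1(gold, pred)
```

Under exact matching, a mention with the right type but a wrong boundary counts as both a false positive and a false negative, which is why position errors weigh heavily on this metric.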