Building Chinese Biomedical Language Models via Multi-Level Text Discrimination
2021
Pre-trained language models (PLMs), such as BERT and GPT, have revolutionized
the field of NLP, not only in the general domain but also in the biomedical
domain. Most prior efforts in building biomedical PLMs have relied simply on
domain adaptation and focused mainly on English. In this work we introduce
eHealth, a biomedical PLM in Chinese built with a new pre-training framework.
This new framework trains eHealth as a discriminator through both token-level
and sequence-level discrimination. The former detects input tokens corrupted
by a generator and selects their original signals from plausible candidates,
while the latter further distinguishes corruptions of the same original
sequence from those of other sequences. As such, eHealth learns language
semantics at both the token and sequence levels. Extensive experiments on 11
Chinese biomedical language understanding tasks of various forms verify the
effectiveness and superiority of our approach. The pre-trained model is
publicly available at
https://github.com/PaddlePaddle/Research/tree/master/KG/eHealth, and the code
will be released later.