Switch Point biased Self-Training: Re-purposing Pretrained Models for Code-Switching
2021
Code-switching (CS), a ubiquitous phenomenon due to the ease of communication
it offers in multilingual communities, remains an understudied problem in
language processing. The primary reasons behind this are: (1) minimal efforts
in leveraging large pretrained multilingual models, and (2) the lack of
annotated data. The distinguishing cause of the low performance of multilingual
models on CS is the intra-sentence mixing of languages, which leads to switch
points. We first benchmark two sequence labeling tasks -- POS and NER -- on 4
different language pairs with a suite of pretrained models to identify the
problems and select the best-performing model, char-BERT, among them
(addressing (1)). We then propose a self-training method to repurpose the
existing pretrained models using a switch-point bias by leveraging unannotated
data (addressing (2)). We finally demonstrate that our approach performs well
on both tasks by reducing the gap between switch-point and overall performance
while retaining the overall performance on two distinct language pairs for both
tasks. Our code is available here:
https://github.com/PC09/EMNLP2021-Switch-Point-biased-Self-Training.
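
As a rough illustration of the procedure the abstract outlines, below is a minimal Python sketch of how switch-point-biased self-training over unannotated code-switched text could be organized. The interfaces (`model.predict`, `model.train`, `token_language`) and the confidence threshold and weighting values are illustrative assumptions for this sketch, not the authors' implementation; refer to the linked repository for the actual method.

```python
# Minimal sketch of switch-point-biased self-training for sequence labeling.
# Hypothetical interfaces: `model.predict`, `model.train`, and `token_language`
# stand in for a real tagger (e.g. a char-BERT token classifier); they are
# assumptions made for this sketch, not the paper's implementation.

def token_language(token):
    """Toy language ID: ASCII-only tokens -> 'lang1', otherwise 'lang2'."""
    return "lang1" if token.isascii() else "lang2"

def switch_point_mask(tokens):
    """Mark positions where the language differs from the previous token."""
    langs = [token_language(t) for t in tokens]
    return [i > 0 and langs[i] != langs[i - 1] for i in range(len(tokens))]

def pseudo_label(model_predict, unlabeled_sentences, switch_weight=2.0, threshold=0.9):
    """Tag unannotated CS sentences with the current model, keep only sentences
    whose switch-point predictions are confident, and up-weight tokens at or
    adjacent to switch points."""
    pseudo = []
    for tokens in unlabeled_sentences:
        tags, confidences = model_predict(tokens)  # per-token tags and confidences
        mask = switch_point_mask(tokens)
        weights, keep = [], True
        for i, conf in enumerate(confidences):
            near_switch = mask[i] or (i + 1 < len(mask) and mask[i + 1])
            weights.append(switch_weight if near_switch else 1.0)
            if near_switch and conf < threshold:
                keep = False  # drop sentences with uncertain switch points
        if keep:
            pseudo.append((tokens, tags, weights))
    return pseudo

def self_train(model, gold_data, unlabeled_sentences, rounds=3):
    """Alternate between pseudo-labeling and retraining on gold + pseudo data."""
    for _ in range(rounds):
        pseudo = pseudo_label(model.predict, unlabeled_sentences)
        gold_weighted = [(toks, tags, [1.0] * len(toks)) for toks, tags in gold_data]
        model.train(gold_weighted + pseudo)  # weights scale the per-token loss
    return model
```

One design choice illustrated here is to filter out sentences whose switch-point predictions are uncertain while up-weighting the per-token loss near the retained switch points; other ways of injecting a switch-point bias (for example, a separate switch-point loss term) would fit the same self-training loop.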