BPEmb: Pre-trained Subword Embeddings in 275 Languages (LREC 2018)

Benjamin Heinzerling

2019

BPEmb is a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as a testbed, BPEmb performs competitively with alternative subword approaches, and for some languages better, while requiring vastly fewer resources and no tokenization.
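
To make the idea of BPE subword units concrete, the following is a minimal, illustrative sketch of how BPE merge operations can be learned from raw words and then used to segment text. It is a toy implementation under stated assumptions, not the tooling used to build BPEmb: the function names (learn_bpe, merge_pair, segment) are hypothetical, there is no end-of-word marker, and ties between equally frequent pairs are broken arbitrarily but deterministically.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (toy sketch)."""
    # Each word starts out as a tuple of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word into subword units by applying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["abab", "abab"], num_merges=2)
print(segment("cab", merges))  # ['c', 'ab']
```

Because segmentation starts from characters and applies only learned merges, an unseen word such as "cab" still decomposes into known units, which is why an approach of this kind needs no language-specific tokenizer. Each resulting subword unit would then be looked up in a pre-trained embedding table.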

Keywords:

Encoding (memory)
Tokenization (data security)
Testbed
Artificial intelligence
Computer science
Natural language processing
Information and Computer Science

