Improved Cross-Lingual Document Similarity Measurement

2020 
We present an efficient and effective system that, given a document in a source language, identifies similar documents in a target language. Our experiments use Sinhala and English documents, but the system can be extended to any other languages for which suitable embeddings exist. It improves on the current state of the art in both accuracy and speed. We compiled a corpus of candidate target documents in each of the two languages of interest. For a source document, we compute the distance between it and each document in the corpus using their sentence embeddings. Nearest neighbor retrieval speeds up matching by restricting the set of target documents searched for a given source document. A scoring function and a matching algorithm then pair the identified sentences, and number matching and named entity matching further improve accuracy.
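
The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring function or the matching algorithm, so the mean-pooled document vector, the cosine distance, the greedy one-to-one matching, and the names `match_documents`, `number_overlap`, `k`, and `number_weight` are all assumptions made for illustration. Named entity matching is left out, and the nearest neighbor step is a brute-force scan where a real system would likely use an index.

```python
import math
import re


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean_vector(vectors):
    """Assumed pooling: average the sentence embeddings of one document."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def numbers_in(text):
    """Set of numeric tokens in a text (digits are language independent)."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def number_overlap(src_text, tgt_text):
    """Jaccard overlap of the numbers in both texts; 0.0 if either has none."""
    a, b = numbers_in(src_text), numbers_in(tgt_text)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_documents(sources, targets, k=2, number_weight=0.2):
    """Pair documents; each item is (doc_id, text, sentence_embeddings)."""
    tgt_vecs = [(tid, text, mean_vector(embs)) for tid, text, embs in targets]
    candidates = []
    for sid, stext, sembs in sources:
        svec = mean_vector(sembs)
        # Brute-force nearest neighbor search: keep only the k closest targets.
        scored = [(cosine_distance(svec, tvec), tid, ttext)
                  for tid, ttext, tvec in tgt_vecs]
        scored.sort(key=lambda item: item[0])
        for dist, tid, ttext in scored[:k]:
            # Hypothetical score: similarity plus a bonus for shared numbers.
            score = (1.0 - dist) + number_weight * number_overlap(stext, ttext)
            candidates.append((score, sid, tid))
    # Greedy one-to-one matching: best-scoring pairs are fixed first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    used_sources, used_targets, pairs = set(), set(), []
    for score, sid, tid in candidates:
        if sid not in used_sources and tid not in used_targets:
            pairs.append((sid, tid))
            used_sources.add(sid)
            used_targets.add(tid)
    return pairs
```

Restricting each source document to its `k` nearest targets keeps the candidate list small, which is the speed-up the abstract attributes to nearest neighbor retrieval; the number bonus only re-ranks candidates that already survived that filter.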