A Rescoring Method Using Web Search and Word Vectors for Spoken Term Detection

2019 
We propose a rescoring method for spoken term detection (STD) that uses words related to a query, obtained through Web search and word vectors. In this paper, “words related to the query” are words that are associated with the topic of the speech data and co-occur with the query; we assume that such related words appear multiple times in the speech data. To identify them, we use distributed representations of words obtained by Word2vec [1] [2]. Each word in the word-recognition results of the speech data is first converted into a word vector and then compared with the word vector of the query; words related to the query are determined from the degree of similarity between the two vectors. However, a word vector for an out-of-vocabulary (OOV) query cannot be obtained in this manner, since OOV queries do not appear in the word-recognition results. For such OOV queries, we perform a Web search using the query and extract texts that include it. The recognition results of the speech data and the extracted texts are then combined and used to train Word2vec, which yields a word vector for the OOV query. The method also exploits the distances to all candidates in the document, including words related to the query. Experiments were conducted to evaluate the proposed method on the open test collections of the NTCIR-10 [3] and NTCIR-12 [4] workshops. The proposed method improved retrieval accuracy by 3.2 points in mean average precision.
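The related-word selection step described above can be illustrated with a minimal sketch. It assumes cosine similarity as the "degree of similarity" and a fixed threshold for deciding relatedness; the abstract specifies neither, and the function names, the threshold value, and the three-dimensional toy vectors are all hypothetical stand-ins for vectors that Word2vec would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_u * norm_v)


def related_words(query_vec, word_vecs, threshold=0.7):
    """Return (word, similarity) pairs at or above the threshold, best first.

    The threshold of 0.7 is an illustrative assumption, not a value
    taken from the paper.
    """
    scored = [(w, cosine_similarity(query_vec, v)) for w, v in word_vecs.items()]
    kept = [(w, s) for w, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for Word2vec output on the recognition results.
TOY_VECTORS = {
    "recognition": [0.9, 0.1, 0.0],
    "acoustic": [0.8, 0.3, 0.1],
    "weather": [0.0, 0.2, 0.9],
}
QUERY_VECTOR = [1.0, 0.2, 0.0]  # hypothetical vector for the query term

RELATED = related_words(QUERY_VECTOR, TOY_VECTORS)
for word, score in RELATED:
    print(f"{word}\t{score:.3f}")
```

In this toy setting, "recognition" and "acoustic" point in nearly the same direction as the query and are kept, while "weather" falls well below the threshold. For an OOV query, the query vector would instead come from a model trained on the recognition results combined with the Web-search texts, as the abstract describes.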