Speech Section Extraction Method Using Image and Voice Information

2021 
Meeting minutes support efficient operations and meetings. The labor and time required to take minutes can be reduced by a system that automatically attributes each statement to its speaker. An automatic speaker identification method using lip movements and speech data captured by an omnidirectional camera has been developed. To improve the accuracy of speaker identification, the input data must first be restricted to speech sections. This study proposes a speech section extraction method as a preprocessing step for speaker identification. The proposed method consists of three processes: i) extraction of speaking frames using lip movements, ii) extraction of speaking frames using voices, and iii) discrimination of speech sections by combining these two extraction results. In the lip-movement stage, the threshold is computed automatically from the subject's nose width; because the threshold scales with the nose width, speech sections can be extracted even when the distance between the camera and the subject changes. Finally, 154 speech video recordings (11 sentences spoken by each of 14 subjects) were used to evaluate the usefulness of the method, yielding a high average F-measure of 0.96. The results reveal that the proposed method can extract speech sections even when the camera-to-subject distance varies.
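The three processes above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature names, the threshold values, the ratio of mouth opening to nose width, and the AND-combination rule are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of the three-stage speech-section discrimination.
# All thresholds and the combination rule are assumptions, not the
# authors' exact formulation.

def speaking_by_lips(mouth_heights, nose_widths, ratio_threshold=0.4):
    """Flag frames whose mouth opening, normalized by nose width,
    exceeds a threshold. Normalizing by nose width keeps the threshold
    valid when the camera-to-subject distance changes, since both
    lengths scale together in the image."""
    return [h / w > ratio_threshold for h, w in zip(mouth_heights, nose_widths)]

def speaking_by_voice(frame_energies, energy_threshold=0.01):
    """Flag frames whose short-time audio energy exceeds a threshold."""
    return [e > energy_threshold for e in frame_energies]

def speech_sections(lip_flags, voice_flags):
    """Keep frames where both cues agree, then merge consecutive
    speaking frames into (start_frame, end_frame) sections."""
    speaking = [l and v for l, v in zip(lip_flags, voice_flags)]
    sections, start = [], None
    for i, s in enumerate(speaking):
        if s and start is None:
            start = i
        elif not s and start is not None:
            sections.append((start, i - 1))
            start = None
    if start is not None:
        sections.append((start, len(speaking) - 1))
    return sections

# Toy five-frame example with synthetic feature values.
lips = speaking_by_lips([2, 8, 9, 2, 9], [10, 10, 10, 10, 10])
voice = speaking_by_voice([0.0, 0.5, 0.4, 0.0, 0.3])
print(speech_sections(lips, voice))  # -> [(1, 2), (4, 4)]
```

Requiring both cues to agree suppresses false positives from silent mouth movements or background noise; a system could equally use an OR rule if recall matters more than precision.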