Joint Speaker Diarization and Recognition Using Convolutional and Recurrent Neural Networks

2018 
Speaker diarization (detecting who spoke when, using relative identity labels) and speaker recognition (detecting absolute identity labels without timing) are distinct but related tasks that many scenarios require to be completed simultaneously. Traditional methods, however, address them independently. In this paper, we propose a method to jointly diarize and recognize speakers from a collection of conversations. This method benefits from the sparsity and temporal smoothness of speakers within a conversation and from large-scale timbre modeling across recordings and speakers. Specifically, we employ one convolutional neural network (CNN) to perform segment-level speaker classification and another CNN to detect the probability of a speaker change within a conversation. We then concatenate the outputs of both CNNs and feed them into a recurrent neural network (RNN) for joint speaker diarization and recognition. Experiments on different datasets show promising performance of our proposed approach.
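The pipeline the abstract describes can be sketched at the tensor level: per-segment speaker posteriors from one CNN and per-segment change probabilities from a second CNN are concatenated and passed through an RNN that emits one absolute speaker label per segment. The minimal NumPy sketch below uses random stand-ins for the two CNN outputs and a vanilla RNN with random weights; all dimensions (`T`, `K`, `H`) and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical dimensions (not from the paper): T segments, K known
# speakers, H hidden units in the RNN.
T, K, H = 10, 4, 16
rng = np.random.default_rng(0)

# Stand-ins for the two CNN outputs described in the abstract:
# (1) segment-level speaker posteriors, (2) speaker-change probabilities.
speaker_post = rng.random((T, K))
speaker_post /= speaker_post.sum(axis=1, keepdims=True)  # rows sum to 1
change_prob = rng.random((T, 1))

# Concatenate both CNN outputs per segment, as the abstract describes.
x = np.concatenate([speaker_post, change_prob], axis=1)  # shape (T, K+1)

# A vanilla RNN over the concatenated features (weights random here;
# in the paper they would be trained jointly with the CNNs).
W_xh = rng.standard_normal((K + 1, H)) * 0.1
W_hh = rng.standard_normal((H, H)) * 0.1
W_hy = rng.standard_normal((H, K)) * 0.1

h = np.zeros(H)
labels = []
for t in range(T):
    h = np.tanh(x[t] @ W_xh + h @ W_hh)   # recurrent state update
    y = h @ W_hy                          # per-segment speaker scores
    labels.append(int(np.argmax(y)))      # absolute label per segment

print(labels)  # one speaker label per segment: diarization + recognition
```

The key design point this sketch mirrors is that the RNN sees both identity evidence and change-point evidence at every segment, so its label sequence can stay temporally smooth while still assigning absolute speaker identities.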