Age and Gender Recognition Based on Multiple Systems - Early vs. Late Fusion

Tobias Bocklet,Georg Stemmer,Viktor Zeißler,Elmar Nöth


2010

This paper focuses on the automatic recognition of a person’s age and gender based only on his or her voice. Up to five different systems are compared and combined in different configurations. Three systems model the speaker’s characteristics with Gaussian mixture models in different feature spaces, i.e., MFCC, PLP, and TRAPS. The features of these systems are the concatenated mean vectors. The fourth system uses a physical two-mass vocal model and estimates nine glottal features from voiced speech sections in a data-driven optimization procedure. For each utterance, the minimum, maximum, and mean vectors form a 27-dimensional feature vector. The last system calculates a 219-dimensional prosodic feature set for each utterance based on voiced and unvoiced speech segments. We compare two ways of fusing the systems. In the first, we concatenate the systems at the feature level (early fusion). In the second, we combine them at the score level by multi-class logistic regression (late fusion). Although there are only minor differences between the two approaches, late fusion is slightly superior. On the development set of the Interspeech Agender challenge we achieved an unweighted recall of 46.1% with early fusion and 47.8% with late fusion.

Index Terms: acoustic analysis, classification, Gaussian mixture models
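The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, the fusion weights are assumed to be already trained, and the softmax form is simply the standard multi-class logistic regression model. Unweighted recall is taken here as the mean of the per-class recalls.

```python
import math


def utterance_stats(frames):
    """Stack the per-dimension minimum, maximum and mean of frame-level features.

    With 9 glottal features per frame this yields a 27-dimensional vector.
    """
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    mins = [min(d) for d in dims]
    maxs = [max(d) for d in dims]
    means = [sum(d) / len(d) for d in dims]
    return mins + maxs + means


def early_fusion(system_vectors):
    """Feature-level fusion: concatenate the per-system feature vectors."""
    fused = []
    for vec in system_vectors:
        fused.extend(vec)
    return fused


def late_fusion(system_scores, weights, bias):
    """Score-level fusion: softmax over a linear map of the stacked scores.

    weights[c] and bias[c] are hypothetical parameters for class c,
    assumed to have been trained beforehand.
    """
    stacked = early_fusion(system_scores)
    logits = [
        sum(w * s for w, s in zip(weights[c], stacked)) + bias[c]
        for c in range(len(bias))
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unweighted_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```

Because each class contributes equally to the unweighted recall, a classifier that always predicts the majority class scores poorly on this metric even when its plain accuracy looks high.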
