    A variety of automated classification approaches have been developed to extract species detection information from large bioacoustic datasets. Convolutional neural networks (CNNs) are an image classification technique that can be applied to the spectrogram of an audio recording. Using CNNs for bioacoustic classification negates the need for sophisticated feature extraction techniques; however, CNNs may be sensitive to the parameters used to create spectrograms. We used AlexNet to classify spectrograms of audio clips of birdsong from 19 species. We trained and tested AlexNet with the spectrograms and observed that mean classification accuracy ranged from 88.9% to 96.9% depending on the parameters used to create the spectrogram. Classification accuracy was highest when we used a composite of four spectrograms with different combinations of scales for frequency and amplitude. Classification accuracy also varied depending on the FFT window size of the spectrogram. Overall, our results suggest that optimal spectrogram parameters for CNN classification may differ from those used for human visualization or other classification approaches. We suggest that if spectrogram parameters are appropriately selected, classification accuracy similar to current state-of-the-art methods can be achieved using off-the-shelf software and without the need to extract domain-specific features.
    Spectrogram
    Bioacoustics
    Feature (linguistics)
    Harmonic
    Line (geometry)
    SIGNAL (programming language)
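    As a rough illustration of the spectrogram parameters discussed in the abstract above, the following sketch computes magnitude spectrograms of a synthetic chirp at several FFT window sizes and on linear and logarithmic amplitude scales. It is not the authors' pipeline: the sample rate, the test signal, the window sizes, and the helper names `stft_magnitude` and `to_decibels` are all assumptions made for illustration, and rescaling of the frequency axis is omitted.

```python
import numpy as np

def stft_magnitude(signal, n_fft, hop):
    """Magnitude STFT with a Hann window; shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def to_decibels(magnitude, floor_db=-80.0):
    """Log amplitude scale relative to the maximum, clipped at floor_db."""
    db = 20.0 * np.log10(np.maximum(magnitude, 1e-12) / magnitude.max())
    return np.maximum(db, floor_db)

sample_rate = 22050                       # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate  # one second of audio
# Synthetic stand-in for a birdsong clip: a chirp rising from 2 kHz to 6 kHz.
chirp = np.sin(2 * np.pi * (2000 * t + 2000 * t ** 2))

spectrograms = {}
for n_fft in (128, 512, 2048):            # assumed window sizes
    magnitude = stft_magnitude(chirp, n_fft, hop=n_fft // 2)
    spectrograms[n_fft] = {
        "linear": magnitude / magnitude.max(),  # linear amplitude in [0, 1]
        "log": to_decibels(magnitude),          # decibels in [-80, 0]
    }
    print(n_fft, magnitude.shape)
```

    Short windows produce many frames with few frequency bins, and long windows produce the opposite, so the same clip yields visibly different images depending on the window size.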
    It has been widely recognized that the FFT-based spectrogram does not provide good simultaneous resolution in both the time and frequency domains. A new method of spectral analysis has been developed based upon the Gabor expansion and the Wigner–Ville distribution. The resolution of the Gabor spectrogram is twice as high as that of an FFT-based spectrogram. In this report, FFT-based spectrograms and Gabor spectrograms are compared for 5 English vowels, 6 stop consonants, 4 fricatives, and vowel formant transitions in a CVD context in 6 normal subjects. Results demonstrate that the Gabor spectrogram is a promising alternative to the FFT-based spectrogram in speech analysis because of its higher temporal and frequency resolution.
    Spectrogram
    Gabor transform
    Frequency analysis
    Citations (0)
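    The resolution limit of the FFT-based spectrogram mentioned in the abstract above can be stated compactly. The symbols are assumptions introduced here, not notation from the paper: $f_s$ is the sample rate, $N$ the window length in samples, and $\sigma_t$, $\sigma_f$ the RMS duration and RMS bandwidth (in Hz) of the analysis window.

```latex
\Delta f = \frac{f_s}{N}, \qquad
\Delta t = \frac{N}{f_s}, \qquad
\Delta t \, \Delta f = 1, \qquad
\sigma_t \, \sigma_f \ge \frac{1}{4\pi}
```

    For example, at $f_s = 16\,\text{kHz}$ a window of $N = 256$ samples gives $\Delta f = 62.5$ Hz and $\Delta t = 16$ ms, while $N = 1024$ gives $\Delta f = 15.625$ Hz and $\Delta t = 64$ ms: a longer window sharpens frequency resolution only at the cost of time resolution.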
    Singing voice recognition is a difficult topic in music information retrieval research. The first approaches borrowed successful techniques widely used in automatic speech recognition (ASR), as speech and singing share similar acoustical features because they are produced by the same apparatus. Moving from monophonic to polyphonic audio signals, the problem becomes more complex, as the background instrumental accompaniment is regarded as a noise source that has to be attenuated. This paper proposes a singing voice recognition algorithm that is able to automatically recognize the words in a singing signal with background music by using the concept of spectrogram pattern matching. The main idea is to apply both the spectrogram and image processing methods to solve the problem of singing voice recognition. Each signal with musical accompaniment is analyzed and converted to its spectrogram, which is used as training data for the classifier. Several classification functions are compared, such as the Fisher classifier and feed-forward [...] can effectively recognize words in music with an accuracy rate of more than 84%.
    Spectrogram
    Feature (linguistics)
    Citations (18)
    Spectrograms visualise the time-frequency content of a signal. They are commonly used to analyse animal vocalisations. Here, we analyse how far we can deduce the mechanical origin of sound generation and modulation from the spectrogram. We investigate the relationship between simple mathematical events such as transients, harmonics, amplitude- and frequency modulation and the resulting structures in spectrograms. This approach yields not only convenient statistical description, but also aids in formulating hypotheses about the underlying mathematical mechanisms. We then discuss to what extent it is possible to invert our analysis and relate structures in spectrograms back to the underlying mathematical and mechanical events using two exemplary approaches: (a) we analyse the spectrogram of a vocalisation of the Bearded Vulture and postulate hypotheses on the mathematical origin of the signal. Furthermore, we synthesise the signal using the simple mathematical principles presented earlier; (b) we use a simple mechanical model to generate sounds and relate experimentally observed mechanical events to characteristics of the spectrogram. We conclude that although knowledge of sound producing systems increases the explanatory power of a spectrogram, a spectrogram per se cannot present unambiguous evidence about the underlying mechanical origin of the sound signal. Keywords: Bioacoustics; biomechanics; Fourier analysis; Gypaetus barbatus; Duffing equation
    Spectrogram
    Bioacoustics
    SIGNAL (programming language)
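    One of the simple mathematical events named in the abstract above, amplitude modulation, leaves a recognisable trace in the spectrum: a carrier flanked by two sidebands offset by the modulation frequency. The sketch below illustrates that general relationship only and is not taken from the paper; the sample rate, carrier frequency, modulation frequency, and modulation depth are arbitrary assumed values.

```python
import numpy as np

sample_rate = 8000                        # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate  # exactly one second, so bins are 1 Hz apart
carrier_hz, mod_hz, depth = 1000.0, 50.0, 0.5  # assumed signal parameters

# Amplitude-modulated tone: (1 + m cos(2 pi f_m t)) cos(2 pi f_c t)
am_tone = (1.0 + depth * np.cos(2 * np.pi * mod_hz * t)) * np.cos(2 * np.pi * carrier_hz * t)

# One-sided amplitude spectrum, scaled so a unit cosine reads as 1.0.
spectrum = np.abs(np.fft.rfft(am_tone)) / (len(am_tone) / 2)
freqs = np.fft.rfftfreq(len(am_tone), d=1.0 / sample_rate)

# Components well above the numerical noise floor.
peaks = np.round(freqs[spectrum > 0.1]).astype(int).tolist()
print(peaks)  # [950, 1000, 1050]
```

    The sidebands at 950 Hz and 1050 Hz each carry an amplitude of depth / 2. The same three-line pattern could in principle also arise from three independent tones, which is in keeping with the abstract's caution that a spectrogram alone does not identify the mechanism unambiguously.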
    Anomalous Sound Detection (ASD) aims to identify whether the sound emitted from a machine is anomalous or not. Most advanced methods use 2-D CNNs to extract features of normal sounds from log-mel spectrograms for ASD. However, these methods cannot fully exploit temporal information of log-mel spectrograms, resulting in poor performance on some machine types. In this paper, we propose a new framework for ASD named Spectrogram-Wavegram WaveNet (SW-WaveNet), which segments the 2-D log-mel spectrogram into 1-D waveform signals of different frequency bands and combines the representation vector extracted by WaveNet from segmented log-mel spectrograms and Wavegrams, respectively. The proposed framework utilizes WaveNet's powerful capability of modeling waveform signals to effectively extract temporal information from log-mel spectrograms and Wavegrams. Experiments on the DCASE 2020 Challenge Task 2 dataset show that our framework achieves higher average AUC scores (93.25%) and pAUC scores (87.41%) than the previous works.
    Spectrogram
    Representation
    In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
    Spectrogram
    Discriminative model
    Speaker diarisation
    SIGNAL (programming language)
    Citations (35)
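    The masking step described in the abstract above can be sketched as an element-wise product between a mask with values in [0, 1] and the noisy spectrogram. In the paper the mask is produced by a neural network conditioned on a speaker embedding; the sketch below instead uses an oracle ratio mask computed from known sources, purely as a stand-in. The random spectrograms, the array sizes, and the assumption that magnitudes add are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_frames = 129, 50  # assumed spectrogram size

# Random stand-ins for the magnitude spectrograms of two speakers.
target = rng.random((n_bins, n_frames))
interferer = rng.random((n_bins, n_frames))

# Assumption: magnitudes add in the mixture (only approximately true for real audio).
noisy = target + interferer

# Oracle ratio mask in [0, 1]; a trained masking network would predict this instead.
mask = target / (target + interferer + 1e-8)

# Masking is an element-wise product with the noisy spectrogram.
enhanced = mask * noisy

error_before = np.mean((noisy - target) ** 2)
error_after = np.mean((enhanced - target) ** 2)
print(f"MSE before masking: {error_before:.4f}, after masking: {error_after:.2e}")
```

    Because the mask here is computed from the true sources, the error after masking is close to zero; a learned mask would only approximate this.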
    While established methods for imaging the time-frequency content of speech, such as the spectrogram, have frequently been christened “voiceprinting,” it is well-known that it and other currently popular imaging techniques cannot identify an individual’s voice to more than a suggestive extent. The reassigned spectrogram (also known by other names) is a relatively little-known method [S. A. Fulop and K. Fitz, “Algorithms for computing the time-corrected instantaneous frequency (reassigned) spectrogram, with applications,” J. Acoust. Soc. Am. 119, 360–377 (2006)] for imaging the time-frequency spectral information contained in a signal, which is able to show the instantaneous frequencies of signal components as well as the occurrence of impulses with dramatically increased precision compared to the spectrogram (magnitude of the short-time Fourier transform) or any other energy density time-frequency representation. It is shown here that it is possible to obtain a reassigned spectrogram image from a person’s voice that appears to be sufficiently individuating and consistent so as to serve as a true voiceprint identifying that person and excluding all other persons to a high degree of confidence. This is achieved by focusing on just a few phonatory pulsations, thereby revealing the vocal-fold vibrational signature unique to each person.
    Spectrogram
    SIGNAL (programming language)
    Instantaneous phase
    Representation
    Signature (topology)
    Citations (0)
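    To give a feel for why reassignment sharpens a spectrogram, the sketch below estimates the instantaneous frequency in one STFT bin from the phase advance between two frames taken one sample apart. This is only the frequency-correction half of the idea, in a simplified finite-difference form, and should not be read as the algorithm of the cited paper; the time correction is omitted, and the sample rate, window length, and test tone are assumed values.

```python
import numpy as np

sample_rate = 8000  # assumed sample rate in Hz
n_fft = 256         # assumed window length; bin spacing is 31.25 Hz
tone_hz = 1240.0    # test tone that falls between bin centres

t = np.arange(2 * n_fft) / sample_rate
tone = np.cos(2 * np.pi * tone_hz * t)

window = np.hanning(n_fft)
start = 100
# Two STFT frames whose start positions differ by exactly one sample.
frame_a = np.fft.rfft(tone[start : start + n_fft] * window)
frame_b = np.fft.rfft(tone[start + 1 : start + 1 + n_fft] * window)

# Phase advance per sample in each bin, converted to Hz.
phase_step = np.angle(frame_b * np.conj(frame_a))
inst_freq_hz = phase_step * sample_rate / (2 * np.pi)

peak_bin = int(np.argmax(np.abs(frame_a)))
bin_centre_hz = peak_bin * sample_rate / n_fft
print(f"bin centre: {bin_centre_hz:.2f} Hz, phase-based estimate: {inst_freq_hz[peak_bin]:.2f} Hz")
```

    The ordinary spectrogram can only place the tone at the nearest bin centre, whereas the phase-based estimate recovers a value much closer to the true frequency. Reassignment methods move the energy to such corrected coordinates.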