We present the task description of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge Task 2: "First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring". The main goal is to enable rapid deployment of ASD systems for new kinds of machines without the need for hyperparameter tuning. In past ASD tasks, the developed methods tuned hyperparameters for each machine type because the development and evaluation datasets shared the same machine types. In practice, however, collecting both normal and anomalous data for such a development dataset can be infeasible. In the 2023 Task 2, we focus on solving the first-shot problem, i.e., the challenge of training a model for a completely novel machine type. Specifically, (i) each machine type has only one section (a subset of the data for a machine type), and (ii) the machine types in the development and evaluation datasets are completely different. Analysis of 86 submissions from 23 teams revealed that the keys to outperforming the baselines were: 1) sampling techniques for dealing with class imbalances across different domains and attributes, 2) generation of synthetic samples for robust detection, and 3) use of multiple large pre-trained models to extract meaningful embeddings for the anomaly detector.
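As a concrete illustration of the third point only (not any particular team's submission), the following minimal sketch scores each test clip by the cosine distance between its embedding and the nearest normal training embedding; the `embed()` helper is a hypothetical placeholder for a pre-trained audio model.

```python
# Illustrative sketch: embedding-based anomaly scoring with a nearest-neighbor detector.
# `embed()` is a hypothetical stand-in for a large pre-trained audio model.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def embed(wav_paths):
    """Placeholder: return one embedding vector per clip (random here for illustration)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(wav_paths), 768))  # e.g., 768-dim embeddings


train_emb = embed([f"train_{i:04d}.wav" for i in range(990)])  # normal training clips only
test_emb = embed([f"test_{i:04d}.wav" for i in range(200)])

knn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(train_emb)
anomaly_scores, _ = knn.kneighbors(test_emb)  # larger distance = more anomalous
print(anomaly_scores.ravel()[:5])
```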
Description
This data is the ground truth for the "evaluation dataset" of the DCASE 2021 Challenge Task 2 "Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions". In the task, three datasets were released: "development dataset", "additional training dataset", and "evaluation dataset". The evaluation dataset was the last of the three to be released and includes around 200 samples for each machine type, section index, and domain, none of which carry a condition label (i.e., normal or anomaly). This ground truth dataset contains those condition labels.

Data format
The CSV file for each machine type, section index, and domain lists the ground truth as follows:
---------------------------------
section_03_source_test_0000.wav,1
section_03_source_test_0001.wav,1
...
section_03_source_test_0198.wav,0
section_03_source_test_0199.wav,1
---------------------------------
The first column is the name of a wave file, and the second column is the condition label (0: normal, 1: anomaly).

How to use
A script for calculating the AUC, pAUC, precision, recall, and F1 scores on the "evaluation dataset" is available on the GitHub repository [URL]. The script uses this ground truth data. For more information, please see the GitHub repository.

Conditions of use
This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Publication
If you use this dataset, please cite all of the following three papers:
Yohei Kawaguchi, Keisuke Imoto, Yuma Koizumi, Noboru Harada, Daisuke Niizumi, Kota Dohi, Ryo Tanabe, Harsh Purohit, and Takashi Endo, "Description and Discussion on DCASE 2021 Challenge Task 2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions," in arXiv e-prints: 2106.04492, 2021. [URL]
Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito, "ToyADMOS2: Another Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection under Domain Shift Conditions," in arXiv e-prints: 2106.02369, 2021. [URL]
Ryo Tanabe, Harsh Purohit, Kota Dohi, Takashi Endo, Yuki Nikaido, Toshiki Nakamura, and Yohei Kawaguchi, "MIMII DUE: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection with Domain Shifts due to Changes in Operational and Environmental Conditions," in arXiv e-prints: 2105.02702, 2021. [URL]

Feedback
If there is any problem, please contact us:
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp
Keisuke Imoto, keisuke.imoto@ieee.org
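For reference, the following minimal sketch (not the official evaluation script, which remains the reference implementation on the GitHub repository) shows how the ground truth labels can be paired with a system's anomaly scores to compute AUC and a partial AUC; the inline CSV excerpt, labels, and score values are made up for illustration.

```python
# Minimal sketch of scoring against a ground-truth CSV (not the official scorer).
import csv
import io

from sklearn.metrics import roc_auc_score

# hypothetical anomaly scores produced by a system, one per test clip
scores = {"section_03_source_test_0000.wav": 0.87,
          "section_03_source_test_0001.wav": 0.12,
          "section_03_source_test_0198.wav": 0.05,
          "section_03_source_test_0199.wav": 0.91}

# in practice, open the ground-truth CSV file; an inline excerpt with made-up labels is used here
ground_truth_csv = io.StringIO(
    "section_03_source_test_0000.wav,1\n"
    "section_03_source_test_0001.wav,0\n"
    "section_03_source_test_0198.wav,0\n"
    "section_03_source_test_0199.wav,1\n")

y_true, y_score = [], []
for filename, label in csv.reader(ground_truth_csv):
    y_true.append(int(label))          # 0: normal, 1: anomaly
    y_score.append(scores[filename])

auc = roc_auc_score(y_true, y_score)
pauc = roc_auc_score(y_true, y_score, max_fpr=0.1)  # partial AUC over low false-positive rates
print(f"AUC={auc:.4f}  pAUC={pauc:.4f}")
```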
Description
This dataset is the "evaluation dataset" for the DCASE 2021 Challenge Task 2 "Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions". In the task, three datasets have been released: "development dataset", "additional training dataset", and "evaluation dataset". This evaluation dataset was the last of the three to be released. It includes around 200 samples for each machine type, section index, and domain, none of which have a condition label (i.e., normal or anomaly). The recording procedure and data format are the same as for the development dataset and the additional training dataset. The section indices in this dataset are the same as those in the additional training dataset. For more information, please see the pages of the development dataset and the task description. After the DCASE 2021 Challenge, we released the ground truth for this evaluation dataset.

Directory structure
Once you unzip the files downloaded from Zenodo, you will see the following directory structure. The machine type is given by the directory name, and the section index, domain, and condition are given by the file name:
/eval_data
    /fan
        /source_test (normal and anomaly data are included, but they do not have a condition label)
            /section_03_source_test_0000.wav
            ...
            /section_03_source_test_0199.wav
            /section_04_source_test_0000.wav
            ...
            /section_05_source_test_0199.wav
        /target_test (normal and anomaly data are included, but they do not have a condition label)
            /section_03_target_test_0000.wav
            ...
            /section_03_target_test_0199.wav
            /section_04_target_test_0000.wav
            ...
            /section_05_target_test_0199.wav
    /gearbox (The other machine types have the same directory structure as fan.)
    /pump
    /slider
    /ToyCar
    /ToyTrain
    /valve

The paths of the audio files are:
"/eval_data/<machine_type>/source_test/section_[0-9]+_source_test_[0-9]+.wav"
"/eval_data/<machine_type>/target_test/section_[0-9]+_target_test_[0-9]+.wav"
For example, the machine type, section, and domain of "/fan/source_test/section_03_source_test_0018.wav" are "fan", "section 03", and "source", respectively.

Baseline system
Two simple baseline systems are available on the GitHub repositories [URL] and [URL]. The baseline systems provide a simple entry-level approach that gives reasonable performance on the Task 2 dataset. They are good starting points, especially for entry-level researchers who want to become familiar with the anomalous-sound-detection task.

Conditions of use
This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Publication
If you use this dataset, please cite all of the following three papers:
Yohei Kawaguchi, Keisuke Imoto, Yuma Koizumi, Noboru Harada, Daisuke Niizumi, Kota Dohi, Ryo Tanabe, Harsh Purohit, and Takashi Endo, "Description and Discussion on DCASE 2021 Challenge Task 2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions," in arXiv e-prints: 2106.04492, 2021. [URL]
Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito, "ToyADMOS2: Another Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection under Domain Shift Conditions," in arXiv e-prints: 2106.02369, 2021. [URL]
Ryo Tanabe, Harsh Purohit, Kota Dohi, Takashi Endo, Yuki Nikaido, Toshiki Nakamura, and Yohei Kawaguchi, "MIMII DUE: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection with Domain Shifts due to Changes in Operational and Environmental Conditions," in arXiv e-prints: 2105.02702, 2021. [URL]

Feedback
If there is any problem, please contact us:
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp
Keisuke Imoto, keisuke.imoto@ieee.org
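As an illustration of the naming convention above, the following small sketch parses the machine type, section index, and domain from an evaluation-dataset path; the helper name is ours and is not part of any official toolkit.

```python
# Illustrative parser for evaluation-dataset paths (helper name is hypothetical).
import re
from pathlib import Path

PATTERN = re.compile(r"section_(\d+)_(source|target)_test_\d+\.wav")


def parse_eval_path(path: str):
    p = Path(path)
    machine_type = p.parts[-3]  # e.g., "fan" in /eval_data/fan/source_test/...
    m = PATTERN.match(p.name)
    if m is None:
        raise ValueError(f"unexpected file name: {p.name}")
    section, domain = m.group(1), m.group(2)
    return machine_type, section, domain


print(parse_eval_path("/eval_data/fan/source_test/section_03_source_test_0018.wav"))
# ('fan', '03', 'source')
```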
This paper provides a baseline system for first-shot-compliant unsupervised anomalous sound detection (ASD) for machine condition monitoring. First-shot ASD does not allow systems to perform machine-type-dependent hyperparameter tuning or tool ensembling based on performance metrics calculated with the ground truth. To show benchmark performance for first-shot ASD, this paper proposes an anomalous sound detection system that works on the domain generalization task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge Task 2: "Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Applying Domain Generalization Techniques" while complying with the first-shot requirements introduced in the DCASE 2023 Challenge Task 2 (DCASE2023T2). A simple autoencoder-based implementation combined with a selective Mahalanobis metric is implemented as the baseline system. A performance evaluation is conducted to set the target benchmark for the forthcoming DCASE2023T2. Source code of the baseline system will be available on GitHub: https://github.com/nttcslab/dcase2023_task2_baseline_ae .
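As a rough illustration only, the following sketch shows one plausible reading of the "selective Mahalanobis" scoring: covariances of the autoencoder's reconstruction-error vectors are fitted separately on source- and target-domain normal training data, and a test clip is scored by the smaller of the two Mahalanobis distances. The `reconstruction_error()` helper is a placeholder; the official repository is the authoritative implementation.

```python
# Minimal sketch of one plausible "selective Mahalanobis" scoring scheme (assumption,
# not the official baseline code). reconstruction_error() stands in for a trained AE.
import numpy as np
from scipy.spatial.distance import mahalanobis


def reconstruction_error(clips):
    """Placeholder: per-clip reconstruction-error vector |x - AE(x)| (random here)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(clips), 128))


def fit_domain(err):
    """Mean and (pseudo-)inverse covariance of one domain's error vectors."""
    mu = err.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(err, rowvar=False))
    return mu, cov_inv


src_err = reconstruction_error(range(990))  # source-domain normal training clips
tgt_err = reconstruction_error(range(10))   # target-domain normal training clips
src_mu, src_ci = fit_domain(src_err)
tgt_mu, tgt_ci = fit_domain(tgt_err)

test_err = reconstruction_error(range(1))[0]
score = min(mahalanobis(test_err, src_mu, src_ci),
            mahalanobis(test_err, tgt_mu, tgt_ci))  # "selective": take the smaller distance
print(score)
```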
Description
This dataset is the "development dataset" for the DCASE 2022 Challenge Task 2 "Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Applying Domain Generalization Techniques". The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel, 10-second audio clip that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:
Fan
Gearbox
Bearing
Slide rail
ToyCar
ToyTrain
Valve

Overview of the task
Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial intelligence (AI)-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines. This task is the follow-up to DCASE 2020 Task 2 and DCASE 2021 Task 2. The task this year is to detect anomalous sounds under three main conditions:
1. Only normal sound clips are provided as training data (i.e., unsupervised learning scenario). In real-world factories, anomalies rarely occur and are highly diverse. Therefore, exhaustive patterns of anomalous sounds are impossible to create or collect, and unknown anomalous sounds that were not observed in the given training data must be detected. This condition is the same as in DCASE 2020 Task 2 and DCASE 2021 Task 2.
2. Factors other than anomalies change the acoustic characteristics between training and test data (i.e., domain shift). In real-world cases, the operational conditions of machines or the environmental noise often differ between the training and testing phases. For example, the operation speed of a conveyor can change due to seasonal demand, or environmental noise can fluctuate depending on the states of surrounding machines. This condition is the same as in DCASE 2021 Task 2.
3. In the test data, samples unaffected by domain shifts (source domain data) and those affected by domain shifts (target domain data) are mixed, and the source/target domain of each sample is not specified. Therefore, the model must detect anomalies regardless of the domain (i.e., domain generalization).

Definition
We first define key terms in this task: "machine type", "section", "source domain", "target domain", and "attributes". "Machine type" indicates the kind of machine, which in this task is one of seven: fan, gearbox, bearing, slide rail, valve, ToyCar, and ToyTrain. A "section" is defined as a subset of the dataset for calculating performance metrics; each section is dedicated to a specific type of domain shift. The "source domain" is the domain under which most of the training data and part of the test data were recorded, and the "target domain" is a different set of domains under which a few of the training data and part of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, SNR, etc. "Attributes" are parameters that define states of machines or types of noise.

Dataset
This dataset consists of three sections for each machine type (Sections 00, 01, and 02), and each section is a complete set of training and test data.
For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training, and (iii) 100 clips each of normal and anomalous sounds for the test. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.

File names and attribute csv files
File names and attribute csv files provide reference labels for each clip. The given reference labels for each training/test clip include the machine type, section index, normal/anomaly information, and attributes regarding conditions other than normal/anomaly. The machine type is given by the directory name. The section index is given by the respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are also given by the respective file names. Attribute csv files allow easy access to the attributes that cause domain shifts. In these files, the file names, the names of the parameters that cause domain shifts (domain shift parameter, dp), and the values or types of these parameters (domain shift value, dv) are listed. Each row takes the following format:
[filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...

Recording procedure
Normal/anomalous operating sounds of machines and their related equipment were recorded. Anomalous sounds were collected by deliberately damaging the target machines. To simplify the task, we use only the first channel of the multi-channel recordings; all recordings are regarded as single-channel recordings from a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers explaining the details of the recording procedure by the submission deadline.

Directory structure
- /dev_data
    - /fan
        - /train (only normal clips)
            - /section_00_source_train_normal_0000_<attribute>.wav
            - ...
            - /section_00_source_train_normal_0989_<attribute>.wav
            - /section_00_target_train_normal_0000_<attribute>.wav
            - ...
            - /section_00_target_train_normal_0009_<attribute>.wav
            - /section_01_source_train_normal_0000_<attribute>.wav
            - ...
            - /section_02_target_train_normal_0009_<attribute>.wav
        - /test
            - /section_00_source_test_normal_0000_<attribute>.wav
            - ...
            - /section_00_source_test_normal_0049_<attribute>.wav
            - /section_00_source_test_anomaly_0000_<attribute>.wav
            - ...
            - /section_00_source_test_anomaly_0049_<attribute>.wav
            - /section_00_target_test_normal_0000_<attribute>.wav
            - ...
            - /section_00_target_test_normal_0049_<attribute>.wav
            - /section_00_target_test_anomaly_0000_<attribute>.wav
            - ...
            - /section_00_target_test_anomaly_0049_<attribute>.wav
            - /section_01_source_test_normal_0000_<attribute>.wav
            - ...
            - /section_02_target_test_anomaly_0049_<attribute>.wav
        - attributes_00.csv (attribute csv for section 00)
        - attributes_01.csv (attribute csv for section 01)
        - attributes_02.csv (attribute csv for section 02)
    - /gearbox (The other machine types have the same directory structure as fan.)
    - /bearing
    - /slider (`slider` means "slide rail")
    - /ToyCar
    - /ToyTrain
    - /valve
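For illustration, the following small sketch reads attribute rows in the format described above into a dictionary keyed by file name; the parameter names and values in the inline example are made up, and in practice you would open a file such as attributes_00.csv.

```python
# Illustrative reader for attribute csv rows: filename, d1p, d1v, d2p, d2v, ...
import csv
import io

# inline row with made-up parameter names/values; in practice, open the attribute csv file
attribute_csv = io.StringIO(
    "section_00_source_train_normal_0000_<attribute>.wav,speed,1.5,noise,typeA\n")

attributes = {}
for row in csv.reader(attribute_csv):
    filename, rest = row[0], row[1:]
    # pair up the remaining cells: (d1p, d1v), (d2p, d2v), ...
    attributes[filename] = dict(zip(rest[0::2], rest[1::2]))

print(attributes)
# {'section_00_source_train_normal_0000_<attribute>.wav': {'speed': '1.5', 'noise': 'typeA'}}
```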
Baseline system
Two baseline systems are available on the GitHub repositories baseline_ae and baseline_mobile_net_v2. The baseline systems provide a simple entry-level approach that gives reasonable performance on the Task 2 dataset. They are good starting points, especially for entry-level researchers who want to become familiar with the anomalous-sound-detection task.

Conditions of use
This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Citation
If you use this dataset, please cite all of the following three papers:
Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Takashi Endo, Masaaki Yamamoto, and Yohei Kawaguchi, "Description and Discussion on DCASE 2022 Challenge Task 2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Applying Domain Generalization Techniques," in arXiv e-prints: 2206.05876, 2022. [URL]
Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi, "MIMII DG: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection for Domain Generalization Task," in arXiv e-prints: 2205.13879, 2022. [URL]
Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito, "ToyADMOS2: Another Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection under Domain Shift Conditions," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 1–5, Barcelona, Spain, November 2021. [URL]

Contact
If there is any problem, please contact us:
Kota Dohi, kota.dohi.gr@hitachi.com
Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
Keisuke Imoto, keisuke.imoto@ieee.org
Many application studies rely on audio DNN models pre-trained on a large-scale dataset as essential feature extractors, extracting features from the final layers. In this study, we focus on our finding that the middle-layer features of existing supervised pre-trained models are more effective than the late-layer features for some tasks. We propose a simple approach to composing features effective for general-purpose applications, consisting of two steps: (1) calculating feature vectors along the time frame from middle/late layer outputs, and (2) fusing them. This approach improves the utility of frequency and channel information in downstream processes and combines the effectiveness of middle- and late-layer features for different tasks. As a result, the feature vectors become effective for general purposes. In experiments using VGGish, PANNs' CNN14, and AST on nine downstream tasks, we first show that each layer output of these models serves different tasks. We then demonstrate that the proposed approach significantly improves their performance and brings it to a level comparable to that of the state of the art. In particular, the performance on the non-semantic speech (NOSS) tasks greatly improves, especially on Speech Commands V2 with VGGish, with a gain of +77.1 points (from 14.3% to 91.4%).
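The following PyTorch sketch illustrates the two steps with a toy CNN standing in for VGGish/PANNs/AST; the hooked layers and the statistics used here are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch: capture middle/late layer outputs with forward hooks, turn each
# into a feature vector via statistics over time frames, then fuse by concatenation.
import torch
import torch.nn as nn

net = nn.Sequential(  # toy stand-in backbone operating on a log-mel-like input
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
)

taps = {}  # layer index -> captured output
for idx in (4, 7):  # a "middle" and a "late" layer (illustrative choice)
    net[idx].register_forward_hook(lambda m, i, o, idx=idx: taps.__setitem__(idx, o))

x = torch.randn(4, 1, 96, 64)  # (batch, 1, time frames, mel bins)
net(x)


def to_vector(feat):
    """(B, C, T, F) -> (B, 2*C*F): mean and std over the time-frame axis."""
    b, c, t, f = feat.shape
    flat = feat.permute(0, 2, 1, 3).reshape(b, t, c * f)
    return torch.cat([flat.mean(dim=1), flat.std(dim=1)], dim=-1)


fused = torch.cat([to_vector(taps[4]), to_vector(taps[7])], dim=-1)
print(fused.shape)  # torch.Size([4, 8192])
```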
It is challenging to deploy Transformer-based audio classification models on common terminal devices in real situations due to their high computational costs, which increases the importance of transferring knowledge from a larger Transformer-based model to a smaller convolutional neural network (CNN)-based model via knowledge distillation (KD). Since an audio spectrogram can be regarded as an image, several studies have used image-based models with CNN-based structures as the smaller model for KD. However, the physical meaning of a spectrogram differs from that of an image in general, so an image-based model may not effectively extract features from a raw spectrogram. Thus, enhancing the spectrogram can help these models perform better on audio classification tasks. To act on this hypothesis, we propose a new Time-Frequency Enhancer (TFE), which is designed to learn how to enhance input spectrograms so that they become more effective for audio classification. In addition, we propose TFE-ENV2, which extends EfficientNetV2 (ENV2), an image-based backbone model, with the TFE. To verify the effectiveness of the proposed method, we compare the performance of the original ENV2 and the proposed TFE-ENV2. In our experiments, the proposed TFE-ENV2 outperformed the original ENV2 on the ESC-50 and Speech Commands V2 datasets, demonstrating that the proposed TFE enhances spectrograms in a way that assists image-based models in audio classification.
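The following schematic PyTorch sketch conveys the general idea only: a small learnable front-end re-weights the input spectrogram before an image-based backbone. The actual TFE architecture and its combination with EfficientNetV2 are as described in the paper, not as in this code; the module and layer sizes below are hypothetical.

```python
# Schematic sketch (assumption, not the paper's TFE): a learnable per-bin gain map
# applied to the spectrogram before a stand-in image-based classifier.
import torch
import torch.nn as nn


class SpectrogramEnhancer(nn.Module):
    """Hypothetical enhancer: predicts a gain map and re-weights each time-frequency bin."""

    def __init__(self, channels=8):
        super().__init__()
        self.gain = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, spec):  # spec: (B, 1, freq, time)
        return spec * (1.0 + self.gain(spec))  # residual re-weighting of the input


enhancer = SpectrogramEnhancer()
backbone = nn.Sequential(  # stand-in for an image-based classifier such as ENV2
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, 50),  # e.g., 50 classes as in ESC-50
)
logits = backbone(enhancer(torch.randn(2, 1, 128, 431)))
print(logits.shape)  # torch.Size([2, 50])
```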
Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. To recognize sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations. To serve the diverse needs of tasks such as recognition of emotions or music genres, representations should provide multiple aspects of information, such as local and global features. To implement our principle, we propose a self-supervised learning method: Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced “viola”). BYOL-A pre-trains representations of the input sound that are invariant to audio data augmentations, which makes the learned representations robust to perturbations of sounds. In addition, the BYOL-A encoder combines local and global features and calculates their statistics so that the representation provides multi-aspect information. As a result, the learned representations should provide robust and multi-aspect information that serves the diverse needs of various tasks. We evaluated the general audio task performance of BYOL-A against previous state-of-the-art methods, and BYOL-A demonstrated generalizability with the best average result of 72.4% and the best VoxCeleb1 result of 57.6%. Extensive ablation experiments revealed that the BYOL-A encoder architecture contributes most of the performance, and the remaining critical portion is attributable to the BYOL framework and the BYOL-A augmentations. Our code is available online for future studies.
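As an illustrative sketch of the encoder idea only (the real BYOL-A encoder follows the paper and the official code), the following PyTorch snippet combines frame-wise local CNN features with globally transformed features and summarizes them with temporal mean and max statistics; all layer sizes are assumptions.

```python
# Illustrative sketch (not the official BYOL-A encoder): combine local and global
# frame features, then pool statistics (mean and max) over time.
import torch
import torch.nn as nn


class LocalGlobalStatEncoder(nn.Module):
    def __init__(self, mels=64, conv_ch=32, dim=256):
        super().__init__()
        self.local = nn.Sequential(  # local features from a small CNN
            nn.Conv2d(1, conv_ch, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.global_fc = nn.Sequential(  # global transform applied frame-wise
            nn.Linear(conv_ch * mels // 2, dim), nn.ReLU(),
        )

    def forward(self, x):  # x: (B, 1, freq, time)
        h = self.local(x)  # (B, C, F', T) local CNN features
        b, c, f, t = h.shape
        local_frames = h.permute(0, 3, 1, 2).reshape(b, t, c * f)    # per-frame local features
        global_frames = self.global_fc(local_frames)                 # frame-wise global features
        combined = torch.cat([local_frames, global_frames], dim=-1)  # combine local + global
        # temporal statistics: mean and max pooling over time, concatenated
        return torch.cat([combined.mean(dim=1), combined.amax(dim=1)], dim=-1)


enc = LocalGlobalStatEncoder()
print(enc(torch.randn(2, 1, 64, 96)).shape)  # torch.Size([2, 2560])
```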
We propose Audio Difference Captioning (ADC) as a new extension of audio captioning for describing the semantic differences between pairs of similar but slightly different audio clips. ADC addresses the problem that conventional audio captioning sometimes generates similar captions for similar audio clips, failing to describe the differences in their content. We also propose a cross-attention-concentrated transformer encoder that extracts differences by comparing a pair of audio clips, and a similarity-discrepancy disentanglement that emphasizes the difference in the latent space. To evaluate the proposed methods, we built the AudioDiffCaps dataset, which consists of pairs of similar but slightly different audio clips with human-annotated descriptions of their differences. Experiments on the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively, and visualization of the attention weights in the transformer encoder confirmed that they better capture the differences.
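As a schematic illustration of the comparison mechanism only (the actual cross-attention-concentrated encoder and the similarity-discrepancy disentanglement are as in the paper), the following PyTorch sketch lets each frame of one clip attend over the frames of its paired clip and exposes a simple difference signal; the dimensions and the residual-difference step are our assumptions.

```python
# Schematic sketch of cross-attention between the frame embeddings of a clip pair
# (assumption, not the paper's exact encoder).
import torch
import torch.nn as nn

dim, heads = 256, 4
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

clip_a = torch.randn(2, 100, dim)  # (batch, frames, dim) embeddings of clip A
clip_b = torch.randn(2, 100, dim)  # embeddings of the paired clip B

# each frame of clip A attends to all frames of clip B; a large residual between the
# attended context and the original frame suggests content present in only one clip
attended, attn_weights = cross_attn(query=clip_a, key=clip_b, value=clip_b)
difference_features = clip_a - attended  # one simple way to expose the difference
print(difference_features.shape, attn_weights.shape)  # (2, 100, 256) (2, 100, 100)
```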