Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events

Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu

2021

Automated audio captioning is a cross-modal task that generates natural language descriptions summarizing the sound events in an audio clip. However, grounding the actual sound events in a given audio clip based on its corresponding caption has not been investigated. This paper contributes the AudioGrounding dataset, which provides the correspondence between sound events and the captions in AudioCaps, along with the location (timestamps) of each sound event present. Based on this dataset, we propose the text-to-audio grounding (TAG) task, which interactively considers the relationship between audio processing and language understanding. A baseline approach is provided, achieving an event-F1 score of 28.3% and a Polyphonic Sound Detection Score (PSDS) of 14.7%.
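To make the task concrete, the sketch below shows one way a grounding annotation (a caption phrase paired with timestamps) could be represented, and how an event-level F1 score could be computed against predicted segments. The `GroundingItem` layout, the field names, and the 0.2-second collar applied to both onset and offset are illustrative assumptions, not the dataset's actual schema or the paper's exact evaluation protocol; PSDS is not covered here.

```python
from dataclasses import dataclass
from typing import List, Tuple

Segment = Tuple[float, float]  # (onset, offset) in seconds


@dataclass
class GroundingItem:
    """Hypothetical record: one caption phrase and where its sound event occurs."""
    audio_id: str
    caption: str
    phrase: str
    segments: List[Segment]


def matches(ref: Segment, pred: Segment, collar: float = 0.2) -> bool:
    """Assumed criterion: onset and offset must each lie within `collar` seconds."""
    return abs(ref[0] - pred[0]) <= collar and abs(ref[1] - pred[1]) <= collar


def event_f1(refs: List[Segment], preds: List[Segment], collar: float = 0.2) -> float:
    """Greedy one-to-one matching of predictions to references, then F1."""
    unmatched = list(refs)
    tp = 0
    for pred in preds:
        for ref in unmatched:
            if matches(ref, pred, collar):
                unmatched.remove(ref)  # each reference can be matched only once
                tp += 1
                break
    fp = len(preds) - tp
    fn = len(refs) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up example data for illustration only.
item = GroundingItem(
    audio_id="example_clip",
    caption="A dog barks while a car passes by",
    phrase="A dog barks",
    segments=[(0.0, 2.5), (6.0, 9.0)],
)
predicted = [(0.1, 2.4), (4.0, 5.0)]
score = event_f1(item.segments, predicted)
print(f"event-F1 for '{item.phrase}': {score:.2f}")
```

In this made-up example, one predicted segment matches a reference, one is spurious, and one reference is missed, so precision and recall are both 0.5.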

Keywords:

Event (computing)
Audio signal processing
Natural language
Sound detection
Task
Sound (geography)
Timestamp
Natural language processing
Artificial intelligence
Computer science
Closed captioning
