Seeing and Hearing Egocentric Actions: How Much Can We Learn?

Alejandro Cartas, Jordi Luque, Petia Radeva, Carlos Segura, Mariella Dimiccoli

2019

Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed through a single modality, and only a few works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
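
The two ingredients named in the abstract, sparse temporal sampling and late fusion, can be illustrated with a minimal sketch. The abstract does not give the number of segments, the fusion rule, or the stream weights, so everything here is an assumption made for illustration: the function names `sparse_sample_indices` and `late_fusion` are hypothetical, sampling draws one frame per equal-length segment, and fusion is a weighted average of per-class scores.

```python
import random


def sparse_sample_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of num_segments equal-length segments.

    Assumption: one uniformly random frame per segment; the paper's exact
    sampling scheme is not stated in the abstract.
    """
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    # Segment i covers the half-open range [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def late_fusion(stream_scores, weights=None):
    """Combine per-class scores from several streams by weighted averaging.

    Assumption: a weighted average is one common late-fusion rule; it is
    not necessarily the rule used by the authors.
    """
    names = list(stream_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    num_classes = len(stream_scores[names[0]])
    return [
        sum(weights[name] * stream_scores[name][c] for name in names) / total
        for c in range(num_classes)
    ]


if __name__ == "__main__":
    # Hypothetical scores for two verb classes from the three streams.
    scores = {
        "audio": [0.1, 0.9],
        "spatial": [0.6, 0.4],
        "temporal": [0.7, 0.3],
    }
    fused = late_fusion(scores)
    predicted = max(range(len(fused)), key=fused.__getitem__)
    print(sparse_sample_indices(100, 4, random.Random(0)), fused, predicted)
```

In this sketch each stream is scored independently and only the class scores are merged, which is what distinguishes late fusion from combining features inside a single network.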

Keywords:

Modalities
Computer science
Human–computer interaction
Artificial intelligence
Sampling (statistics)
Machine learning
Verb
Action recognition
