TransFusion: Multi-Modal Fusion for Video Tag Inference via Translation-based Knowledge Embedding

2021 
Tag inference is an important task for video platforms, with wide applications such as recommendation and interpretation. Existing works mainly extract video information from multiple modalities such as frames or music, and then infer tags through classification or object detection. This approach, however, does not apply to generic tags or taxonomy labels that are less directly related to video content, such as a video's originality or its broader category, which are important in practice. In this paper, we claim that these generic tags can be modeled through the semantic relations between videos and tags, and can be exploited together with multi-modal features to achieve better video tagging. We propose TransFusion, an end-to-end supervised learning framework that fuses multi-modal embeddings (e.g., vision, audio, and text) with a knowledge embedding to derive the video representation. To infer diverse tags connected through heterogeneous relations, TransFusion adopts a dual attentive approach that learns both modality importance in fusion and relation importance in inference. Moreover, it is general enough to be used with existing translation-based knowledge embedding approaches. Extensive experiments show that TransFusion outperforms baseline methods, achieving a lower mean rank and at least a 9.59% improvement in HITS@10 on a real-world video knowledge graph.
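The fusion-then-translation idea can be sketched as follows. This is a minimal PyTorch sketch based only on the abstract: the class name TransFusionSketch, the layer choices, the dimensions, and the TransE-style translation score are illustrative assumptions rather than the paper's exact architecture, which is also compatible with other translation-based embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransFusionSketch(nn.Module):
    """Sketch: attentive multi-modal fusion + TransE-style translation score.
    Names, layers, and dimensions are illustrative assumptions, not the paper's
    exact architecture."""

    def __init__(self, modal_dims, embed_dim, num_relations, num_tags):
        super().__init__()
        # Project each modality (e.g., vision, audio, text) into a shared space.
        self.projections = nn.ModuleList(
            [nn.Linear(d, embed_dim) for d in modal_dims]
        )
        # Scores the importance of each modality for fusion.
        self.modality_attn = nn.Linear(embed_dim, 1)
        # One translation vector per relation (TransE-style: head + relation ≈ tail).
        self.relation_emb = nn.Embedding(num_relations, embed_dim)
        # Tag (tail entity) embeddings.
        self.tag_emb = nn.Embedding(num_tags, embed_dim)

    def fuse(self, modal_feats):
        # modal_feats: list of (batch, modal_dim) tensors, one per modality.
        projected = torch.stack(
            [proj(x) for proj, x in zip(self.projections, modal_feats)], dim=1
        )  # (batch, num_modalities, embed_dim)
        # Attention weights over modalities (modality importance in fusion).
        weights = F.softmax(self.modality_attn(torch.tanh(projected)), dim=1)
        return (weights * projected).sum(dim=1)  # fused video representation

    def score(self, modal_feats, relation_ids, tag_ids):
        # Lower score = more plausible (video, relation, tag) triple.
        head = self.fuse(modal_feats)
        rel = self.relation_emb(relation_ids)
        tail = self.tag_emb(tag_ids)
        return torch.norm(head + rel - tail, p=2, dim=-1)


# Toy usage: two modalities, margin-based ranking loss against corrupted tags.
model = TransFusionSketch(modal_dims=[512, 128], embed_dim=64,
                          num_relations=4, num_tags=1000)
vision, audio = torch.randn(8, 512), torch.randn(8, 128)
rel = torch.randint(0, 4, (8,))
pos_tag, neg_tag = torch.randint(0, 1000, (8,)), torch.randint(0, 1000, (8,))
loss = F.relu(1.0 + model.score([vision, audio], rel, pos_tag)
              - model.score([vision, audio], rel, neg_tag)).mean()
loss.backward()
```

The attention over modalities corresponds to the "modality importance in fusion" described above; the relation importance in inference would add an analogous attention over relations, which is omitted here for brevity.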