Multimodal Item Categorization Fully Based on Transformer

Lei Chen, Houwei Chou, Yandi Xia, Hirokazu Miyake

2021

The Transformer has proven to be a powerful feature extraction method and has gained widespread adoption in natural language processing (NLP). In this paper we propose a multimodal item categorization (MIC) system solely based on the Transformer for both text and image processing. On a multimodal product data set collected from a Japanese e-commerce giant, we tested a new image classification model based on the Transformer and investigated different ways of fusing bi-modal information. Our experimental results on real industry data showed that the Transformer-based image classifier has performance on par with ResNet-based classifiers and is four times faster to train. Furthermore, a cross-modal attention layer was found to be critical for the MIC system to achieve performance gains over text-only and image-only models.
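
To make the idea of a cross-modal attention layer concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention in which text token features act as queries over image patch features. The abstract does not say how the layer is wired, so the direction of attention (text attending to image), the single head, the random projection matrices, and all names (`cross_modal_attention`, `w_q`, `w_k`, `w_v`) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(text, image, w_q, w_k, w_v):
    """Let each text token attend over all image patches.

    text:  (n_text, d_model) text token features (queries)
    image: (n_img, d_model) image patch features (keys and values)
    Assumption: single head, no masking, no output projection.
    """
    q = text @ w_q                             # (n_text, d_k)
    k = image @ w_k                            # (n_img, d_k)
    v = image @ w_v                            # (n_img, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_text, n_img)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights                # fused features, attention map


# Hypothetical sizes: 5 text tokens, 16 image patches, 8-dim features.
rng = np.random.default_rng(0)
d_model = 8
text = rng.normal(size=(5, d_model))
image = rng.normal(size=(16, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

fused, weights = cross_modal_attention(text, image, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` shows how strongly one text token draws on each image patch, and `fused` holds the image-informed text features that a downstream classifier could consume. In a trained system the projection matrices would be learned rather than sampled at random.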

Keywords:

Image classifier
Pattern recognition
Artificial intelligence
Transformer
Contextual image classification
Set (abstract data type)
Layer (object-oriented design)
Categorization
Computer science
Feature extraction
Image processing
