The collar is an important part of a garment and reflects its style; the collar classification task is to recognize the collar type in an apparel image. In this paper, we design a novel convolutional module called MFA (multi-scale features attention) to address the problems of high noise, small recognition targets, and unsatisfactory classification accuracy in collar feature recognition. MFA first extracts multi-scale features from the input feature map and then encodes them into an attention weight vector that enhances the representation of important parts, thus improving the ability of the convolutional block to combat noise and extract features of small target objects. The computational overhead of the MFA module is further reduced by using depthwise separable convolution. Experiments on the collar dataset Collar6 and the apparel dataset DeepFashion6 (a subset of the DeepFashion database) show that the proposed network, MFANet, achieves better classification performance on complex collar images than most current mainstream convolutional neural networks, with less computational overhead. Experiments on the standard dataset CIFAR-10 show that MFANet also outperforms current mainstream image classification algorithms.
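The abstract describes the MFA module as multi-scale feature extraction followed by an attention weight vector, with depthwise separable convolution to cut cost. A minimal PyTorch sketch of that idea, assuming per-branch kernel sizes and a sigmoid channel-weighting scheme that the abstract does not specify:

```python
import torch
import torch.nn as nn

class MFA(nn.Module):
    """Hypothetical reconstruction of a multi-scale features attention
    block; kernel sizes (3/5/7) and the sigmoid gating are assumptions,
    not the authors' exact design."""
    def __init__(self, channels, scales=(3, 5, 7)):
        super().__init__()
        # One depthwise-separable branch per scale: a depthwise conv
        # followed by a 1x1 pointwise conv keeps the overhead low.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2,
                          groups=channels, bias=False),   # depthwise
                nn.Conv2d(channels, channels, 1, bias=False),  # pointwise
            )
            for k in scales
        )
        # Encode pooled multi-scale features into a per-channel
        # attention weight vector.
        self.fc = nn.Sequential(
            nn.Linear(channels * len(scales), channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        # Global-average-pool each scale, then concatenate: (B, C*scales)
        pooled = torch.cat([f.mean(dim=(2, 3)) for f in feats], dim=1)
        w = self.fc(pooled).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return x * w  # reweight the input by the attention vector

x = torch.randn(2, 16, 32, 32)
y = MFA(16)(x)
print(y.shape)  # torch.Size([2, 16, 32, 32])
```

Because the attention vector only rescales channels, the module can be dropped into an existing convolutional block without changing its output shape.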
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. However, they still suffer from weak local feature extraction, easy loss of channel interaction information in one-dimensional multi-head self-attention modelling, and a large number of parameters. This paper proposes a new hybrid architecture, EPSViTs (Efficient Parameter-Shared Transformers). Firstly, a new local feature extraction module is designed to effectively enhance the expression of local features. Secondly, using a parameter-sharing approach, a multi-head self-attention module based on information interaction is designed, which models the image globally from both the spatial and channel dimensions and mines the potential correlations of the image in space and channels. Extensive experiments are conducted on three public datasets, a subset of ImageNet, CIFAR-100, and APTOS2019, as well as a private dataset, Mushroom66. The results show that the proposed hybrid architecture EPSViTs, based on parameter-shared multi-head self-attention, has clear advantages for image classification, reaching 89.18% accuracy on the ImageNet subset, a 3.8% improvement over EdgeViTs-XXS, verifying the effectiveness of the model.
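The abstract's parameter-shared attention over both spatial and channel dimensions can be illustrated with a short PyTorch sketch. The sharing scheme below, where a single set of Q/K/V projections feeds both a position-by-position attention map and a channel-by-channel one, is my reading of the abstract, not the authors' exact module:

```python
import torch
import torch.nn as nn

class SharedDualAttention(nn.Module):
    """Hypothetical single-head sketch: one shared Q/K/V projection
    drives both spatial (N x N) and channel (C x C) attention, so
    channel interactions are modelled without extra parameters."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=False)  # shared projections
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):           # x: (B, N, C), N spatial tokens
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Spatial branch: (N x N) map, positions attend to positions.
        a_s = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
        s = a_s @ v                                    # (B, N, C)
        # Channel branch: reuse the SAME q, k, v; the (C x C) map lets
        # channels attend to each other, preserving channel interaction.
        a_c = torch.softmax(q.transpose(1, 2) @ k / N ** 0.5, dim=-1)
        c = v @ a_c                                    # (B, N, C)
        return self.proj(s + c)

x = torch.randn(2, 16, 32)          # 16 tokens, 32 channels
y = SharedDualAttention(32)(x)
print(y.shape)  # torch.Size([2, 16, 32])
```

Reusing the projections for both branches keeps the parameter count at that of a single attention module while still modelling the image globally in both dimensions, which matches the efficiency claim in the abstract.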