Attribute-assisted teacher-critical training strategies for image captioning

2022 
Existing image captioning models are usually trained with cross-entropy (XE) loss and reinforcement learning (RL), which set ground-truth words as hard targets and force the captioning model to learn from them. However, these widely adopted training strategies suffer from misalignment in XE training and inappropriate reward assignment in RL training. To tackle these problems, we introduce an attribute-enhanced teacher model that serves as a bridge between the ground-truth captions and the captioning model by generating easier-to-learn word proposals as soft targets. Most knowledge distillation methods build the teacher model by introducing additional model parameters and training data. We instead construct the teacher model from the ground-truth image attributes, which already exist in the ground-truth captions and can be extracted very easily. To learn effectively from the teacher model, we further propose Teacher-Critical Training Strategies (TCTS) for both XE and RL training, which facilitate more efficient learning by the captioning model. Experimental evaluations of several widely adopted captioning architectures on the benchmark MSCOCO dataset show that the proposed TCTS comprehensively outperforms these baselines in both objective metrics and human evaluations. Our code and pre-trained models will be open-sourced.
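The core idea of replacing hard one-hot targets with the teacher's soft word proposals can be sketched as a soft-target cross-entropy loss. The function name and the toy distributions below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def soft_target_xe(student_logits, teacher_probs):
    """Cross-entropy against a soft target distribution.

    Instead of a one-hot ground-truth word (hard target), the student
    matches the teacher's word-proposal distribution (soft target).
    student_logits: (vocab,) unnormalized scores from the captioning model
    teacher_probs:  (vocab,) soft targets produced by the teacher model
    """
    z = student_logits - student_logits.max()        # numerical stability
    log_probs = z - np.log(np.exp(z).sum())          # log-softmax
    return -float((teacher_probs * log_probs).sum()) # H(teacher, student)

# Toy example over a 3-word vocabulary: the teacher spreads probability
# mass over two plausible words instead of forcing a single hard target.
logits = np.array([2.0, 1.0, 0.1])
teacher = np.array([0.7, 0.3, 0.0])
loss = soft_target_xe(logits, teacher)
```

With a one-hot `teacher_probs`, this reduces to the standard XE loss, so the hard-target objective is a special case of the soft-target one.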