Hybrid Fusion with Intra- and Cross-Modality Attention for Image-Recipe Retrieval

2021 
Image-recipe retrieval, which aims at retrieving the relevant recipe given a food image and vice versa, is attracting widespread attention, since sharing food-related images and recipes on the Internet has become a popular trend. Existing methods formulate this problem as a typical cross-modal retrieval task by learning an image-recipe similarity. Although these methods have made inspiring progress on image-recipe retrieval, they may still be less effective at jointly incorporating three crucial points: (1) the association between ingredients and instructions, (2) fine-grained image information, and (3) the latent alignment between recipes and images. To this end, we propose a novel framework named Hybrid Fusion with Intra- and Cross-Modality Attention (HF-ICMA) to learn an accurate image-recipe similarity. Our HF-ICMA model adopts an intra-recipe fusion module to focus on the interaction between ingredients and instructions within a recipe, further enriching the representations of the two separate embeddings. Meanwhile, an image-recipe fusion module is devised to explore the potential relationship between fine-grained image regions and ingredients from the recipe, which jointly forms the final image-recipe similarity from both local and global aspects. Extensive experiments on the large-scale benchmark dataset Recipe1M show that our model significantly outperforms state-of-the-art approaches in various image-recipe retrieval scenarios.
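To make the two attention mechanisms concrete, the sketch below illustrates one plausible reading of the described architecture: ingredients and instructions attend to each other (intra-recipe fusion), image regions attend to the fused ingredient features (image-recipe fusion), and local and global cosine similarities are combined into a hybrid score. All module names, tensor dimensions, and the exact similarity formulation here are assumptions made for illustration; they are not the authors' released implementation.

```python
# Minimal sketch of intra- and cross-modality attention fusion in the
# spirit of HF-ICMA. Names, dimensions, and the hybrid score below are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridFusionSketch(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Intra-recipe fusion: ingredients and instructions attend to
        # each other, enriching both embeddings with shared context.
        self.ing_to_ins = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ins_to_ing = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Image-recipe fusion: fine-grained image regions attend to the
        # (already fused) ingredient features.
        self.img_to_ing = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ingredients, instructions, regions, img_global):
        # ingredients:  (B, n_ing, dim)  ingredient token embeddings
        # instructions: (B, n_ins, dim)  instruction sentence embeddings
        # regions:      (B, n_reg, dim)  fine-grained image region features
        # img_global:   (B, dim)         global image embedding

        # --- Intra-recipe fusion (attention within the recipe) ---
        ing_ctx, _ = self.ing_to_ins(ingredients, instructions, instructions)
        ins_ctx, _ = self.ins_to_ing(instructions, ingredients, ingredients)
        recipe_global = (ing_ctx.mean(dim=1) + ins_ctx.mean(dim=1)) / 2

        # --- Image-recipe fusion (cross-modality attention) ---
        # Each image region gathers the ingredient features most related
        # to it; agreement between a region and its gathered context
        # serves as a local alignment signal.
        reg_ctx, _ = self.img_to_ing(regions, ing_ctx, ing_ctx)
        local_sim = F.cosine_similarity(regions, reg_ctx, dim=-1).mean(dim=1)

        # --- Hybrid score: local + global similarity ---
        global_sim = F.cosine_similarity(img_global, recipe_global, dim=-1)
        return local_sim + global_sim  # (B,) image-recipe similarity


# Usage with random features, e.g. 12 ingredients, 8 instructions,
# and 36 detected image regions per sample:
model = HybridFusionSketch()
score = model(torch.randn(2, 12, 512), torch.randn(2, 8, 512),
              torch.randn(2, 36, 512), torch.randn(2, 512))
```

The key design point the abstract emphasizes is that the local (region-to-ingredient) and global (image-to-recipe) signals are combined into a single similarity, rather than relying on global embeddings alone.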