Long-Tail Visual Relationship Recognition with Hubless Regularized RelMix

2020 
Several approaches have been proposed in recent literature to alleviate the long-tail problem, mostly for the object classification task. We propose to study the task of Long-Tail Visual Relationship Recognition (LTVRR), which aims at generalizing over the structured long-tail distribution of visual relationships (e.g., "rabbit grazing on grass"). In this setup, the subject, relation, and object classes each follow a long-tail distribution. We first introduce two large-scale long-tail visual relationship recognition benchmarks for this task, dubbed VG8K-LT (5,330 objects, 2,000 relations) and GQA-LT (1,703 objects, 310 relations), built upon the widely used Visual Genome and GQA datasets. In contrast to existing benchmarks, some classes appear at very low frequency (1–14 examples). We use these benchmarks to study the performance of several state-of-the-art long-tail models in the LTVRR setup. We develop a visiolinguistic hubless (ViLHub) loss that consistently encourages visual classifiers to be more predictive of tail classes while remaining accurate on the head. We also propose a relationship Mixup augmentation, dubbed RelMix, that improves tail performance on the VG8K-LT and GQA-LT benchmarks, with the best results achieved when it is combined with the ViLHub loss. Benchmarks and code will be made available.
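
The abstract does not spell out the form of the ViLHub loss, but hubless regularizers of this kind are typically implemented by penalizing how far the batch-averaged class probabilities drift from a uniform prior, which discourages a few "hub" head classes from dominating predictions. The following is a minimal PyTorch sketch under that assumption; the function name `vilhub_loss` and the weighting scheme are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def vilhub_loss(logits: torch.Tensor) -> torch.Tensor:
    """Hubless regularizer (sketch): penalize the squared deviation of the
    batch-averaged class probabilities from a uniform prior, so that no
    class acts as a prediction 'hub' at the expense of tail classes."""
    probs = F.softmax(logits, dim=-1)      # (batch, num_classes)
    mean_probs = probs.mean(dim=0)         # average predicted mass per class
    uniform = torch.full_like(mean_probs, 1.0 / mean_probs.numel())
    return ((mean_probs - uniform) ** 2).sum()

# Assumed usage: add the regularizer to the standard classification loss,
# with a hypothetical weight `lam` tuned on a validation set.
# loss = F.cross_entropy(logits, targets) + lam * vilhub_loss(logits)
```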
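RelMix is described as a Mixup-style augmentation over relationship triplets. A plausible reading, sketched below, is to convexly combine the visual features and the one-hot subject/relation/object labels of two randomly paired examples; the exact mixing targets used in the paper may differ.

```python
import torch

def relmix(feats, subj_y, rel_y, obj_y, alpha: float = 1.0):
    """Mixup-style augmentation for relationship triplets (sketch):
    mix features and the one-hot subject/relation/object labels of
    each example with those of a randomly permuted partner."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(feats.size(0))
    mixed_feats = lam * feats + (1.0 - lam) * feats[perm]
    mix = lambda y: lam * y + (1.0 - lam) * y[perm]  # y: one-hot labels
    return mixed_feats, mix(subj_y), mix(rel_y), mix(obj_y)
```

Pairing a tail-class triplet with a head-class one in this way synthesizes training signal for rare subject-relation-object combinations, which is consistent with the reported tail gains.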