Intrinsic Temporal Regularization for High-resolution Human Video Synthesis

2021 
Fashion video synthesis has attracted increasing attention due to its substantial potential in immersive media, virtual reality, and online retail applications, yet traditional 3D graphics pipelines often require extensive manual labor for data capture and model rigging. In this paper, we investigate an image-based approach to this problem that generates a fashion video clip from a still source image of the desired outfit, which is then animated frame by frame under the guidance of a driving video. A key challenge for this task lies in modeling the feature transformation between source and driving frames: fine-grained transforms help preserve visual details in garment regions, but often at the expense of intensified temporal flickering. To resolve this dilemma, we propose a novel framework with 1) a multi-scale transform estimation and feature fusion module to preserve fine-grained garment details, and 2) an intrinsic regularization loss that enforces temporal consistency of the learned transforms between adjacent frames. Our solution is capable of generating 512×512 fashion videos with rich garment details and smooth fabric movements, surpassing existing results. Extensive experiments on the FashionVideo benchmark dataset demonstrate the superiority of the proposed framework over several competitive baselines.
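The abstract does not spell out the exact form of the intrinsic regularization loss. As an illustration only, the sketch below shows one common way to enforce temporal consistency of per-frame warping transforms: penalizing the difference between flow fields estimated for adjacent driving frames. The function name, the flow-field representation, and the weighting factor `lambda_temporal` are assumptions for this sketch, not the paper's formulation.

```python
import torch

def temporal_transform_consistency(flows: torch.Tensor) -> torch.Tensor:
    """Hypothetical temporal regularizer on per-frame transforms.

    Args:
        flows: tensor of shape (T, 2, H, W) holding the source-to-frame
            warping flow estimated for each of the T driving frames.

    Returns:
        Scalar penalty that grows when the transform changes abruptly
        between adjacent frames.
    """
    # Difference between transforms of consecutive frames: (T-1, 2, H, W).
    diff = flows[1:] - flows[:-1]
    # L1 penalty averaged over all positions; an L2 or Charbonnier penalty
    # would serve the same purpose.
    return diff.abs().mean()


if __name__ == "__main__":
    # Toy usage with random flows standing in for an estimator's output.
    flows = torch.randn(8, 2, 512, 512)          # 8 frames of 512x512 flow
    lambda_temporal = 0.1                         # assumed loss weight
    reg = lambda_temporal * temporal_transform_consistency(flows)
    print(float(reg))
```

In a training loop, such a term would typically be added to the synthesis losses (reconstruction, perceptual, adversarial) so that detail-preserving, fine-grained transforms remain smooth across time.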