Large-scale learning of shape and motion models for the 3D face

2020 
Data-driven models of the 3D face are a promising direction for capturing the subtle complexities of the human face, and a central component to numerous applications thanks to their ability to simplify complex tasks. Most data-driven approaches to date were built from either a relatively limited number of samples or by synthetic data augmentation, mainly because of the difficulty in obtaining large-scale and accurate 3D scans of the face. Yet, there is a substantial amount of information that can be gathered when considering publicly available sources that have been captured over the last decade, whose combination can potentially bring forward more powerful models.This thesis proposes novel methods for building data-driven models of the 3D face geometry, and investigates whether improved performances can be obtained by learning from large and varied datasets of 3D facial scans. In order to make efficient use of a large number of training samples we develop novel deep learning techniques designed to effectively handle three-dimensional face data. We focus on several aspects that influence the geometry of the face: its shape components including fine details, its motion components such as expression, and the interaction between these two subspaces.We develop in particular two approaches for building generative models that decouple the latent space according to natural sources of variation, e.g.identity and expression. The first approach considers a novel deep autoencoder architecture that allows to learn a multilinear model without requiring the training data to be assembled as a complete tensor. We next propose a novel non-linear model based on adversarial training that further improves the decoupling capacity. This is enabled by a new 3D-2D architecture combining a 3D generator with a 2D discriminator, where both domains are bridged by a geometry mapping layer.As a necessary prerequisite for building data-driven models, we also address the problem of registering a large number of 3D facial scans in motion. We propose an approach that can efficiently and automatically handle a variety of sequences while making minimal assumptions on the input data. This is achieved by the use of a spatiotemporal model as well as a regression-based initialization, and we show that we can obtain accurate registrations in an efficient and scalable manner.Finally, we address the problem of recovering surface normals from natural images, with the goal of enriching existing coarse 3D reconstructions. We propose a method that can leverage all available image and normal data, whether paired or not, thanks to a new cross-modal learning architecture. Core to our approach is a novel module that we call deactivable skip connections, which allows to transfer the local details from the image to the output surface without hurting the performance when autoencoding modalities, achieving state-of-the-art results for the task.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    0
    References
    0
    Citations
    NaN
    KQI
    []