Realistic speech-driven 3D facial animation is a challenging problem due to the complex relationship between speech and face. In this paper, we propose a deep architecture, called Geometry-guided Dense Perspective Network (GDPnet), to achieve speaker-independent realistic 3D facial animation. The encoder is designed with dense connections to strengthen feature propagation and encourage the re-use of audio features, and the decoder is integrated with an attention mechanism to adaptively recalibrate point-wise feature responses by explicitly modeling interdependencies between different neuron units. We also introduce a non-linear face reconstruction representation as a guidance of latent space to obtain more accurate deformation, which helps solve the geometry-related deformation and is good for generalization across subjects. Huber and HSIC (Hilbert-Schmidt Independence Criterion) constraints are adopted to promote the robustness of our model and to better exploit the non-linear and high-order correlations. Experimental results on the public dataset and real scanned dataset validate the superiority of our proposed GDPnet compared with state-of-the-art model. The code is available for research purposes at http://cic.tju.edu.cn/faculty/likun/projects/GDPnet.
Humans commonly work with multiple objects in daily life and can intuitively transfer manipulation skills to novel objects by understanding object functional regularities. However, existing technical approaches for analyzing and synthesizing hand-object manipulation are mostly limited to handling a single hand and object due to the lack of data support. To address this, we construct TACO, an extensive bimanual hand-object-interaction dataset spanning a large variety of tool-action-object compositions for daily human activities. TACO contains 2.5K motion sequences paired with third-person and egocentric views, precise hand-object 3D meshes, and action labels. To rapidly expand the data scale, we present a fully automatic data acquisition pipeline combining multi-view sensing with an optical motion capture system. With the vast research fields provided by TACO, we benchmark three generalizable hand-object-interaction tasks: compositional action recognition, generalizable hand-object motion forecasting, and cooperative grasp synthesis. Extensive experiments reveal new insights, challenges, and opportunities for advancing the studies of generalizable hand-object motion analysis and synthesis. Our data and code are available at https://taco2024.github.io.
Reconstructing hand-held objects from monocular RGB images is an appealing yet challenging task. In this task, contacts between hands and objects provide important cues for recovering the 3D geometry of the hand-held objects. Though recent works have employed implicit functions to achieve impressive progress, they ignore formulating contacts in their frameworks, which results in producing less realistic object meshes. In this work, we explore how to model contacts in an explicit way to benefit the implicit reconstruction of hand-held objects. Our method consists of two components: explicit contact prediction and implicit shape reconstruction. In the first part, we propose a new subtask of directly estimating 3D hand-object contacts from a single image. The part-level and vertex-level graph-based transformers are cascaded and jointly learned in a coarse-to-fine manner for more accurate contact probabilities. In the second part, we introduce a novel method to diffuse estimated contact states from the hand mesh surface to nearby 3D space and leverage diffused contact probabilities to construct the implicit neural representation for the manipulated object. Benefiting from estimating the interaction patterns between the hand and the object, our method can reconstruct more realistic object meshes, especially for object parts that are in contact with hands. Extensive experiments on challenging benchmarks show that the proposed method outperforms the current state of the arts by a great margin. Our code is publicly available at https://junxinghu.github.io/projects/hoi.html.
The latest trends in the research field of single-view human reconstruction devote to learning deep implicit functions constrained by explicit body shape priors. Despite the remarkable performance improvements compared with traditional processing pipelines, existing learning approaches still show different aspects of limitations in terms of flexibility, generalizability, robustness, and/or representation capability. To comprehensively address the above issues, in this paper, we investigate an explicit point-based human reconstruction framework called HaP, which adopts point clouds as the intermediate representation of the target geometric structure. Technically, our approach is featured by fully-explicit point cloud estimation, manipulation, generation, and refinement in the 3D geometric space, instead of an implicit learning process that can be ambiguous and less controllable. The overall workflow is carefully organized with dedicated designs of the corresponding specialized learning components as well as processing procedures. Extensive experiments demonstrate that our framework achieves quantitative performance improvements of 20% to 40% over current state-of-the-art methods, and better qualitative results. Our promising results may indicate a paradigm rollback to the fully-explicit and geometry-centric algorithm design, which enables to exploit various powerful point cloud modeling architectures and processing techniques. We will make our code and data publicly available at https://github.com/yztang4/HaP.
This paper proposes a new method for fast human motion capture based on a single RGB-D sensor. By leveraging the human pose detection results for reinitializing the ICP-based sequential human motion tracking algorithm when tracking failure happens, our system achieves a highly robust and accurate human motion tracking performance even for fast motion. Moreover, the calculation and utilization of semantic tracking loss enable body-part-level motion tracking refinement, which is better than whole-body refinement. Finally, a simple yet effective post-processing method, semantic bidirectional motion blending, for single-view human motion capture is proposed to further improve the tracking accuracy, especially under severe occlusions and fast motion. The results and experiments demonstrate that the proposed method achieves highly accurate and robust human motion capture performance in a very efficient way. Applications include AR&VR, human motion analysis, movie, and gaming.
In this study, we propose a modeling-based compression approach for dense/lenslet light field images captured by Plenoptic 2.0 with square microlenses. This method employs the 5-D Epanechnikov Kernel (5-D EK) and its associated theories. Owing to the limitations of modeling larger image block using the Epanechnikov Mixture Regression (EMR), a 5-D Epanechnikov Mixture-of-Experts using Gaussian Initialization (5-D EMoE-GI) is proposed. This approach outperforms 5-D Gaussian Mixture Regression (5-D GMR). The modeling aspect of our coding framework utilizes the entire EI and the 5D Adaptive Model Selection (5-D AMLS) algorithm. The experimental results demonstrate that the decoded rendered images produced by our method are perceptually superior, outperforming High Efficiency Video Coding (HEVC) and JPEG 2000 at a bit depth below 0.06bpp.