In unsupervised video retargeting, content is transferred from one video to another while preserving the original appearance and style, without any additional annotations. While this challenge has seen substantial advancements through the use of deep neural networks, current methods struggle when the source and target videos depict shapes that differ in limb lengths or other body proportions. In this work, we consider this task for the case of objects of different shapes and appearances that share similar skeleton connectivity and depict similar motion. We introduce JOKR, a JOint Keypoint Representation that captures the geometry common to both videos while being disentangled from their unique styles. Our model first extracts unsupervised keypoints from the given videos. From this representation, two decoders reconstruct geometry and appearance, one for each of the input sequences. By employing an affine-invariant domain confusion term over the keypoints bottleneck, we enforce the unsupervised keypoint representations of both videos to be indistinguishable. This encourages the aforementioned disentanglement between motion and appearance, mapping similar poses from both domains to the same representation. As a result, we can generate a sequence with the appearance and style of one video but the content of the other. We demonstrate the applicability of our method on challenging video pairs, in comparison with state-of-the-art methods. Furthermore, we demonstrate that this geometry-driven representation enables intuitive control, such as temporal coherence and manual pose editing. Videos can be viewed in the supplement HTML.
Diffusion models have opened the path to a wide range of text-based image editing frameworks. However, these typically build on the multi-step nature of the backward diffusion process, and adapting them to distilled, fast-sampling methods has proven surprisingly challenging. Here, we focus on a popular line of text-based editing frameworks: the ``edit-friendly'' DDPM-noise inversion approach. We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength. We trace the artifacts to mismatched noise statistics between inverted noises and the expected noise schedule, and suggest a shifted noise schedule which corrects for this offset. To increase editing strength, we propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts. All in all, our method enables text-based image editing with as few as three diffusion steps, while providing novel insights into the mechanisms behind popular text-based editing approaches.
We advocate the use of point sets to represent shapes. We provide a definition of a smooth manifold surface from a set of points close to the original surface. The definition is based on local maps from differential geometry, which are approximated by the method of moving least squares (MLS). The computation of points on the surface is local, which results in an out-of-core technique that can handle any point set. We show that the approximation error is bounded and present tools to increase or decrease the density of the points, thus allowing an adjustment of the spacing among the points to control the error. To display the point set surface, we introduce a novel point rendering technique. The idea is to evaluate the local maps according to the image resolution. This results in high quality shading effects and smooth silhouettes at interactive frame rates.
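As a rough illustration of the moving least squares idea mentioned in the abstract above, the following sketch evaluates a locally weighted linear fit to scattered one-dimensional samples. It shows only the function-fitting core of MLS, not the surface definition or projection procedure of the paper. The function name `mls_eval`, the Gaussian weight, and the scale parameter `h` are assumptions made for this example.

```python
import math

def mls_eval(x, xs, ys, h=0.5):
    """Evaluate a moving least squares fit at x (1D, local linear basis).

    Hypothetical helper: Gaussian weights with scale h are an assumed choice.
    """
    # Weights decay with distance from the query point, so the fit is local.
    w = [math.exp(-((xi - x) / h) ** 2) for xi in xs]
    s0 = sum(w)
    if s0 == 0.0:
        raise ValueError("no samples within reach of the query point")
    # Weighted moments of the offsets d_i = x_i - x.
    s1 = sum(wi * (xi - x) for wi, xi in zip(w, xs))
    s2 = sum(wi * (xi - x) ** 2 for wi, xi in zip(w, xs))
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * (xi - x) * yi for wi, xi, yi in zip(w, xs, ys))
    # Normal equations for y ~ a + b * d; the fitted value at x is a.
    det = s0 * s2 - s1 * s1
    if abs(det) < 1e-12:
        return t0 / s0  # degenerate neighbourhood: weighted mean
    return (t0 * s2 - t1 * s1) / det

# Samples of a straight line; a local linear fit should reproduce it.
xs = [0.25 * i for i in range(9)]
ys = [2.0 * xi + 1.0 for xi in xs]
value = mls_eval(0.9, xs, ys)
```

Because the weights are recomputed for every query point, the fitted function varies smoothly with `x`. A smaller `h` makes the fit follow the samples more closely, while a larger `h` smooths more.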
Content creators often aim to create personalized images using personal subjects that go beyond the capabilities of conventional text-to-image models. Additionally, they may want the resulting image to encompass a specific location, style, ambiance, and more. Existing personalization methods may compromise personalization ability or the alignment to complex textual prompts. This trade-off can impede the fulfillment of user prompts and subject fidelity. We propose a new approach focusing on personalization methods for a \emph{single} prompt to address this issue. We term our approach prompt-aligned personalization. While this may seem restrictive, our method excels in improving text alignment, enabling the creation of images with complex and intricate prompts, which may pose a challenge for current techniques. In particular, our method keeps the personalized model aligned with a target prompt using an additional score distillation sampling term. We demonstrate the versatility of our method in multi- and single-shot settings and further show that it can compose multiple subjects or use inspiration from reference images, such as artworks. We compare our approach quantitatively and qualitatively with existing baselines and state-of-the-art techniques.
We introduce an interactive technique to extract and manipulate simple 3D shapes in a single photograph. Such extraction requires an understanding of the shape's components, their projections, and their relationships. These cognitive tasks are simple for humans, but particularly difficult for automatic algorithms. Thus, our approach combines the cognitive abilities of humans with the computational accuracy of the machine to create a simple modeling tool. In our interface, the human draws three strokes over the photograph to generate a 3D component that snaps to the outline of the shape. Each stroke defines one dimension of the component. Such human assistance implicitly segments a complex object into its components, and positions them in space. The computer reshapes the component to fit the image of the object in the photograph as well as to satisfy various inferred geometric constraints between components imposed by a global 3D structure. We show that this intelligent interactive modeling tool provides the means to create editable 3D parts quickly. Once the 3D object has been extracted, it can be quickly edited and placed back into photos or 3D scenes, permitting object-driven photo editing tasks which are impossible to perform in image-space.
We review methods designed to compute correspondences between geometric shapes represented by triangle meshes, contours or point sets. This survey is motivated in part by recent developments in space–time registration, where one seeks a correspondence between non-rigid and time-varying surfaces, and semantic shape analysis, which underlines a recent trend to incorporate shape understanding into the analysis pipeline. Establishing a meaningful correspondence between shapes is often difficult because it generally requires an understanding of the structure of the shapes at both the local and global levels, and sometimes the functionality of the shape parts as well. Despite its inherent complexity, shape correspondence is a recurrent problem and an essential component of numerous geometry processing applications. In this survey, we discuss the different forms of the correspondence problem and review the main solution methods, aided by several classification criteria arising from the problem definition. The main categories of classification are defined in terms of the input and output representation, objective function and solution approach. We conclude the survey by discussing open problems and future perspectives.
Motion capture technology has enabled the acquisition of high-quality human motions for animating digital characters with extremely high fidelity. However, despite all the advances in motion editing and synthesis, it remains an open problem to modify pre-captured motions that are highly expressive, such as contemporary dances, for stylization and emotionalization. In this work, we present a novel approach for stylizing such motions by using emotion coordinates defined by Russell's Circumplex Model (RCM). We extract and analyze a large set of body and motion features, based on Laban Movement Analysis (LMA), and choose the effective and consistent features for characterizing emotions of motions. These features provide a mechanism not only for deriving the emotion coordinates of a newly input motion, but also for stylizing the motion to express a different emotion without having to reference the training data. Such decoupling of the training data and new input motions eliminates the necessity of manual processing and motion registration. We implement the two-way mapping between the motion features and emotion coordinates through Radial Basis Function (RBF) regression and interpolation, which can stylize free-style, highly dynamic dance movements at interactive rates. Our results and user studies demonstrate the effectiveness of the stylization framework with a variety of dance movements exhibiting a diverse set of emotions.
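The RBF interpolation named in the abstract above can be sketched as follows: a minimal Gaussian RBF interpolant that maps feature vectors to two-dimensional coordinates. This is not the authors' implementation. The function names, the kernel width `eps`, and the toy feature and emotion values are all hypothetical, chosen only to show the mechanics.

```python
import math

def gaussian(r, eps=1.0):
    # Gaussian radial basis function of the distance r.
    return math.exp(-(eps * r) ** 2)

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def solve(A, B):
    """Solve A X = B (B may have several columns) by Gauss-Jordan elimination."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [[M[i][n + j] / M[i][i] for j in range(len(B[0]))] for i in range(n)]

def rbf_fit(centers, targets, eps=1.0):
    # One weight vector per center, so the interpolant passes through the targets.
    A = [[gaussian(dist(c, d), eps) for d in centers] for c in centers]
    return solve(A, targets)

def rbf_eval(x, centers, weights, eps=1.0):
    phi = [gaussian(dist(x, c), eps) for c in centers]
    return [sum(p * w[j] for p, w in zip(phi, weights))
            for j in range(len(weights[0]))]

# Toy data (made up): 2D "motion features" -> (valence, arousal) coordinates.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emotions = [[-0.5, -0.5], [0.5, -0.2], [-0.3, 0.6], [0.8, 0.9]]
weights = rbf_fit(features, emotions)
estimate = rbf_eval([0.5, 0.5], features, weights)
```

At the training samples the interpolant returns the training targets, up to floating-point error. A mapping in the opposite direction could be fitted the same way by swapping the roles of `features` and `emotions`, although how the paper constructs its two-way mapping is not specified here.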
The long-coveted task of reconstructing 3D geometry from images is still an open problem. In this paper, we build on the power of neural networks and introduce Pix2Vex, a network trained to convert camera-captured images into 3D geometry. We present a novel differentiable renderer ($DR$) as a means of forward validation during training. Our key insight is that $DR$s produce images of a particular appearance, different from typical input images. Hence, we propose adding an image-to-image translation component that converts between these rendering styles. This translation closes the training loop while requiring only minimal supervision, without needing any 3D model as ground truth. Unlike state-of-the-art methods, our $DR$ is $C^\infty$ smooth and thus does not display any discontinuities at occlusions or dis-occlusions. Through our novel training scheme, our network can train on different types of images, whereas previous work can typically only train on images of a similar appearance to those rendered by a $DR$.
Classical approaches to shape correspondence base their computation purely on the properties, in particular geometric similarity, of the shapes in question. Their performance still falls far short of that of humans in challenging cases where corresponding shape parts may differ significantly in geometry or even topology. We posit that in these cases, shape correspondence by humans involves recognition of the shape parts, in which prior knowledge of the parts plays a more dominant role than geometric similarity. We introduce an approach to part correspondence which incorporates prior knowledge imparted by a training set of pre-segmented, labeled models and combines the knowledge with content-driven analysis based on geometric similarity between the matched shapes. First, the prior knowledge is learned from the training set in the form of per-label classifiers. Next, given two query shapes to be matched, we apply the classifiers to assign a probabilistic label to each shape face. Finally, by means of a joint labeling scheme, the probabilistic labels are used synergistically with pairwise assignments derived from geometric similarity to provide the resulting part correspondence. We show that the incorporation of knowledge is especially effective in dealing with shapes exhibiting large intra-class variations. We also show that combining knowledge and content analyses outperforms approaches guided by either attribute alone.