Content creators often aim to create personalized images using personal subjects that go beyond the capabilities of conventional text-to-image models. Additionally, they may want the resulting image to encompass a specific location, style, ambiance, and more. Existing personalization methods may compromise personalization ability or the alignment to complex textual prompts. This trade-off can impede the fulfillment of user prompts and subject fidelity. We propose a new approach focusing on personalization methods for a \emph{single} prompt to address this issue. We term our approach prompt-aligned personalization. While this may seem restrictive, our method excels in improving text alignment, enabling the creation of images with complex and intricate prompts, which may pose a challenge for current techniques. In particular, our method keeps the personalized model aligned with a target prompt using an additional score distillation sampling term. We demonstrate the versatility of our method in multi- and single-shot settings and further show that it can compose multiple subjects or use inspiration from reference images, such as artworks. We compare our approach quantitatively and qualitatively with existing baselines and state-of-the-art techniques.
We introduce an interactive technique to extract and manipulate simple 3D shapes in a single photograph. Such extraction requires an understanding of the shape's components, their projections, and their relationships. These cognitive tasks are simple for humans, but particularly difficult for automatic algorithms. Thus, our approach combines the cognitive abilities of humans with the computational accuracy of the machine to create a simple modeling tool. In our interface, the human draws three strokes over the photograph to generate a 3D component that snaps to the outline of the shape. Each stroke defines one dimension of the component. Such human assistance implicitly segments a complex object into its components, and positions them in space. The computer reshapes the component to fit the image of the object in the photograph as well as to satisfy various inferred geometric constraints between components imposed by a global 3D structure. We show that this intelligent interactive modeling tool provides the means to create editable 3D parts quickly. Once the 3D object has been extracted, it can be quickly edited and placed back into photos or 3D scenes, permitting object-driven photo editing tasks which are impossible to perform in image-space.
We present an approach for extracting reliefs and details from relief surfaces. We consider a relief surface as a surface composed of two components: a base surface and a height function which is defined over this base. However, since the base surface is unknown, the decoupling of these components is a challenge. We show how to estimate a robust height function over the base, without explicitly extracting the base surface. This height function is utilized to separate the relief from the base. Several applications benefiting from this extraction are demonstrated, including relief segmentation, detail exaggeration and dampening, copying of details from one object to another, and curve drawing on meshes.
Median-shift is a mode seeking algorithm that relies on computing the median of local neighborhoods, instead of the mean. We further combine median-shift with Locality Sensitive Hashing (LSH) and show that the combined algorithm is suitable for clustering large scale, high dimensional data sets. In particular, we propose a new mode detection step that greatly accelerates performance. In the past, LSH was used in conjunction with mean shift only to accelerate nearest neighbor queries. Here we show that we can analyze the density of the LSH bins to quickly detect potential mode candidates and use only them to initialize the median-shift procedure. We use the median, instead of the mean (or its discrete counterpart - the medoid) because the median is more robust and because the median of a set is a point in the set. A median is well defined for scalars but there is no single agreed upon extension of the median to high dimensional data. We adopt a particular extension, known as the Tukey median, and show that it can be computed efficiently using random projections of the high dimensional data onto 1D lines, just like LSH, leading to a tightly integrated and efficient algorithm.
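As a rough illustration of the random-projection idea, the sketch below approximates a Tukey median that is restricted to the points of the set, then uses it for one median-shift step. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the number of directions `n_dirs`, the fixed-radius neighborhood, and the brute-force neighbor search are all hypothetical choices, and the LSH-based mode detection described above is omitted.

```python
import random


def approx_tukey_median(points, n_dirs=50, seed=0):
    """Return the point of `points` with the largest approximate Tukey depth.

    Depth is estimated from random 1D projections: for each direction, a
    point's depth is at most the number of points on its smaller side
    (counting itself). Assumption: n_dirs random Gaussian directions.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    depth = [n] * n
    for _ in range(n_dirs):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        proj = [sum(a * b for a, b in zip(p, direction)) for p in points]
        order = sorted(range(n), key=lambda i: proj[i])
        for rank, i in enumerate(order):
            depth[i] = min(depth[i], rank + 1, n - rank)
    best = max(range(n), key=lambda i: depth[i])
    return points[best]


def median_shift_step(points, center, radius):
    """Move `center` to the approximate Tukey median of its neighborhood.

    Assumption: the neighborhood is a fixed-radius ball found by brute force.
    """
    r2 = radius * radius
    nbrs = [p for p in points
            if sum((a - b) ** 2 for a, b in zip(p, center)) <= r2]
    return approx_tukey_median(nbrs)


points = [(1, 0), (-1, 0), (0, 0), (0, 1), (0, -1), (100, 100)]
median = approx_tukey_median(points)
shifted = median_shift_step(points, (1, 0), 1.5)
```

Because the returned median is always one of the input points, repeated shift steps move along data points only, which matches the property the abstract cites as a reason for preferring the median over the mean.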
We introduce anchored radial observations (ARO), a novel shape encoding for learning implicit field representations of 3D shapes that is category-agnostic and generalizable amid significant shape variations. The main idea behind our work is to reason about shapes through partial observations from a set of viewpoints, called anchors. We develop a general and unified shape representation by employing a fixed set of anchors, via Fibonacci sampling, and designing a coordinate-based deep neural network to predict the occupancy value of a query point in space. Unlike prior neural implicit models that use global shape features, our shape encoder operates on contextual, query-specific features. To predict point occupancy, locally observed shape information from the perspective of the anchors surrounding the input query point is encoded and aggregated through an attention module, before implicit decoding is performed. We demonstrate the quality and generality of our network, coined ARO-Net, on surface reconstruction from sparse point clouds, with tests on novel and unseen object categories, "one-shape" training, and comparisons to state-of-the-art neural and classical methods for reconstruction and tessellation.
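A fixed anchor set of the kind mentioned above could be generated with the standard Fibonacci (golden-angle) sphere lattice. The following is a minimal sketch: the function name, the anchor count of 48, and the unit-sphere placement are assumptions made for illustration and are not taken from the abstract.

```python
import math


def fibonacci_sphere(n):
    """Return n roughly evenly spaced points on the unit sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        # z values are evenly spaced strictly inside (-1, 1).
        z = 1.0 - (2.0 * i + 1.0) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden_angle * i
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points


# Hypothetical anchor count; the abstract does not state one.
anchors = fibonacci_sphere(48)
```

Since the lattice depends only on `n`, the same anchors are obtained for every shape, which is what makes the set usable as a fixed, shape-independent frame of reference.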
Effective resizing of images should not only use geometric constraints, but consider the image content as well. We present a simple image operator called seam carving that supports content-aware image resizing for both reduction and expansion. A seam is an optimal 8-connected path of pixels on a single image from top to bottom, or left to right, where optimality is defined by an image energy function. By repeatedly carving out or inserting seams in one direction we can change the aspect ratio of an image. By applying these operators in both directions we can retarget the image to a new size. The selection and order of seams protect the content of the image, as defined by the energy function. Seam carving can also be used for image content enhancement and object removal. We support various visual saliency measures for defining the energy of an image, and can also include user input to guide the process. By storing the order of seams in an image we create multi-size images, that are able to continuously change in real time to fit a given size.
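The seam definition above can be sketched with a dynamic program over a per-pixel energy map. This is a minimal sketch, assuming the energy has already been computed and is given as a list of rows; the function names are hypothetical, and only the vertical (top-to-bottom) case is shown.

```python
def find_vertical_seam(energy):
    """Return one column index per row for a minimum-energy 8-connected seam.

    `energy` is a list of rows, each a list of non-negative numbers.
    """
    h, w = len(energy), len(energy[0])
    # cost[y][x] = minimum total energy of a seam ending at pixel (y, x).
    cost = [row[:] for row in energy]
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(0, x - 1), min(w - 1, x + 1)
            cost[y][x] += min(cost[y - 1][lo:hi + 1])
    # Backtrack from the cheapest pixel in the bottom row.
    x = min(range(w), key=lambda c: cost[h - 1][c])
    seam = [x]
    for y in range(h - 1, 0, -1):
        lo, hi = max(0, x - 1), min(w - 1, x + 1)
        x = min(range(lo, hi + 1), key=lambda c: cost[y - 1][c])
        seam.append(x)
    seam.reverse()
    return seam


def remove_vertical_seam(image, seam):
    """Delete one pixel per row, narrowing the image by one column."""
    return [row[:x] + row[x + 1:] for row, x in zip(image, seam)]


energy = [[1, 9, 9],
          [9, 1, 9],
          [9, 9, 1]]
seam = find_vertical_seam(energy)
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
narrowed = remove_vertical_seam(image, seam)
```

Repeating the two calls, recomputing the energy after each removal, reduces the width one column at a time. Horizontal seams follow by applying the same routine to the transposed image.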
Most approaches for scene parsing, recognition or retrieval use detectors that are either (i) independently trained or (ii) jointly trained for conjunctions of object-object or object-attribute phrases. We posit that neither of these two extremes is uniformly optimal, in terms of performance, across all categories and conjunctions. The choice of whether one should train an independent or composite detector should be made for each possible conjunction separately, and depends on the statistics of the dataset as well. For example, person holding phone may be more accurately modeled using a single composite detector, while tall person may be more accurately modeled as a combination of two detectors. We extensively study this issue in the context of multiple problems and datasets. Further, for efficiency, we propose a predictor that is based on a number of category-specific features (e.g., sample size, entropy, etc.) of whether an independent or a joint composite detector may be more accurate for a given conjunction. We show that our prediction and selection mechanism generalizes and leads to improved performance on a number of large-scale datasets and vision tasks.
We present an example-based surface reconstruction method for scanned point sets. Our approach uses a database of local shape priors built from a set of given context models that are chosen specifically to match a given scan. Local neighborhoods of the input scan are matched with enriched patches of these models at multiple scales. Hence, instead of using a single prior for reconstruction, our method allows specific regions in the scan to match the prior that fits best. Such high-confidence matches carry relevant information from the prior models to the scan, including normal data and feature classification, and are used to augment the input point set. This allows us to resolve many ambiguities and difficulties that come up during reconstruction, e.g., distinguishing between signal and noise or between gaps in the data and boundaries of the model. We demonstrate how our algorithm, given suitable prior models, successfully handles noisy and under-sampled point sets, faithfully reconstructing smooth regions as well as sharp features.
Traditional image resizing techniques are oblivious to the content of the image when changing its width or height. In contrast, media (i.e., image and video) retargeting takes content into account. For example, one would like to change the aspect ratio of a video without making human figures look too fat or too skinny, or change the size of an image by automatically removing "unnecessary" portions while keeping the "important" features intact. We propose a simple operator, which we term seam carving, to support image and video retargeting. A seam is an optimal 1D path of pixels in an image, or a 2D manifold in a video cube, going from top to bottom, or left to right. Optimality is defined by minimizing an energy function that assigns costs to pixels. We show that computing a seam reduces to a dynamic programming problem for images and a graph min-cut search for video. We demonstrate that several image and video operations, such as aspect ratio correction, size change, and object removal, can be recast as a successive operation of the seam carving operator.