The high efficiency and accuracy of semi-global matching (SGM) make it widely used in many stereo vision applications. However, SGM not only struggles with pixels in homogeneous areas but also suffers from streak artifacts. In this paper, we propose a novel omni-directional SGM (OmniSGM) with a cost volume update scheme to aggregate costs from paths along all directions and to encourage reliable information to propagate across the entire image. Specifically, we perform SGM along four tree structures, namely trees to the left, right, top, and bottom of the root node, and then fuse the outputs to obtain the final result. The contributions of the pixels on each tree can be computed recursively from the leaf nodes to the root node, so our method has linear time complexity. Moreover, an iterative cost volume update scheme is proposed that uses the aggregated cost from the previous pass to enhance the robustness of the initial matching cost. Thus, useful information is more likely to propagate over long distances and resolve ambiguities in low-texture areas. Finally, we present an efficient strategy that propagates the disparities of stable pixels along a minimum spanning tree (MST) for disparity refinement. Extensive experiments in stereo matching on the Middlebury and KITTI datasets demonstrate that our method outperforms typical traditional SGM-based cost aggregation methods.
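To make the recursion concrete, below is a minimal sketch of a single-direction (left-to-right) scanline SGM pass. It is the textbook SGM recurrence rather than the paper's tree-structured omni-directional variant, and the penalties P1/P2 are illustrative values.

```python
import numpy as np

def sgm_pass_left_to_right(cost, P1=10.0, P2=120.0):
    """One left-to-right SGM aggregation pass over a (H, W, D) cost volume.

    Classic scanline recurrence (a sketch, not the paper's tree variant):
    L(p, d) = C(p, d) + min(L(p-1, d),
                            L(p-1, d-1) + P1, L(p-1, d+1) + P1,
                            min_k L(p-1, k) + P2) - min_k L(p-1, k)
    """
    cost = np.asarray(cost, dtype=np.float64)
    H, W, D = cost.shape
    agg = cost.copy()                               # first column has no predecessor
    for x in range(1, W):
        prev = agg[:, x - 1]                        # (H, D) costs at previous pixel
        prev_min = prev.min(axis=1, keepdims=True)
        same = prev                                 # same disparity, no penalty
        up   = np.pad(prev[:, 1:],  ((0, 0), (0, 1)), constant_values=np.inf) + P1
        down = np.pad(prev[:, :-1], ((0, 0), (1, 0)), constant_values=np.inf) + P1
        jump = prev_min + P2                        # arbitrary disparity jump
        best = np.minimum(np.minimum(same, up), np.minimum(down, jump))
        agg[:, x] = cost[:, x] + best - prev_min    # subtract min to bound growth
    return agg
```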
Computing disparities for pixels in weakly textured areas has always been a difficult task in stereo vision. Non-local methods based on a minimum spanning tree (MST) construct content-adaptive support regions to perform cost aggregation. However, they often introduce disparity errors on slanted surfaces and are sensitive to noise and highly textured regions. Window-based methods, in turn, are not effective at propagating information. To overcome these problems, this paper proposes an approximate geodesic distance tree filter, which uses geodesic distance as the pixel similarity metric and recursive techniques to perform the filtering process. The filtering is performed recursively in four directions (from the top-left and top-right corners, and their reverses), which gives our filter linear complexity. Our filter has two advantages: (1) the pixel similarity metric is an approximate geodesic distance; (2) the computational complexity is linear in the number of image pixels. For these reasons, the proposed method properly copes with cost aggregation in textureless regions and preserves the boundaries of disparity maps. We demonstrate the strength of our filter in several applications.
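As an illustration of the recursive idea, the sketch below runs one left-to-right pass in which the feedback weight decays with the color step between neighbors, so the accumulated attenuation behaves like a geodesic distance along the scanline. The weight form and sigma are assumptions for illustration, not the paper's exact filter, and a full filter would combine multiple directions and normalize.

```python
import numpy as np

def recursive_geodesic_pass(cost, guide, sigma=0.1):
    """One left-to-right recursive filtering pass (sketch).

    cost:  (H, W) values to smooth, e.g., one disparity slice of a cost volume.
    guide: (H, W) grayscale guidance image in [0, 1].
    Each step multiplies by exp(-|color step| / sigma), so the total weight
    between two pixels decays with the accumulated color variation along the
    path, i.e., an (approximate) geodesic distance on the scanline.
    """
    out = cost.astype(np.float64).copy()
    # per-edge weights: a large color difference blocks propagation
    a = np.exp(-np.abs(np.diff(guide, axis=1)) / sigma)   # (H, W-1)
    for x in range(1, out.shape[1]):
        out[:, x] += a[:, x - 1] * out[:, x - 1]
    return out
```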
Camera calibration plays a crucial role in 3D measurement tasks in machine vision. In typical calibration processes, camera parameters are iteratively optimized through the forward imaging process (FIP). However, the results can only guarantee the minimum 2D projection error on the image plane, not the minimum 3D reconstruction error. In this paper, we propose a universal method for camera calibration that uses the back projection process (BPP). In our method, a forward projection model is used to obtain initial intrinsic and extrinsic parameters with a popular planar checkerboard pattern. Then, the extracted image points are projected back into 3D space and compared with the ideal point coordinates. Finally, the estimate of the camera parameters is refined by non-linear minimization. The proposed method obtains a more accurate and more physically meaningful calibration result. Simulated and practical data demonstrate the accuracy of the proposed method.
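For intuition, the following sketch back-projects (already undistorted) image points onto the checkerboard plane via the plane-to-image homography; the resulting board-frame points can be compared with the ideal corner grid to form 3D residuals for refinement. This is an illustrative reconstruction of the idea, not the paper's exact formulation.

```python
import numpy as np

def backproject_to_board(uv, K, R, t):
    """Back-project pixel coordinates onto the checkerboard plane (sketch).

    uv: (N, 2) undistorted pixel coordinates.
    K:  3x3 intrinsics; R (3x3), t (3,): board-to-camera rotation/translation.
    Returns (N, 2) points in board coordinates (the plane Z = 0); their
    differences from the ideal corner grid give 3D reconstruction errors.
    """
    pts_h = np.hstack([uv, np.ones((len(uv), 1))])         # homogeneous pixels
    # plane-to-image homography for Z = 0: x ~ K [r1 r2 t] [X Y 1]^T
    H = K @ np.column_stack([R[:, 0], R[:, 1], t])
    board = (np.linalg.inv(H) @ pts_h.T).T                 # invert the mapping
    return board[:, :2] / board[:, 2:3]                    # dehomogenize
```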
The desired solution to many labelling problems in computer vision is a spatially smooth result in which label changes align with the edges of the guidance image. Traditionally, it is obtained by smoothing the label costs with edge-aware filters. However, local filters only incorporate information from a local support region and thus yield locally optimal results, while non-local tree-based filters often overuse piecewise-constant assumptions. In this paper, we propose a spatial-tree filter for cost aggregation. The tree model incorporates spatial affinity into the tree structure, and the tree distance between two pixels on our spatial tree is an approximate geodesic distance, which acts as a pixel similarity metric. The filtering process is implemented with recursive techniques in four directions: top-to-bottom, left-to-right, and vice versa. Thus, the complexity of our approach is linear in the number of image pixels. Extensive experiments demonstrate the effectiveness and efficiency of our spatial-tree filter in image smoothing and stereo matching. Our method outperforms existing tree-based non-local methods in cost aggregation.
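A minimal separable sketch of the four recursive passes is shown below; the edge weight adds a constant spatial step to the color difference, so spatial affinity also accumulates along the path, in the spirit of the spatial tree. The weight form and the constants lam/sigma are assumptions for illustration.

```python
import numpy as np

def edge_weights(diff, lam=0.05, sigma=0.1):
    # color step plus a constant spatial step: the accumulated attenuation
    # then behaves like a geodesic distance with a spatial term (illustrative)
    return np.exp(-(np.abs(diff) + lam) / sigma)

def smooth(cost, guide, lam=0.05, sigma=0.1):
    """Four recursive passes: left/right, then top/bottom (a sketch)."""
    ah = edge_weights(np.diff(guide, axis=1), lam, sigma)  # (H, W-1) horizontal
    av = edge_weights(np.diff(guide, axis=0), lam, sigma)  # (H-1, W) vertical
    out = cost.astype(np.float64)

    fwd, bwd = out.copy(), out.copy()
    for x in range(1, out.shape[1]):                 # left-to-right
        fwd[:, x] += ah[:, x - 1] * fwd[:, x - 1]
    for x in range(out.shape[1] - 2, -1, -1):        # right-to-left
        bwd[:, x] += ah[:, x] * bwd[:, x + 1]
    out = fwd + bwd - out                            # center counted twice

    fwd, bwd = out.copy(), out.copy()
    for y in range(1, out.shape[0]):                 # top-to-bottom
        fwd[y] += av[y - 1] * fwd[y - 1]
    for y in range(out.shape[0] - 2, -1, -1):        # bottom-to-top
        bwd[y] += av[y] * bwd[y + 1]
    return fwd + bwd - out
```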
Self-supervised learning has shown very promising results for monocular depth estimation. Scene structure and local details are both significant clues for high-quality depth estimation. Recent works suffer from a lack of explicit modeling of scene structure and proper handling of detail information, which leads to a performance bottleneck and blurry artefacts in the predicted results. In this paper, we propose the Channel-wise Attention-based Depth Estimation Network (CADepth-Net) with two effective contributions: 1) The structure perception module employs the self-attention mechanism to capture long-range dependencies and aggregates discriminative features along the channel dimension; it explicitly enhances the perception of scene structure and yields better scene understanding and richer feature representations. 2) The detail emphasis module re-calibrates channel-wise feature maps and selectively emphasizes informative features, aiming to highlight crucial local details and fuse features from different levels more efficiently, resulting in more precise and sharper depth predictions. Extensive experiments validate the effectiveness of our method and show that our model achieves state-of-the-art results on the KITTI benchmark and the Make3D dataset.
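As a rough illustration of channel-wise re-calibration in the spirit of the detail emphasis module, the following squeeze-and-excitation-style PyTorch block pools each channel, predicts a per-channel gate, and rescales the feature map. The layer sizes and reduction ratio are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel-wise re-calibration (squeeze-and-excitation style sketch)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # (B, C, 1, 1) summary
            nn.Conv2d(channels, channels // reduction, 1), # squeeze
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), # excite
            nn.Sigmoid(),                                  # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)   # emphasize informative channels, damp the rest
```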
The accuracy and speed of semi-global matching (SGM) make it widely used in many computer vision problems. However, SGM often struggles with pixels in homogeneous regions and also suffers from streak artefacts owing to its weak smoothness constraints. Meanwhile, we observe that global methods usually fail in occluded areas: the disparity of an occluded pixel is typically the average of the disparities of nearby pixels, whereas local methods can propagate information into occluded pixels of similar color. In this paper, we propose a novel, to the best of our knowledge, four-direction global matching with a cost volume update scheme to cope with textureless regions and occlusion. The proposed method makes two changes to the recursive formula: a) the computation considers four visited nodes to enforce stronger smoothness constraints; b) the recursive formula integrates cost filtering to propagate reliable information farther into non-textured regions. Thus, our method inherits the speed of SGM, avoids streaking artefacts, and handles occluded pixels. Extensive experiments in stereo matching on Middlebury demonstrate that our method outperforms typical SGM-based cost aggregation approaches and other state-of-the-art local methods.
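To illustrate how directional passes can be fused and fed back, the sketch below runs a given one-direction aggregation in four orientations, averages the results, and reuses the average as the next pass's matching cost. The fusion and rescaling choices are assumptions for illustration, not the paper's exact update rule.

```python
import numpy as np

def aggregate_and_update(cost, directional_pass, n_iters=2):
    """Fuse four directional aggregations, then feed the result back (sketch).

    cost: (H, W, D) matching cost volume.
    directional_pass: any one-direction aggregation over (H, W, D), e.g., a
    left-to-right scanline SGM pass; the other orientations are obtained by
    flipping or transposing the volume.
    """
    C = cost.astype(np.float64)
    for _ in range(n_iters):
        agg = directional_pass(C)                                     # left-to-right
        agg += directional_pass(C[:, ::-1])[:, ::-1]                  # right-to-left
        Ct = np.swapaxes(C, 0, 1)
        agg += np.swapaxes(directional_pass(Ct), 0, 1)                # top-to-bottom
        agg += np.swapaxes(directional_pass(Ct[:, ::-1])[:, ::-1], 0, 1)  # bottom-to-top
        C = agg / 4.0   # rescale so the cost magnitude stays comparable across passes
    return C
```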
Unsupervised domain adaptation aims to transfer knowledge from a source domain to a target domain so that the target domain data can be recognized without any explicit labelling information for that domain. One limitation of this problem setting is that testing data from the target domain, despite having no labels, are needed during training, which prevents the trained model from being directly applied to classify unseen test instances. We formulate a new cross-domain classification problem arising from real-world scenarios where labelled data are available for a subset of classes (known classes) in the target domain, and we expect to recognize new samples belonging to any class (known or unseen) once the model is learned. This is a generalized zero-shot learning problem in which the side information comes from the source domain in the form of labelled samples rather than the class-level semantic representations commonly used in traditional zero-shot learning. We present a unified domain adaptation framework for both the unsupervised and zero-shot learning conditions. Our approach learns a joint subspace from the source and target domains so that the projections of both domains' data in the subspace are domain invariant and easily separable. We use supervised locality preserving projection (SLPP) as the enabling technique and conduct experiments under both unsupervised and zero-shot learning conditions, achieving state-of-the-art results on three domain adaptation benchmark datasets: Office-Caltech, Office31, and Office-Home.
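For concreteness, a minimal SLPP-style sketch is shown below: same-class pairs receive affinity 1, and the projection is found by solving the standard LPP generalized eigenproblem. The affinity construction, regularizer, and subspace dimension are illustrative choices, not the paper's exact formulation.

```python
import numpy as np
from scipy.linalg import eigh

def slpp(X, y, dim=30, reg=1e-3):
    """Supervised locality preserving projection (minimal sketch).

    X: (n, d) features from both domains stacked together.
    y: (n,) labels, e.g., source labels plus known-class target labels.
    Returns a (d, dim) projection whose columns solve
        X^T L X w = lambda X^T D X w
    for the smallest eigenvalues, pulling same-class samples together.
    """
    W = (y[:, None] == y[None, :]).astype(np.float64)   # same-class affinity
    D = np.diag(W.sum(axis=1))
    L = D - W                                           # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ D @ X + reg * np.eye(X.shape[1])          # regularize for stability
    evals, evecs = eigh(A, B)                           # ascending eigenvalues
    return evecs[:, :dim]

# usage: P = slpp(X, y); Z = X @ P  projects both domains into the subspace
```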
High-accuracy 3D measurement based on a binocular vision system depends heavily on the accurate calibration of the two rigidly fixed cameras. In most traditional calibration methods, stereo parameters are iteratively optimized through the forward imaging process (FIP). However, the results can only guarantee minimal 2D pixel errors, not minimal 3D reconstruction errors. To address this problem, a simple method to calibrate a stereo rig based on the backward projection process (BPP) is proposed. The position of a spatial point can be determined separately from each camera using the planar constraints provided by the planar pattern target. Then, combined with the pre-defined spatial points, the intrinsic and extrinsic parameters of the stereo rig are optimized by minimizing the total 3D error over both the left and right cameras. An extensive performance study of the method in the presence of image noise and lens distortion is conducted. Experiments on synthetic and real data demonstrate the accuracy and robustness of the proposed method.
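As a sketch of the BPP refinement, the snippet below minimizes the stacked left- and right-camera 3D residuals with Levenberg-Marquardt. The backproject_left/backproject_right helpers are hypothetical stand-ins for each camera's plane-constrained point reconstruction; only the structure of the objective is illustrated here.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_stereo(params0, backproject_left, backproject_right, board_pts):
    """Refine stereo parameters by minimizing total 3D errors (sketch).

    params0: initial parameter vector (intrinsics, distortion, extrinsics).
    backproject_left/right: hypothetical callables that, given the current
    parameters, return each camera's (N, 3) board-frame estimates of the
    corner positions. board_pts: the ideal (N, 3) corner grid.
    """
    def residuals(params):
        rl = backproject_left(params) - board_pts     # left-camera 3D errors
        rr = backproject_right(params) - board_pts    # right-camera 3D errors
        return np.concatenate([rl.ravel(), rr.ravel()])

    return least_squares(residuals, params0, method="lm").x
```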