10:15 - 11:15 - May 20 (Thursday)
Session chair: Bob Fisher, University of Edinburgh
Accurate, dense 3D reconstruction is an important requirement in many applications, and stereo represents a viable alternative to active sensors. However, top-ranked stereo algorithms rely on iterative 2D disparity optimization methods for energy minimization that are not well suited to the fast and/or hardware implementations often required in practice. An exception is represented by the approaches that perform disparity optimization in one dimension (1D) by means of scanline optimization (SO) or dynamic programming (DP). Recent SO/DP-based approaches aim to avoid the well-known streaking effect by enforcing vertical consistency between scanlines, by deploying aggregated costs, by aggregating multiple scanlines, or by performing energy minimization on a tree. In this paper we show that the accuracy of two fast SO/DP-based approaches can be dramatically improved by exploiting a non-iterative methodology that, by modeling the coherence among neighboring points, enforces the local consistency of disparity fields. Our proposal obtains top-ranked results on the standard Middlebury dataset and, thanks to its computational structure and reduced memory requirements, is potentially suited to fast and/or hardware implementations.
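For readers unfamiliar with SO, a single 1D pass can be sketched as a dynamic program over the disparity range. This is a minimal, generic sketch, assuming a precomputed per-pixel cost table and the two-penalty smoothness term popularized by semi-global matching; the names `scanline_optimize`, `p1` and `p2` are illustrative, and the code is not the authors' local-consistency method, which the abstract does not detail.

```python
from typing import List

def scanline_optimize(cost: List[List[float]], p1: float, p2: float) -> List[int]:
    """One left-to-right scanline optimization pass (generic sketch).

    cost[x][d]: matching cost of pixel x at disparity d (assumed precomputed).
    p1: penalty for a disparity change of +/-1 between neighboring pixels.
    p2: penalty for any larger disparity jump (p2 >= p1).
    Returns the winner-take-all disparity of the aggregated cost per pixel.
    """
    n_disp = len(cost[0])
    agg = [list(cost[0])]  # first pixel has no predecessor
    for x in range(1, len(cost)):
        prev = agg[-1]
        prev_min = min(prev)
        row = []
        for d in range(n_disp):
            best = prev[d]  # same disparity: no penalty
            if d > 0:
                best = min(best, prev[d - 1] + p1)
            if d < n_disp - 1:
                best = min(best, prev[d + 1] + p1)
            best = min(best, prev_min + p2)  # arbitrary jump
            # Subtracting prev_min keeps values bounded; it does not change the argmin.
            row.append(cost[x][d] + best - prev_min)
        agg.append(row)
    return [row.index(min(row)) for row in agg]

# Pixel 1 has a noisy cost favoring disparity 2; its neighbors favor 0.
costs = [[0, 5, 5], [3, 5, 2], [0, 5, 5]]
print(scanline_optimize(costs, p1=0, p2=0))  # no smoothness: [0, 2, 0]
print(scanline_optimize(costs, p1=1, p2=4))  # smoothed: [0, 0, 0]
```

Because each pixel only looks at its predecessor on the same scanline, the pass is linear in the number of pixels and disparities, which is what makes SO/DP attractive for hardware; it is also why, on its own, it produces the streaking that the approaches above try to suppress.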
We propose a new parameter-free method for detecting planar patches in disparity maps. We first introduce an a contrario decision criterion that may be used to solve two decision problems on configurations of 3D points: (i) Is the configuration well explained by a plane? (ii) What number of planes best explains the configuration? These decision criteria are the core of an algorithm that searches for an optimal explanation of a disparity map by planar patches wherever applicable. This method may be used for 3D reconstruction of urban environments, particularly in the context of low-baseline stereo, where precision requirements are strictest and a pertinent choice of the type and amount of regularization is key to achieving accurate results.
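The a contrario idea can be illustrated with the number of false alarms (NFA) of a single plane hypothesis. This is a minimal sketch, assuming a background model in which each of `n` points independently lies within distance `delta` of a candidate plane with probability `p`, a hypothetical number of tests `n_tests`, and the usual convention that a detection is meaningful when NFA ≤ ε = 1. The paper's actual criterion, including how it compares different numbers of planes, is not reproduced, and all function names are hypothetical.

```python
import math
from typing import Iterable, Tuple

def binomial_tail(n: int, k: int, p: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def count_near_plane(points: Iterable[Tuple[float, float, float]],
                     plane: Tuple[float, float, float, float], delta: float) -> int:
    """Number of points within distance delta of the plane a*x + b*y + c*z + d = 0."""
    a, b, c, d = plane
    norm = math.sqrt(a * a + b * b + c * c)
    return sum(1 for x, y, z in points if abs(a * x + b * y + c * z + d) / norm <= delta)

def nfa(n_tests: int, n: int, k: int, p: float) -> float:
    """Number of false alarms: n_tests times the probability, under the background
    model, that at least k of n points fall near a plane by chance."""
    return n_tests * binomial_tail(n, k, p)

def is_meaningful(n_tests: int, n: int, k: int, p: float, eps: float = 1.0) -> bool:
    return nfa(n_tests, n, k, p) <= eps

pts = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.05), (0.0, 0.0, 1.0)]
print(count_near_plane(pts, (0.0, 0.0, 1.0, 0.0), delta=0.1))  # 2
print(is_meaningful(n_tests=100, n=50, k=50, p=0.1))  # True: all 50 points fit
print(is_meaningful(n_tests=100, n=50, k=5, p=0.1))   # False: 5 of 50 is chance level
```

The appeal of this kind of criterion is that the only threshold, ε, has a direct statistical meaning (the expected number of false detections), which is how a contrario methods avoid hand-tuned parameters.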
Using shadowgrams for 3D reconstruction is inherently ambiguous. It has been shown that this ambiguity has four parameters. We show that, when the light spots are visible to the camera, the number of parameters drops to one.
Current 3D video applications require depth information, which can be acquired in real time by stereo vision systems and ToF cameras. In this paper, a heterogeneous acquisition system is considered, made of two high-resolution standard cameras (a stereo pair) and one ToF camera. The stereo system and the ToF camera must be properly calibrated together in order to operate jointly. This work therefore first introduces a generalized multi-camera calibration technique that exploits not only the luminance (color) information but also the depth information extracted by the ToF camera. A probabilistic algorithm is then derived to obtain high-quality depth information by combining the measurements of the ToF camera and the stereo pair. Experimental results show that the proposed calibration algorithm yields a very accurate calibration, suitable for the fusion algorithm, which in turn allows precise extraction of the depth information.
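One simple way to combine two depth sources per pixel is inverse-variance weighting, which is the maximum-likelihood estimate when both errors are independent and Gaussian. This sketch assumes the ToF and stereo maps are already registered through the joint calibration and that per-pixel variances are available; `fuse_depth` and the numbers below are hypothetical, and the paper's probabilistic algorithm may differ substantially.

```python
from typing import Optional, Tuple

def fuse_depth(d_tof: Optional[float], var_tof: float,
               d_stereo: Optional[float], var_stereo: float) -> Tuple[float, float]:
    """Inverse-variance (maximum-likelihood) fusion of two independent Gaussian
    depth measurements for one pixel. A missing measurement (None) falls back
    to the other sensor. Returns (depth, variance)."""
    if d_tof is None and d_stereo is None:
        raise ValueError("no depth measurement for this pixel")
    if d_stereo is None:
        return d_tof, var_tof
    if d_tof is None:
        return d_stereo, var_stereo
    w_tof, w_stereo = 1.0 / var_tof, 1.0 / var_stereo
    depth = (w_tof * d_tof + w_stereo * d_stereo) / (w_tof + w_stereo)
    return depth, 1.0 / (w_tof + w_stereo)

# Hypothetical values in meters: ToF std 0.1 m, stereo std 0.2 m.
print(fuse_depth(2.0, 0.01, 2.2, 0.04))  # approximately (2.04, 0.008)
```

The fused variance is always smaller than either input variance, and the estimate leans toward the more reliable sensor, which is the basic motivation for combining ToF and stereo in the first place.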
We present a new formulation of the well-known problem of 3D shape-from-texture from a single image, one that can handle uncalibrated perspective cameras. Contrary to previous methods, we cast the task as multi-plane-based camera pose estimation, whereby the information provided by a textured surface makes it possible to perform shape-from-texture and camera focal length estimation jointly. We show that by approximating global perspective by local scaled orthography (which often holds in practice) we can recover depth, surface orientation and focal length from a single image in closed form. This advances the state of the art, in which a calibrated camera is nearly always assumed in order to compute 3D shape from a single image.
The multispectral acquisition of frescoes poses unsolved challenges, the main difficulty being that it is often impossible to measure the reference white signal. We propose a statistical method to estimate the illumination directly from the color signal, based on a modification of the RANSAC algorithm. We apply our method to the estimation of the lighting field of three paintings by contemporary Italian artists and of a fresco of the Castello del Buonconsiglio in Trento (Italy), for which a ground truth was available. Quantitative results show that the performance of our method is good in terms of the relative mean error on illumination and reflectance, while the maximum errors are sometimes significant.
This paper describes an incremental, real-time implementation of J-linkage, a procedure that can detect multiple instances of a model from data corrupted by noise and outliers. It runs in real time thanks to several approximations introduced to get around the quadratic complexity of the original algorithm.
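The core of the original (batch) J-linkage can be sketched for 2D lines: each hypothesis is a line through two points, each point's preference set lists the hypotheses it is consistent with, and clusters are merged greedily by Jaccard distance until every remaining pair of clusters has distance 1. This is a minimal sketch, assuming exhaustive pairwise hypotheses instead of random sampling of minimal sets and a hypothetical inlier threshold `eps`; it shows neither the incremental update nor the real-time approximations described in the paper, and its naive clustering loop is cubic in the number of points.

```python
import itertools
import math
from typing import FrozenSet, List, Sequence, Tuple

Point = Tuple[float, float]

def line_through(p: Point, q: Point) -> Tuple[float, float, float]:
    """Line a*x + b*y + c = 0 through p and q, with (a, b) a unit normal."""
    a, b = q[1] - p[1], p[0] - q[0]
    norm = math.hypot(a, b)
    a, b = a / norm, b / norm
    return a, b, -(a * p[0] + b * p[1])

def jaccard_distance(s: FrozenSet[int], t: FrozenSet[int]) -> float:
    union = s | t
    return 1.0 if not union else 1.0 - len(s & t) / len(union)

def j_linkage(points: Sequence[Point], eps: float) -> List[List[int]]:
    # Hypotheses: one line per pair of distinct points.
    hyps = [line_through(points[i], points[j])
            for i, j in itertools.combinations(range(len(points)), 2)
            if points[i] != points[j]]
    # Preference set of each point: hypotheses whose line passes within eps of it.
    prefs = [frozenset(h for h, (a, b, c) in enumerate(hyps) if abs(a * x + b * y + c) < eps)
             for x, y in points]
    # Each cluster is (member indices, preference set); start from singletons.
    clusters = [([i], prefs[i]) for i in range(len(points))]
    while True:
        best = None
        for u, v in itertools.combinations(range(len(clusters)), 2):
            d = jaccard_distance(clusters[u][1], clusters[v][1])
            if d < 1.0 and (best is None or d < best[0]):
                best = (d, u, v)
        if best is None:  # all remaining pairs have disjoint preference sets
            break
        _, u, v = best
        # A merged cluster's preference set is the intersection of its parts.
        merged = (clusters[u][0] + clusters[v][0], clusters[u][1] & clusters[v][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (u, v)] + [merged]
    return [sorted(members) for members, _ in clusters]

# Two parallel lines, y = 0 and y = 10, with ten points each.
pts = [(float(i), 0.0) for i in range(10)] + [(i + 0.5, 10.0) for i in range(10)]
print(sorted(j_linkage(pts, eps=0.1)))  # two clusters: indices 0-9 and 10-19
```

The pairwise distance scan over all clusters is exactly the cost that motivates the approximations mentioned in the abstract.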
This paper introduces a fast approach for automatic, dense, large-scale 3D urban reconstruction from video. The presented system uses a novel multi-view depthmap fusion algorithm in which the surface is represented by a heightmap. While this model seems a more natural fit for aerial and satellite data, we have found it to also be a powerful representation for ground-level reconstructions. It has the advantage of producing purely vertical facades, and it also yields a continuous surface without holes. Compared to more general 3D reconstruction methods, our algorithm is more efficient, uses less memory, and produces more compact models at the expense of losing some detail. Our GPU implementation can compute a 200×200 heightmap from 64 depthmaps in just 92 milliseconds. We demonstrate our system on a variety of challenging ground-level datasets, including large buildings, residential houses, and storefront facades, obtaining clean, complete, compact, and visually pleasing 3D models.
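The heightmap representation can be illustrated by binning 3D points into a ground-plane grid and keeping a single height per cell; storing one height per cell is what makes the surface between cells purely vertical. This is a deliberately simplified sketch, assuming points already in a world frame with z pointing up and using a per-cell median; the paper's multi-view fusion (which also yields a hole-free surface) is not reproduced, and `build_heightmap` and its parameters are hypothetical.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

def build_heightmap(points: Iterable[Tuple[float, float, float]],
                    x0: float, y0: float, cell: float,
                    nx: int, ny: int) -> List[List[Optional[float]]]:
    """Bin 3D points (z up) into an nx-by-ny ground grid and keep one height per cell.

    Returns rows indexed by y cell and columns by x cell; cells with no points are None.
    """
    bins: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for x, y, z in points:
        i, j = int((x - x0) // cell), int((y - y0) // cell)
        if 0 <= i < nx and 0 <= j < ny:
            bins[(i, j)].append(z)
    # The median is a simple robust choice against stray depth samples.
    return [[statistics.median(bins[(i, j)]) if (i, j) in bins else None
             for i in range(nx)] for j in range(ny)]

pts = [(0.5, 0.5, 1.0), (0.6, 0.4, 3.0), (0.4, 0.5, 2.0), (1.5, 0.5, 5.0)]
print(build_heightmap(pts, x0=0.0, y0=0.0, cell=1.0, nx=2, ny=2))
# [[2.0, 5.0], [None, None]]
```

A fixed-size 2D grid also explains the memory and compactness advantages claimed above: storage grows with ground area, not with the number of fused depth samples.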
We present a 3D scanning system using color-encoded structured light that is able to reconstruct a surface from a single image. The work focuses on exploiting the captured data and reconstructing as many points as possible. To this end, a multi-stage method is presented to reconstruct the surface from the captured pattern. It consists of (i) a robust edge detection step, (ii) color decoding using feedback from previous stripes, and (iii) a propagation step to detect errors and propagate detected stripes. By using feedback from neighboring stripes and propagation along the stripe direction during color detection and sequence decoding, the system is able to recover areas where regular approaches fail, such as small bridges in geometry, and to detect and correct false classifications. Depending on surface complexity, we were able to achieve a significant increase in the number of reconstructable points.
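Color-coded stripe patterns are commonly built from a de Bruijn sequence, so that the colors of any `n` adjacent stripes identify their position in the pattern; this is the kind of lookup that sequence decoding relies on. The abstract does not say which code the authors use, so this is a minimal sketch assuming a de Bruijn pattern with `k` colors and window length `n`; the function names are hypothetical. Note that this plain sequence can give adjacent stripes the same color (e.g. `0, 0, 0`), which practical systems usually avoid with extra constraints.

```python
from typing import Dict, List, Tuple

def de_bruijn(k: int, n: int) -> List[int]:
    """De Bruijn sequence B(k, n): every length-n word over k symbols appears
    exactly once as a cyclic window (standard Lyndon-word construction)."""
    a = [0] * (k * n)
    seq: List[int] = []

    def db(t: int, p: int) -> None:
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq

def stripe_pattern(k: int, n: int) -> List[int]:
    """Color index of each projected stripe; the wrap-around makes every window appear linearly."""
    s = de_bruijn(k, n)
    return s + s[:n - 1]

def window_lookup(pattern: List[int], n: int) -> Dict[Tuple[int, ...], int]:
    """Map each window of n consecutive stripe colors to the index of its first stripe."""
    return {tuple(pattern[i:i + n]): i for i in range(len(pattern) - n + 1)}

pattern = stripe_pattern(k=3, n=3)  # 3 colors, windows of 3 stripes
lookup = window_lookup(pattern, 3)
observed = tuple(pattern[10:13])    # colors decoded for three adjacent stripes
print(lookup[observed])             # 10: stripe index recovered from the window
```

A single misclassified color breaks every window that contains it, which is why feedback from neighboring stripes and propagation along stripes, as described above, help recover points that a plain window lookup would lose.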
Active triangulation is a well-established technique for collecting range points. This work performs a photometric analysis of the relative irradiance expected at the camera sensor as a result of the intended operating conditions and device specifications, including laser power. The limiting effects of eye-safety compliance, minimum realizable shutter times and pixel bit depth for linear-response cameras are considered. Quantitative results establish the dynamic range required of the camera, when exposure control is needed, and when the laser return can be expected to produce the brightest pixels in the image.
Efficient methods for extracting urban scene structures from 3D data are important when dealing with the high volume of data collected by mobile terrestrial LiDAR. Rather than searching for primitive shapes directly in the raw 3D data, we demonstrate that the road can be used as a cue for effectively localizing urban scene structures. Road extraction is done by dividing the road into many small sections and using Kalman filtering to track changes in road parameters with a dynamic model. By limiting the search space along the extracted road and using dimensional constraints, near-road structures such as posts and power lines are segmented efficiently. The algorithm performs consistently well on many different city scenes: roads are segmented accurately and efficiently, and posts are extracted even in the presence of other objects such as cars and trees.
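The section-by-section tracking can be sketched with a standard linear Kalman filter. The abstract does not give the dynamic model or which road parameters are tracked, so this minimal sketch assumes a single scalar parameter (for example road width or surface height) that changes at a roughly constant rate between sections; `track_road_parameter` and the noise values `q` and `r` are hypothetical.

```python
from typing import List, Sequence

def track_road_parameter(measurements: Sequence[float], q: float = 1e-3,
                         r: float = 0.05) -> List[float]:
    """Constant-rate Kalman filter over consecutive road sections.

    State x = [value, change per section]; transition F = [[1, 1], [0, 1]];
    only the value is observed (H = [1, 0]). q: process noise added to both
    state components, r: measurement noise variance. Returns filtered values.
    """
    x = [measurements[0], 0.0]
    P = [[1.0, 0.0], [0.0, 1.0]]
    filtered = []
    for z in measurements:
        # Predict: x = F x, P = F P F^T + q I.
        x = [x[0] + x[1], x[1]]
        P = [[P[0][0] + P[0][1] + P[1][0] + P[1][1] + q, P[0][1] + P[1][1]],
             [P[1][0] + P[1][1], P[1][1] + q]]
        # Update with the value measured in this section.
        s = P[0][0] + r                    # innovation variance
        k0, k1 = P[0][0] / s, P[1][0] / s  # Kalman gain
        y = z - x[0]                       # innovation
        x = [x[0] + k0 * y, x[1] + k1 * y]
        P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
             [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
        filtered.append(x[0])
    return filtered

# Hypothetical road width (m) measured per section, slowly widening.
widths = [7.0 + 0.1 * i for i in range(50)]
print(round(track_road_parameter(widths)[-1], 2))  # close to the last value, 11.9
```

The predicted value for the next section also gives a natural search window for that section, which is one way tracking can limit the search space along the road.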
This paper describes a new 3D surface segmentation algorithm that separates a closed surface into regions by computing surface contours that traverse surface ridge structures. We refer to the approach as ridge-walking since these contours tend to follow convex ridge-like structures and/or concave valley-like structures present within the geometry of the model. Segmentation is achieved by solving for closed ridge contours on the surface, each of which divides the surface into two disjoint regions. Results for three variants of this approach are compared: (1) concave ridge walking, (2) convex ridge walking and (3) mixed concave/convex ridge walking. We also compare our results with other leading segmentation methods on standard datasets, as well as on new datasets that pose important and interesting new challenges.
In this paper we present a conformal mapping-based approach for 3D face recognition. The proposed approach uses conformal UV parameterization for mapping and Shape Index decomposition for similarity measurement. Indeed, according to conformal geometry theory, each 3D surface with disk topology can be mapped onto a 2D domain through a global optimization, resulting in a diffeomorphism, i.e., a smooth bijection with a smooth inverse. This allows us to reduce the 3D surface matching problem to a 2D image matching problem by comparing the corresponding 2D conformal geometric maps. To deal with facial expressions, a Möbius transformation of the UV conformal space is used to 'compress' the facial mimic region. The rasterized images are used as input to the (2D)²PCA recognition algorithm.
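For reference, the Möbius transformations that map the unit disk onto itself have a simple closed form. This sketch assumes the UV domain is the unit disk and represents UV points as complex numbers; the parameter `a`, the angle `theta` and the function names are hypothetical, and how the authors choose the transformation to compress the mimic region is not specified in the abstract. The local scale factor shows the effect: neighborhoods of `a` are enlarged and the opposite side of the disk shrinks.

```python
import cmath

def mobius_disk(z: complex, a: complex, theta: float = 0.0) -> complex:
    """Disk automorphism f(z) = exp(i*theta) * (z - a) / (1 - conj(a) * z), |a| < 1.

    Maps the unit disk conformally onto itself and sends a to the center.
    """
    if abs(a) >= 1:
        raise ValueError("a must lie strictly inside the unit disk")
    return cmath.exp(1j * theta) * (z - a) / (1 - a.conjugate() * z)

def local_scale(z: complex, a: complex) -> float:
    """|f'(z)| = (1 - |a|^2) / |1 - conj(a) * z|^2: >1 magnifies, <1 compresses."""
    return (1 - abs(a) ** 2) / abs(1 - a.conjugate() * z) ** 2

a = 0.5 + 0j
print(local_scale(a, a))   # about 1.33: the neighborhood of a is enlarged
print(local_scale(-a, a))  # 0.48: the opposite side of the disk is compressed
```

Because the map is conformal, it redistributes area while preserving angles, so the remapped UV image stays a valid conformal parameterization of the same face.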
The 3D shape of the face has been shown to be a viable and robust biometric for security applications. Many state-of-the-art techniques use Iterative Closest Point (ICP) for 3D face matching. We propose and explore several optimizations of the ICP-based matching technique relating to the processing of multiple regions and the fusion of the region matching scores obtained from ICP alignment. The optimizations explored include: (i) the symmetric use of probe and gallery face regions as ICP’s model and data shapes, enabling score fusion; (ii) gallery and probe region matching score normalization; (iii) region selection based on the face data centroid rather than the nose tip; and (iv) region weighting. As a result of these optimizations, the rank-one recognition rate for a canonical matching experiment improved from 96.4% to 98.6%, and the True Accept Rate (TAR) at 0.1% False Accept Rate (FAR) improved from 90.4% to 98.5%.
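The symmetric matching in (i) and the score-level optimizations (ii) and (iv) can be sketched as follows. The abstract does not say which normalization or weights were used, so this minimal sketch assumes an average of the two ICP errors, z-score normalization of each region's error over the gallery, and a weighted sum; all names, region labels and numbers are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence

def symmetric_error(err_probe_to_gallery: float, err_gallery_to_probe: float) -> float:
    """Average the ICP errors obtained with the two shapes swapped as model/data."""
    return 0.5 * (err_probe_to_gallery + err_gallery_to_probe)

def z_normalize(scores: Sequence[float]) -> List[float]:
    mu, sd = mean(scores), pstdev(scores)
    return [0.0] * len(scores) if sd == 0 else [(s - mu) / sd for s in scores]

def fuse_regions(region_errors: Dict[str, Sequence[float]],
                 weights: Dict[str, float]) -> List[float]:
    """region_errors[r][g]: error of region r against gallery entry g (lower is better).
    Each region is z-normalized over the gallery, then combined by a weighted sum."""
    n_gallery = len(next(iter(region_errors.values())))
    fused = [0.0] * n_gallery
    for region, errors in region_errors.items():
        for g, z in enumerate(z_normalize(errors)):
            fused[g] += weights.get(region, 1.0) * z
    return fused

def rank_one(fused: Sequence[float]) -> int:
    """Index of the best-matching gallery entry (smallest fused error)."""
    return min(range(len(fused)), key=lambda g: fused[g])

errors = {"nose": [0.5, 1.0, 1.5], "eyes": [2.0, 0.9, 1.1]}
print(rank_one(fuse_regions(errors, {})))                          # 1 with equal weights
print(rank_one(fuse_regions(errors, {"nose": 2.0, "eyes": 1.0})))  # 0 with the nose weighted up
```

Normalizing first matters because raw ICP errors of different regions live on different scales; without it, the region with the largest errors would dominate the sum regardless of its weight.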
Automated 3D modeling of building interiors is useful in applications such as virtual reality and entertainment. Using a human-operated backpack system equipped with 2D laser scanners and inertial measurement units, we develop four scan-matching-based algorithms to localize the backpack and compare their performance and trade-offs. We present results for two datasets of a 30-meter-long indoor hallway and compare one of the best-performing localization algorithms with a visual-odometry-based method. We find that our scan-matching-based approach achieves comparable or higher accuracy.
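Scan matching between successive 2D laser scans is often built on the iterative closest point (ICP) algorithm, whose inner alignment step has a closed-form solution in 2D. The abstract does not describe the four algorithms, so this is a generic, minimal point-to-point ICP sketch (brute-force nearest neighbors, fixed iteration count, no outlier rejection); it assumes scans are lists of (x, y) points, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def rigid_align_2d(src: Sequence[Point], dst: Sequence[Point]) -> Tuple[float, float, float]:
    """Least-squares rotation theta and translation (tx, ty) mapping src[i] onto dst[i]."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    num = den = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        px, py, qx, qy = px - sx, py - sy, qx - dx, qy - dy
        num += px * qy - py * qx  # sum of cross products
        den += px * qx + py * qy  # sum of dot products
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def icp_2d(src: Sequence[Point], dst: Sequence[Point], iters: int = 20) -> Tuple[float, float, float]:
    """Point-to-point ICP with brute-force nearest neighbors; returns the total (theta, tx, ty)."""
    theta, tx, ty = 0.0, 0.0, 0.0
    cur: List[Point] = list(src)
    for _ in range(iters):
        matches = [min(dst, key=lambda q: (q[0] - x) ** 2 + (q[1] - y) ** 2) for x, y in cur]
        d_theta, d_tx, d_ty = rigid_align_2d(cur, matches)
        c, s = math.cos(d_theta), math.sin(d_theta)
        cur = [(c * x - s * y + d_tx, s * x + c * y + d_ty) for x, y in cur]
        # Compose with the previous estimate: R_d (R p + t) + t_d.
        tx, ty = c * tx - s * ty + d_tx, s * tx + c * ty + d_ty
        theta += d_theta
    return theta, tx, ty

# A small synthetic scan moved by a known rigid motion.
scan = [(float(i), float(j)) for i in range(3) for j in range(3)] + [(3.0, 0.5)]
c0, s0 = math.cos(0.02), math.sin(0.02)
moved = [(c0 * x - s0 * y + 0.05, s0 * x + c0 * y - 0.03) for x, y in scan]
print([round(v, 6) for v in icp_2d(scan, moved)])  # [0.02, 0.05, -0.03]
```

Chaining such pairwise estimates over a whole walk accumulates drift, which is why localization methods for backpack systems typically combine scan matching with other sensors such as the IMUs mentioned above.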
In video surveillance systems dealing with dynamic complex scenes, processing the information coming from multiple cameras and fusing it into a comprehensible representation of the environment is a challenging task. This work addresses the issue of providing a global and reliable representation of the monitored environment, aiming to enhance perception and minimize the operator's effort. The proposed system, Virtu4D, is based on 3D computer vision and virtual reality techniques and benefits from both the "real" and the "virtual" worlds, offering a unique perception of the scene. This paper presents a short overview of the framework along with the different components of the design space: Video Model Layout, Video Processing and Immersive Model Generation. The final interface places the 2D information in the 3D context and also offers a complete 3D representation of the dynamic environment, allowing free and intuitive 3D navigation.
We present a technique for storing the sparse data that often occurs when processing three-dimensional medical images. The technique uses raster scan order to store the one-dimensional volume indexes of each pixel location, and stores an inverted copy of these indexes for fast lookup. The inverted index is stored as a Judy array, which is shown to be highly efficient in lookup time while using very little memory compared to hash tables. We demonstrate the efficiency of the data structure by performing partial volume segmentation and digital removal of oral contrast agent in CT Colonography (CTC).
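The layout can be sketched as parallel arrays of linear (raster-scan) indexes and values, plus an inverted map from linear index to array position for random access. This minimal sketch assumes x varies fastest in the raster order; `SparseVolume` and its methods are hypothetical, and because Python has no built-in Judy array, a dict (itself a hash table) is substituted. It therefore illustrates the indexing scheme, not the memory and lookup-time advantages reported.

```python
from typing import Dict, List, Optional, Tuple

class SparseVolume:
    """Values stored only at selected voxels of an nx x ny x nz volume.

    indexes:  linear (raster-scan) indexes of the stored voxels, in increasing order.
    values:   the value of each stored voxel, in the same order.
    inverted: linear index -> position in the arrays above. A Python dict
              (a hash table) stands in here for the Judy array of the paper.
    """

    def __init__(self, shape: Tuple[int, int, int]) -> None:
        self.nx, self.ny, self.nz = shape
        self.indexes: List[int] = []
        self.values: List[float] = []
        self.inverted: Dict[int, int] = {}

    def linear_index(self, x: int, y: int, z: int) -> int:
        return x + self.nx * (y + self.ny * z)

    def coords(self, idx: int) -> Tuple[int, int, int]:
        return idx % self.nx, (idx // self.nx) % self.ny, idx // (self.nx * self.ny)

    @classmethod
    def from_voxels(cls, shape: Tuple[int, int, int],
                    voxels: Dict[Tuple[int, int, int], float]) -> "SparseVolume":
        vol = cls(shape)
        # Sorting by linear index gives raster-scan order (x fastest, then y, then z).
        for idx, val in sorted((vol.linear_index(*xyz), v) for xyz, v in voxels.items()):
            vol.inverted[idx] = len(vol.indexes)
            vol.indexes.append(idx)
            vol.values.append(val)
        return vol

    def get(self, x: int, y: int, z: int) -> Optional[float]:
        slot = self.inverted.get(self.linear_index(x, y, z))
        return None if slot is None else self.values[slot]

vol = SparseVolume.from_voxels((4, 3, 2), {(1, 2, 1): 5.0, (0, 0, 0): 1.0, (3, 0, 1): 2.0})
print(vol.indexes)       # [0, 15, 21]
print(vol.get(3, 0, 1))  # 2.0
print(vol.get(2, 2, 0))  # None
```

Keeping the indexes sorted allows fast sequential sweeps in raster order, while the inverted map answers "is this voxel stored?" for neighborhood lookups such as the voxel one slice up, at linear index `idx + nx * ny`.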