Building the World* in Six Days *(As Captured by the Yahoo 100M Image Dataset) (CVPR 2015)


We propose a novel, large-scale, structure-from-motion framework that advances the state of the art in data scalability from city-scale modeling (millions of images) to world-scale modeling (several tens of millions of images) using just a single computer. The main enabling technology is the use of a streaming-based framework for connected component discovery. Moreover, our system employs an adaptive, online, iconic image clustering approach based on an augmented bag-of-words representation, in order to balance the goals of registration, comprehensiveness, and data compactness. We demonstrate our proposal by operating on a recent publicly available 100 million image crowd-sourced photo collection containing images geographically distributed throughout the entire world. Results illustrate that our streaming-based approach does not compromise model completeness, but achieves unprecedented levels of efficiency and scalability. [pdf]
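As a rough illustration of connected component discovery over a stream, the sketch below merges images into components as verified matches arrive one at a time. It is a generic union-find, not the system's actual data structure; the class name `StreamingComponents` and its methods are hypothetical.

```python
class StreamingComponents:
    """Union-find over image ids; matches are consumed one at a time (illustrative only)."""

    def __init__(self):
        self.parent = {}

    def find(self, image_id):
        # Unseen images are registered lazily as their own component.
        self.parent.setdefault(image_id, image_id)
        root = image_id
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        node = image_id
        while self.parent[node] != root:
            next_node = self.parent[node]
            self.parent[node] = root
            node = next_node
        return root

    def add_match(self, image_a, image_b):
        # A verified match between two images merges their components.
        root_a, root_b = self.find(image_a), self.find(image_b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def components(self):
        groups = {}
        for image_id in list(self.parent):
            groups.setdefault(self.find(image_id), set()).add(image_id)
        return list(groups.values())


stream = StreamingComponents()
for a, b in [("img1", "img2"), ("img3", "img4"), ("img2", "img3"), ("img5", "img5")]:
    stream.add_match(a, b)
```

Each match is processed once and then discarded, which is what makes this kind of bookkeeping compatible with a single pass over the data.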

True Internet Scale Photo Collection Reconstruction in less than one day on a single Computer

We introduce an approach for dense 3D reconstruction from unregistered Internet-scale photo collections with about 3 million images within the span of a day on a single PC (“cloudless”). Our method advances image clustering, stereo, stereo fusion and structure-from-motion to achieve high computational performance. We leverage geometric and appearance constraints to obtain a highly parallel implementation on modern graphics processors and multi-core architectures. This leads to two orders of magnitude higher performance on an order of magnitude larger dataset than competing state-of-the-art approaches. For more details please see the project webpage.

Compromising Reflections

We investigate the implications of the ubiquity of personal mobile devices and reveal new techniques for compromising the privacy of users typing on virtual keyboards. Specifically, we show that so-called compromising reflections (in, for example, a victim’s sunglasses) of a device’s screen are sufficient to enable automated reconstruction, from video, of text typed on a virtual keyboard. Through the use of advanced computer vision and machine learning techniques, we are able to operate under extremely realistic threat models, in real-world operating conditions, which are far beyond the range of more traditional OCR-based attacks. In particular, our system does not require expensive and bulky telescopic lenses: rather, we make use of off-the-shelf, handheld video cameras. In addition, we make no limiting assumptions about the motion of the phone or of the camera, nor the typing style of the user, and are able to reconstruct accurate transcripts of recorded input, even when using footage captured in challenging environments (e.g., on a moving bus). To further underscore the extent of this threat, our system is able to achieve accurate results even at very large distances: up to 61 m for direct surveillance, and 12 m for sunglass reflections. We believe these results highlight the importance of adjusting privacy expectations in response to emerging technologies. More details can be found on the project webpage.


Real-time Urban 3D Reconstruction

This research aims at developing a system for automatic, geo-registered, real-time 3D reconstruction from video of urban scenes. From 2005 to 2007 we developed a system that collects video streams, as well as GPS and inertial measurements, in order to place the reconstructed models in geo-registered coordinates. It is designed using current state-of-the-art real-time modules throughout. Our system extends existing algorithms to meet the robustness and variability necessary to operate out of the lab. To account for the large dynamic range of outdoor videos, the processing pipeline estimates global camera gain changes in the feature tracking stage and efficiently compensates for these in stereo estimation without impacting the real-time performance. The required accuracy for many applications is achieved with a two-step stereo reconstruction process exploiting redundancy across frames. More details can be found here. Our real-time stereo estimation code in CUDA can be found here.
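The idea of estimating a global gain change from tracked features can be sketched as follows. This is a minimal illustration, assuming a purely multiplicative gain model and a least-squares fit; the function names `estimate_gain` and `compensate` are hypothetical and do not come from the released code.

```python
def estimate_gain(prev_intensities, curr_intensities):
    """Least-squares gain g minimizing sum((curr - g * prev)^2) over tracked features."""
    numerator = sum(p * c for p, c in zip(prev_intensities, curr_intensities))
    denominator = sum(p * p for p in prev_intensities)
    if denominator == 0:
        raise ValueError("previous intensities are all zero; gain is undefined")
    return numerator / denominator


def compensate(pixels, gain):
    """Undo the estimated gain so both frames share one brightness scale."""
    return [p / gain for p in pixels]


# Intensities sampled at the same tracked features in two consecutive frames.
prev_frame = [10.0, 50.0, 120.0, 200.0]
curr_frame = [12.0, 60.0, 144.0, 240.0]  # same features, 20% brighter

gain = estimate_gain(prev_frame, curr_frame)
normalized = compensate(curr_frame, gain)
```

Because the gain is a single scalar per frame, it can be applied during stereo matching at negligible cost, which is consistent with the real-time requirement described above.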


3D reconstruction of landmarks from photo collections

This research aims at the 3D modeling of landmark sites such as the Statue of Liberty based on large-scale contaminated image collections gathered from the Internet. Our system combines 2D appearance and 3D geometric constraints to efficiently extract scene summaries, build 3D models, and recognize instances of the landmark in new test images. We start by clustering images using low-dimensional global “gist” descriptors. Next, we perform geometric verification to retain only the clusters whose images share a common 3D structure. Each valid cluster is then represented by a single iconic view, and geometric relationships between iconic views are captured by an iconic scene graph. In addition to serving as a compact scene summary, this graph is used to guide structure from motion to efficiently produce 3D models of the different aspects of the landmark. The set of iconic images is also used for recognition, i.e., determining whether new test images contain the landmark. Results on three data sets consisting of tens of thousands of images demonstrate the potential of the proposed approach. Updated results can be found here.
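The step from verified clusters to iconic views can be sketched as below. This is only an illustration under stated assumptions: the selection rule (pick the image with the most verified inliers to the rest of its cluster), the threshold `min_inliers`, and the function name `select_iconic` are all hypothetical choices, not details confirmed by the text above.

```python
def select_iconic(cluster, inlier_counts, min_inliers=18):
    """Return the image with the most verified inliers, or None if nothing verifies.

    cluster:       list of image ids grouped by appearance (e.g. gist descriptors)
    inlier_counts: dict mapping frozenset({a, b}) to the geometric inlier count
    min_inliers:   hypothetical threshold for a pair to count as verified
    """
    best_image, best_score = None, 0
    for image in cluster:
        score = 0
        for other in cluster:
            if other == image:
                continue
            count = inlier_counts.get(frozenset((image, other)), 0)
            if count >= min_inliers:
                score += count
        if score > best_score:
            best_image, best_score = image, score
    return best_image


inliers = {
    frozenset(("a", "b")): 40,
    frozenset(("a", "c")): 30,
    frozenset(("b", "c")): 5,   # below threshold, treated as unverified
    frozenset(("x", "y")): 3,
}
iconic = select_iconic(["a", "b", "c"], inliers)
rejected = select_iconic(["x", "y"], inliers)
```

A cluster that returns `None` has no geometrically consistent pair and would be discarded as contamination in this sketch.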

Camera motion estimation from uncalibrated videos

The 3D reconstruction system above uses the camera poses measured by the GPS/INS system. In this effort we want to estimate the same 3D object representation from uncalibrated image or video sequences. These image sequences can be acquired by video cameras, photo cameras or digital cameras without the need for any camera calibration. More details about the reconstructions from uncalibrated video are shown here, and my work on sensor-augmented camera calibration is discussed here.

Tracking and calibration for multi-camera systems

An inherent limitation of camera pose estimation from video is that it is not possible to determine the scale of the reconstruction. This ambiguity is exploited in special-effects animation, where scenes are built at a much smaller scale for the effect capture. In 3D reconstruction from video it poses a problem, which we address by using multi-camera systems. My first work in this area assumed overlapping fields of view for the cameras of the multi-camera system. The newer work removed this requirement, so the cameras no longer need to overlap in view. Using multi-camera systems, we were able to show that the scene scale can be recovered. The paper can be found here.
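The scale ambiguity can be checked numerically: scaling the scene and the camera positions by the same factor leaves every image measurement unchanged. The sketch below assumes an idealized pinhole camera with identity rotation; the focal length and the sample points are arbitrary illustrative values.

```python
def project(point, camera_center, focal=500.0):
    """Pinhole projection for a camera with identity rotation looking down +z."""
    x, y, z = (p - c for p, c in zip(point, camera_center))
    return (focal * x / z, focal * y / z)


def scaled(vector, factor):
    return tuple(factor * v for v in vector)


points = [(1.0, 2.0, 10.0), (-3.0, 0.5, 8.0)]
cameras = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
factor = 0.01  # a miniature version of the same scene

# Largest pixel difference between the full-size and the miniature setup.
max_difference = max(
    abs(u - v)
    for camera in cameras
    for point in points
    for u, v in zip(
        project(point, camera),
        project(scaled(point, factor), scaled(camera, factor)),
    )
)
```

With a multi-camera rig, the known metric distance between the cameras is not free to scale along with the scene, which is what allows the absolute scale to be recovered.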

GPU accelerated 2D tracking and matching

Computing the 2D correspondences between the consecutive frames of a video or between images is a fundamental step in camera pose estimation. We work on this problem in multiple ways. We implemented the well-known KLT tracker on the GPU, which provides a speedup of more than a factor of 20 and achieves real-time processing at more than 30 Hz. The code is publicly available on Sudipta's web page. Registering individual images taken by a camera often requires more robust matching, and SIFT matching is commonly used in these cases. Since SIFT matching is computationally expensive, we used the GPU to bring the processing rate to 10 Hz. The code for the GPU-based SIFT detector can be found on Changchang’s web page. In addition to accelerating 2D tracking, we also improved the robustness of KLT tracking by combining the camera response function estimation with the tracking. The details can be found here.
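The core of a KLT-style tracker is an iterative least-squares estimate of a patch's translation. The CPU sketch below illustrates that update on a synthetic, analytically defined image so that sub-pixel sampling needs no interpolation; it is not the GPU implementation referenced above, and `image0`, `track_patch`, and the patch parameters are hypothetical.

```python
import math


def image0(x, y):
    """Synthetic textured image, defined at real-valued coordinates."""
    return 100.0 + 20.0 * math.sin(0.3 * x) + 20.0 * math.cos(0.25 * y) + 0.05 * x * y


def shifted(image, dx, dy):
    """Second frame: the same content moved by (dx, dy)."""
    return lambda x, y: image(x - dx, y - dy)


def track_patch(img0, img1, cx, cy, half=7, iterations=10):
    """Estimate the translation of the patch centred at (cx, cy) from img0 to img1."""
    pts = [(cx + i, cy + j) for i in range(-half, half + 1) for j in range(-half, half + 1)]
    # Image gradients of the first frame by central differences.
    grads = [((img0(x + 1, y) - img0(x - 1, y)) / 2.0,
              (img0(x, y + 1) - img0(x, y - 1)) / 2.0) for x, y in pts]
    gxx = sum(gx * gx for gx, _ in grads)
    gxy = sum(gx * gy for gx, gy in grads)
    gyy = sum(gy * gy for _, gy in grads)
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        raise ValueError("patch has too little texture to track")
    dx = dy = 0.0
    for _ in range(iterations):
        bx = by = 0.0
        for (x, y), (gx, gy) in zip(pts, grads):
            error = img0(x, y) - img1(x + dx, y + dy)
            bx += gx * error
            by += gy * error
        # Solve the 2x2 normal equations for the update step.
        dx += (gyy * bx - gxy * by) / det
        dy += (gxx * by - gxy * bx) / det
    return dx, dy


true_shift = (0.4, -0.3)
frame1 = shifted(image0, *true_shift)
estimated = track_patch(image0, frame1, cx=20, cy=20)
```

Each feature is tracked independently with the same small computation, which is why this kind of tracker maps well onto GPU parallelism.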

Marker-less Augmented Reality

The subject of augmented reality is to insert virtual objects into real scenes. We developed a system for high-quality marker-less augmented reality with realistic direct illumination of the virtual objects. The lights of the scene are localized and used for direct illumination of the virtual object placed in the scene. Our method leaves the real scene untouched, overcoming the limitation of many systems that require markers or additional equipment in the scene to reconstruct illumination. For more details look here.

Differential Camera Tracking

Camera-based tracking methods typically rely on a constant scene appearance over all viewpoints. This fundamental assumption is violated in complex environments containing reflections and semi-transparent surfaces. We developed a passive optical camera tracking method that works in these very complex environments, which include curved mirrors and semi-transparent surfaces. More details can be found here.

Synthetic Illumination of Real Object

We develop a system to capture the illumination from a virtual environment map. However, the system has the drawback of a complex calibration procedure that is limited to planar screens. We propose a simple calibration procedure using a reflective calibration object that is able to deal with arbitrary screen geometries. Our calibration procedure is not limited to our application and can be used to calibrate most camera-projector systems.

Camera based Natural User Interaction

Computer vision provides a powerful tool to track users in interactive environments with cameras. This has the advantage of enabling natural interaction without the need to mount any sensors on the user. We developed a system that uses cameras for view control in a virtual reality system. More details can be found here.


We thank our sponsors for their support in the various projects: