BigSFM: Reconstructing the World from Internet Photos
Our group is working on building a 3D model of the world from online photo collections, and our research spans several areas, from image features, to large-scale image matching, to structure-from-motion optimization, to applications such as location recognition. This page summarizes our work, with links to the code and datasets we have made available and a description of each project.
Many more Internet photo collection SfM datasets are available on the 1DSfM page.
Structure from Motion Disambiguation: see the sfm-disambig project page for a range of SfM problems on datasets with symmetries and other repeated structures.
Synopsis: We present a simple, effective method for solving structure from motion problems by averaging epipolar geometries. Based on recent successes in solving for global camera rotations using averaging schemes, we focus on the problem of solving for 3D camera translations given a network of noisy pairwise camera translation directions (or 3D point observations). To do this well, we have two main insights. First, we propose a method for removing outliers from problem instances by solving simpler low-dimensional subproblems, which we refer to as 1DSfM problems. Second, we present a simple, principled averaging scheme. We demonstrate this new method in the wild on Internet photo collections.
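The outlier-removal idea can be illustrated with a much-simplified sketch. The actual 1DSfM method solves each 1D subproblem as a combinatorial ordering problem (minimum feedback arc set); the greedy sign-scoring below, and all function and parameter names, are our own simplification. The idea: project the pairwise translation directions onto random 1D axes, embed the cameras along each axis, and flag edges that disagree with many of the resulting orderings.

```python
import numpy as np

def one_dim_outliers(edges, n_nodes, n_proj=20, thresh=0.5, seed=0):
    """Flag pairwise translation directions inconsistent with many 1D
    (projected) ordering subproblems.  A much-simplified sketch of the
    1DSfM idea: here each 1D instance is 'solved' by a cheap greedy
    sign-scoring heuristic rather than minimum feedback arc set.

    edges: dict mapping (i, j) -> unit 3D translation direction t_ij.
    """
    rng = np.random.default_rng(seed)
    violations = {e: 0 for e in edges}
    for _ in range(n_proj):
        d = rng.normal(size=3)
        d /= np.linalg.norm(d)
        # Each edge (i, j) votes that camera j lies after camera i
        # along the axis d iff t_ij . d > 0.
        score = np.zeros(n_nodes)
        for (i, j), t in edges.items():
            s = np.sign(t @ d)
            score[j] += s
            score[i] -= s
        order = np.argsort(score)          # greedy 1D embedding
        pos = np.empty(n_nodes)
        pos[order] = np.arange(n_nodes)
        # Count edges that contradict this 1D ordering.
        for (i, j), t in edges.items():
            if np.sign(t @ d) * (pos[j] - pos[i]) < 0:
                violations[(i, j)] += 1
    return {e for e, v in violations.items() if v / n_proj > thresh}
```

Edges that contradict most of the 1D orderings are likely outliers and can be dropped before running the translation averaging itself.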
Synopsis: Repeated features are common in urban scenes. Many objects, such as clock towers with nearly identical sides, or domes with strong radial symmetries, pose challenges for structure from motion. When similar but distinct features are mistakenly equated, the resulting 3D reconstructions can have errors ranging from phantom walls and superimposed structures to a complete failure to reconstruct. We present a new approach to resolving these ambiguities based on the local visibility structure of such repeated features. Drawing upon network theory, we present a new way of scoring features using a measure of local clustering. Our model leads to a simple, fast, and highly scalable technique for disambiguating repeated features based on an analysis of an underlying visibility graph, without relying on explicit geometric reasoning. We demonstrate our method on several very large datasets drawn from Internet photo collections, and compare it to a more traditional geometry-based disambiguation technique.
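The visibility-graph idea can be illustrated with a toy sketch (our own simplified data model, not the paper's exact scoring function): score each 3D track by the local clustering of the images that observe it in an image covisibility graph. A track that mistakenly conflates two distinct copies of a repeated structure tends to connect images that are otherwise unrelated, and so receives a low clustering score.

```python
from itertools import combinations

def local_clustering(neighbors, nodes):
    """Fraction of pairs among `nodes` that are directly linked in the
    graph given by `neighbors` (a dict of adjacency sets)."""
    pairs = list(combinations(sorted(nodes), 2))
    if not pairs:
        return 1.0
    linked = sum(1 for a, b in pairs if b in neighbors[a])
    return linked / len(pairs)

def score_tracks(tracks, covisibility):
    """Toy track scoring: tracks whose observing images form a tight
    cluster in the covisibility graph score high; tracks bridging
    unrelated image groups (a symptom of conflated repeated features)
    score low.  `tracks` maps track id -> set of observing images;
    `covisibility` maps image -> set of images it shares points with."""
    return {t: local_clustering(covisibility, imgs)
            for t, imgs in tracks.items()}
```

Low-scoring tracks are candidates for removal before (or instead of) more expensive geometric disambiguation.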
Synopsis: We address the problem of determining where a photo was taken by estimating a full 6-DOF-plus-intrinsics camera pose with respect to a large geo-registered 3D point cloud, bringing together research on image localization, landmark recognition, and 3D pose estimation. Our method scales to datasets with hundreds of thousands of images and tens of millions of 3D points through the use of two new techniques: a co-occurrence prior for RANSAC and bidirectional matching of image features with 3D points. We evaluate our method on several large datasets, and show state-of-the-art results on landmark recognition as well as the ability to locate cameras to within meters, requiring only seconds per query.
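The flavor of a co-occurrence prior for RANSAC can be sketched as follows (a hypothetical simplification; the function name, retry count, and data layout are ours, not the paper's): when drawing a minimal set of 2D-to-3D matches, prefer sets whose 3D points were all observed together in at least one database image, since such sets are more likely to be mutually correct.

```python
import random

def sample_minimal_set(matches, point_images, k=3, rng=random, tries=100):
    """Hedged sketch of a co-occurrence prior for RANSAC sampling.

    matches:       list of (feature_id, point_id) 2D-to-3D matches.
    point_images:  dict mapping point_id -> set of database images
                   in which that 3D point was observed.
    Returns a minimal set of k matches whose points co-occur in some
    database image, falling back to uniform sampling after `tries`."""
    for _ in range(tries):
        cand = rng.sample(matches, k)
        common = set.intersection(*(point_images[p] for _, p in cand))
        if common:
            return cand
    return rng.sample(matches, k)  # fall back to an unbiased sample
```

A full localization pipeline would feed each sampled set to a pose solver and verify inliers; the prior only biases which hypotheses get tried first.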
Synopsis: Many computer vision applications utilize large-scale datasets of places derived from the many billions of photos on the Web. Such applications often require knowledge of the visual connectivity structure of these image collections (describing which images overlap or are otherwise related), and an important step in understanding this structure is to identify connected components of this underlying image graph. As the structure of this graph is often initially unknown, this problem can be posed as one of exploring the connectivity between images as quickly as possible, by intelligently selecting a subset of image pairs for feature matching and geometric verification. We propose a novel, scalable algorithm called MatchMiner that efficiently explores visual relations between images, incorporating ideas from relevance feedback to improve decision making over time, as well as a simple yet effective rank distance measure for detecting outlier images. Using these ideas, our algorithm automatically prioritizes image pairs that can potentially connect or contribute to large connected components, using an information-theoretic algorithm to decide which image pairs to test next.
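One plausible instantiation of a rank distance between two images is sketched below; the exact definition used by MatchMiner may differ, and the data layout is ours. Each image retrieves a ranked list of visually similar images; the distance averages how far down each image's list the other one appears. An outlier image is ranked poorly by almost everyone else, so its average rank distance to all other images is large.

```python
def rank_distance(results, i, j):
    """Hedged sketch of a symmetric rank distance between images i and j.

    results: dict mapping image -> ranked list of retrieved images
    (best match first).  A missing entry counts as the worst rank."""
    def rank(a, b):
        lst = results[a]
        return lst.index(b) if b in lst else len(lst)
    return 0.5 * (rank(i, j) + rank(j, i))
```

Images whose rank distance to every other image is large can be deprioritized or dropped before spending feature-matching effort on them.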
Synopsis: We present a new technique for extracting local features from images of architectural scenes, based on detecting and representing local symmetries. These features are motivated by the fact that local symmetries, at different scales, are a fundamental characteristic of many urban images, and are potentially more invariant to large appearance changes than lower-level features such as SIFT. Our features are based on simple measures of local bilateral and rotational symmetries, used both for feature detection and for computing descriptors. We demonstrate our method on a challenging new dataset containing image pairs exhibiting a range of dramatic variations in lighting, age, and rendering style, and show that our features can improve matching performance for this difficult task.
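A toy version of a bilateral symmetry measure (not the paper's actual detector or descriptor, which operates over scales and multiple symmetry types) simply correlates an image patch with its mirror image:

```python
import numpy as np

def bilateral_symmetry_score(patch):
    """Toy local-symmetry measure: normalized correlation between a 2D
    patch and its left-right mirror; 1.0 means perfectly symmetric.
    A real detector would evaluate this at many positions and scales."""
    a = patch - patch.mean()
    b = np.fliplr(patch) - patch.mean()   # mirroring preserves the mean
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom else 1.0
```

Such a score depends on the spatial arrangement of intensities rather than their raw values, which is one intuition for why symmetry-based features can survive appearance changes (lighting, age, rendering style) that disrupt lower-level gradient features.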
Synopsis: We present a new approach to structure from motion (SfM) based on combining discrete and continuous optimization. SfM involves a complex objective function that is difficult to initialize. Our technique first uses powerful discrete optimization techniques (loopy belief propagation) to get a coarse initial solution, then refines this solution using bundle adjustment. We use this method to produce models that are similar to or better than those produced with standard techniques, but more robustly and in a fraction of the time.
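The initialize-then-refine pattern can be illustrated in one dimension (a toy sketch under our own setup; the real system uses loopy belief propagation over camera parameters and bundle adjustment, not grid search and gradient descent): a coarse discrete search places the solution in the right basin of a non-convex objective, and continuous local refinement then polishes it.

```python
import numpy as np

def discrete_continuous_minimize(f, lo, hi, n_grid=64, iters=100, step=1e-2):
    """Toy 1D discrete-then-continuous optimization.  The discrete stage
    (grid search) plays the role of the coarse combinatorial solver; the
    continuous stage (gradient descent with a numeric gradient) plays
    the role of bundle adjustment."""
    xs = np.linspace(lo, hi, n_grid)
    x = xs[np.argmin([f(v) for v in xs])]       # discrete stage
    for _ in range(iters):                      # continuous stage
        g = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6  # central difference
        x -= step * g
    return x
```

Pure local refinement started from a bad initial guess would converge to whichever local minimum is nearest; the discrete stage is what buys robustness.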
Synopsis: Recent work in structure from motion has demonstrated the possibility of reconstructing geometry from large-scale community photo collections. Bundle adjustment, the joint non-linear refinement of camera and point parameters, is a key component of most SfM systems, and one which can consume a significant amount of time for large problems. We consider the design and implementation of a new inexact Newton-type bundle adjustment algorithm, which uses substantially less time and memory than standard Schur complement based methods, without compromising on the quality of the solution. We explore the use of the Conjugate Gradients algorithm for calculating the Newton step and its performance as a function of some simple and computationally efficient preconditioners.
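For reference, the conjugate gradients inner loop with a simple Jacobi (diagonal) preconditioner looks like this. This is a generic textbook sketch of PCG for a symmetric positive definite system, not the paper's implementation; in bundle adjustment the matrix-vector product A @ p would be applied implicitly from the sparse Jacobian rather than from an explicit matrix.

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradients for an SPD system A x = b.
    M_inv is a callable applying the inverse preconditioner, e.g. a
    Jacobi preconditioner: lambda r: r / np.diag(A)."""
    x = np.zeros_like(b)
    r = b - A @ x               # residual
    z = M_inv(r)                # preconditioned residual
    p = z.copy()                # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

A good preconditioner concentrates the spectrum of M_inv(A), so CG needs far fewer iterations; the paper's contribution is choosing preconditioners that are cheap to build and apply at SfM scale.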
Synopsis: Entering the search term Rome on Flickr returns more than two million photographs. This collection represents an increasingly complete photographic record of the city, capturing every popular site, facade, interior, fountain, sculpture, painting, cafe, and so forth. It also offers us an unprecedented opportunity to richly capture, explore and study the three dimensional shape of the city. We consider the problem of reconstructing entire cities from images harvested from the web. Our aim is to build a parallel distributed system that can match hundreds of thousands of images of a city to find common points, then use this information to compute the 3D structure of the city. All this to be done in a day.
Papers:
Building Rome in a Day [pdf]
Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, and Richard Szeliski. ICCV 2009.
Reconstructing Rome [pdf]
Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Brian Curless, Steven M. Seitz, and Richard Szeliski. IEEE Computer, pp. 40-47, June 2010 (Cover feature).
Building Rome in a Day [CACM page]
Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Communications of the ACM, Vol. 54, No. 10, pp. 105-112, October 2011.
with a Technical Perspective by Prof. Carlo Tomasi