Recognizing Objects by Simultaneously Combining Appearance and Geometry

 

Daniel Huttenlocher, PI

(Completed)

 

 

Project Summary

This project investigates methods that formulate the object recognition problem as a single overall optimization rather than as successive stages of feature detection and matching.  Such feature matching approaches have predominated throughout the history of research in object recognition, and are particularly prevalent in recent work. In contrast, the approach taken here combines bottom-up information about the appearance of local image patches with top-down information about geometric relations between those patches.  The main focus is on recognizing generic classes of objects such as bicycles, people, motorbikes, or cars.  Each object class is modeled as a collection of parts arranged in a deformable configuration, where certain pairs of parts are connected by springs.  Recognition is formulated in terms of energy minimization, where there is a cost for placing each patch at each possible location in the image, and a cost for placing pairs of patches in a manner that stretches the springs connecting them.
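The energy described above can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the project papers: $l_i$ is the image location of part $i$, $m_i(l_i)$ is the cost of placing part $i$ at $l_i$, $C$ is the set of connected part pairs, and $d_{ij}$ is the spring cost between parts $i$ and $j$. The quadratic spring with stiffness $k_{ij}$ and rest offset $o_{ij}$ is one common choice, shown only as an example.

```latex
% Total energy of a configuration L = (l_1, ..., l_n): appearance plus geometry
E(l_1, \dots, l_n) = \sum_{i=1}^{n} m_i(l_i) + \sum_{(i,j) \in C} d_{ij}(l_i, l_j)

% Recognition seeks the configuration of lowest energy
L^{*} = \arg\min_{l_1, \dots, l_n} E(l_1, \dots, l_n)

% One possible spring cost (an assumption): quadratic stretch from a rest offset
d_{ij}(l_i, l_j) = k_{ij} \, \lVert l_j - l_i - o_{ij} \rVert^{2}
```

The first sum is the bottom-up appearance term and the second is the top-down geometric term, so both are optimized together rather than in successive stages.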

Such an energy minimization formulation was proposed in the 1970s under the name Pictorial Structures, but was abandoned because of its computational complexity.  Recent algorithmic advances have made it possible to investigate this kind of approach further.  Initial results on detecting and localizing objects have been promising, but they also show how much remains to be done before the approach can serve as a viable alternative to feature-based object recognition.  This project investigates some of the key initial questions in making that determination, including how to learn such models with minimal supervision and how to incorporate global geometric information such as object scale and orientation into the models.

The approach is based on computing cost maps that determine how well each part matches at each possible location in the image.  These cost maps are then combined in the energy minimization process.  In contrast, traditional feature detection approaches find a small number of locations where each feature or part might be present in the image. While the sparse nature of feature locations may seem to require less computation than working with entire cost maps, the need to handle spurious and missed feature detections in fact makes such feature-based methods quite computationally intensive.
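As a minimal sketch of how dense cost maps might be combined, the following assumes a star-shaped model (one root part connected to each child part) and a quadratic spring cost. The function names, the toy cost maps, and the brute-force minimization are all assumptions made for illustration; this is not the efficient algorithm the project relies on.

```python
def best_child_cost(cost_map, anchor, stiffness):
    """Lowest match cost plus quadratic spring cost over all child locations."""
    ay, ax = anchor
    best = float("inf")
    for y, row in enumerate(cost_map):
        for x, match_cost in enumerate(row):
            spring_cost = stiffness * ((y - ay) ** 2 + (x - ax) ** 2)
            best = min(best, match_cost + spring_cost)
    return best


def star_energy_map(root_map, children):
    """Energy of the best configuration for every possible root location.

    children is a list of (cost_map, (dy, dx), stiffness) tuples, where
    (dy, dx) is the child's ideal offset from the root (hypothetical layout).
    """
    height, width = len(root_map), len(root_map[0])
    energy = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = root_map[y][x]
            for child_map, (dy, dx), stiffness in children:
                total += best_child_cost(child_map, (y + dy, x + dx), stiffness)
            energy[y][x] = total
    return energy


def argmin_2d(grid):
    """Return the (row, column) of the smallest entry in a 2D list."""
    cells = ((y, x) for y in range(len(grid)) for x in range(len(grid[0])))
    return min(cells, key=lambda cell: grid[cell[0]][cell[1]])


# Toy cost maps (made up): lower values mean a better appearance match.
root_map = [[1.0, 1.0, 1.0],
            [0.0, 1.0, 1.0],
            [1.0, 1.0, 1.0]]
child_map = [[5.0, 5.0, 5.0],
             [5.0, 5.0, 0.0],
             [5.0, 5.0, 5.0]]
energy = star_energy_map(root_map, [(child_map, (0, 2), 1.0)])
best_root = argmin_2d(energy)
print(best_root, energy[best_root[0]][best_root[1]])
```

Every location of every part contributes to the result, so no part is committed to a location before the geometry is taken into account. That is the contrast with sparse feature detection drawn above.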

 

Results

The main research focus was on developing weakly supervised learning techniques for the k-fan models introduced in our CVPR 2005 paper. In this training paradigm the only annotation required for learning is the category label of the object(s) in each training image. When a training image contains multiple objects, coarse location information is needed to specify which region of the image corresponds to which object category. We have developed an approach that builds weak initial models and then improves them using EM (Expectation Maximization). The initial models are formed by randomly selecting patches from the training images and building pairwise models, each composed of two patches, that are highly correlated with a given object category. Those pairwise models are then combined into an initial overall spatial model using a greedy search procedure. This work is described in part in papers in ECCV 2006 and CVPR 2007.  The latter paper evaluates our method on four object classes from the PASCAL Visual Object Classes (VOC) Challenge 2006 dataset. This is a challenging dataset, with a wide range of viewpoints, object scales, and scene complexity (e.g., multiple objects or partial occlusion). For the four categories of manmade objects (bicycles, motorbikes, cars, and buses), our method achieves the best object localization performance of any technique. Moreover, other techniques tend to work well for one or two categories but not all four. One key aspect of this investigation is the use of object category models that include parts corresponding to the immediately surrounding scene background. For instance, a car model automatically learns parts corresponding to the road underneath. These scene background, or local context, parts provide a statistically significant increase in performance, raising the accuracy of our method above that of the others on this dataset.
Beyond what is reported in that paper, we have been experimenting with richer part models, particularly those suited to detecting natural objects such as pedestrians and animals.
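The greedy combination of pairwise models into an initial spatial model could look roughly like the sketch below. It is an illustration under stated assumptions, not the procedure from the ECCV 2006 or CVPR 2007 papers: the scores, the part names, the tree-shaped growth rule, and the function `greedy_combine` are all hypothetical, and the EM refinement step is not shown.

```python
def greedy_combine(pair_scores, max_parts):
    """Greedily grow a tree-shaped model from scored pairwise models.

    pair_scores maps (part_a, part_b) tuples to a score, where a higher score
    means the pair is more strongly associated with the object category
    (a hypothetical measure). Returns the chosen pairs in the order added.
    """
    if not pair_scores or max_parts < 2:
        return []
    # Seed the model with the single best-scoring pair.
    seed = max(pair_scores, key=pair_scores.get)
    parts = set(seed)
    edges = [seed]
    while len(parts) < max_parts:
        # A candidate pair attaches exactly one new part to the current model.
        candidates = [pair for pair in pair_scores
                      if len(parts & set(pair)) == 1]
        if not candidates:
            break
        best = max(candidates, key=pair_scores.get)
        parts.update(best)
        edges.append(best)
    return edges


# Made-up scores for illustration only.
scores = {
    ("wheel", "frame"): 0.9,
    ("wheel", "road"): 0.8,
    ("frame", "seat"): 0.7,
    ("seat", "road"): 0.6,
}
print(greedy_combine(scores, max_parts=3))
```

In this toy example a background patch ("road") can enter the model purely because it scores well alongside an object patch, which loosely mirrors how local context parts are described above as being learned automatically.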

 

Relevant Papers

  

Talks

· Object Recognition Without Feature Detection
University of Edinburgh, June 2007
Microsoft Research Cambridge, June 2007
INRIA Rhone-Alpes, June 2007
University of Oxford, May 2007
Kodak Inc., May 2007
Imaging Science Seminar, Rochester Institute of Technology, February 2007
Tech Talk, Google Inc., January 2007
Columbia University, December 2006
Computer Science and Engineering Colloquium, UC San Diego, October 2006
General Electric Corporate Research, August 2006
 

 

Last Updated: December, 2007
