Recognizing Objects by Simultaneously Combining Appearance and Geometry

 

Daniel Huttenlocher, PI

 

 

Project Summary

The goal of this project is to achieve a qualitative improvement in the robustness of object category recognition and localization by formulating the task as a single overall estimation problem rather than as successive stages of feature detection and object detection. In the proposed approach, each object class is modeled as a collection of local patches arranged in a deformable configuration, where certain pairs of patches are joined by spring-like connections. The object localization problem is formulated as energy minimization: there is a cost for placing each patch at each location in the image, and a cost for placing pairs of patches in a manner that stretches the springs connecting them.
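As a rough illustration of the energy formulation described above, the following Python sketch minimizes the sum of patch placement costs and spring costs for a star-shaped model, in which every patch is connected to a single root patch. The function names, the quadratic spring cost, the star topology, and the toy cost tables are all assumptions made for this example; they are not taken from the project's implementation.

```python
from itertools import product


def spring_cost(loc_a, loc_b, rest_offset, stiffness=1.0):
    """Quadratic cost for stretching the spring between two patches.

    The cost is zero when loc_b sits exactly rest_offset away from loc_a.
    (The quadratic form is an assumption chosen for illustration.)
    """
    dx = loc_b[0] - loc_a[0] - rest_offset[0]
    dy = loc_b[1] - loc_a[1] - rest_offset[1]
    return stiffness * (dx * dx + dy * dy)


def localize_star_model(match_cost, rest_offsets, locations):
    """Find the minimum-energy configuration of a star-shaped model.

    match_cost[i][loc] : cost of placing patch i at location loc
    rest_offsets[i]    : ideal offset of patch i from the root (i >= 1)
    Patch 0 is the root. Returns (best_energy, best_configuration).
    """
    best_energy, best_config = float("inf"), None
    for root in locations:
        energy, config = match_cost[0][root], [root]
        # Once the root is fixed, each other patch can be placed
        # independently, because it is connected only to the root.
        for i in range(1, len(match_cost)):
            cost, loc = min(
                (match_cost[i][l] + spring_cost(root, l, rest_offsets[i]), l)
                for l in locations
            )
            energy += cost
            config.append(loc)
        if energy < best_energy:
            best_energy, best_config = energy, config
    return best_energy, best_config


# Toy example on a 3x3 grid of candidate locations (hypothetical costs).
locations = list(product(range(3), range(3)))
root_cost = {loc: 5.0 for loc in locations}
root_cost[(0, 0)] = 0.0            # the root matches best at (0, 0)
part_cost = {loc: 5.0 for loc in locations}
part_cost[(2, 2)] = 0.0            # best appearance match, but far from the root
part_cost[(1, 0)] = 1.0            # slightly worse match, at the ideal offset

energy, config = localize_star_model(
    [root_cost, part_cost], {1: (1, 0)}, locations
)
print(energy, config)
```

In this toy case the second patch is placed at the slightly worse-matching location because the spring cost of reaching the better match outweighs the gain, which is the trade-off the energy is meant to capture. This brute-force version takes time quadratic in the number of candidate locations for each spring; the algorithms in [1] avoid that cost for tree-structured models by using generalized distance transforms, a speedup omitted here for clarity.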

 

This kind of formulation was proposed in the 1970s under the name Pictorial Structures, but was abandoned because of its computational complexity. Recent algorithmic advances have made it possible to pursue this approach as an alternative to traditional methods based on detecting features. The central focus of the proposed work is to use techniques for combining uncertain information to improve the efficiency and accuracy of object recognition. The work aims to improve the ability to detect and localize objects in images. Accurate localization of objects is important for systems that interact with the world; however, most current research activity focuses on classification methods that are better suited to image retrieval problems. The work also aims to develop methods that take advantage of sources of information beyond a single object. While it is well known that scene-level context can help improve recognition, interpreting such context often depends on recognizing objects, and vice versa. Extensions of the Pictorial Structures approach to scene-level context offer particular promise because the methods are designed to directly combine multiple sources of uncertain information without the need for intermediate detection decisions.

 

Results this Year

Work in the first year has focused on two aspects of the overall approach. The first is low-level processing: bringing machine learning techniques and graphical representations from work in object recognition to bear on low-level vision problems. The goal here is to move beyond the hand-tuned models with simple 4-connected grid topologies that are common in low-level vision, and toward richer models that are learned from examples and can be integrated into a single overall framework for object recognition. This work is described in [3,4,5]. The second is high-level object recognition: largely experimental work extending our previous work on Pictorial Structure models [1] and scene context [2] to larger numbers of object categories and to the more challenging PASCAL VOC 2007 dataset. To achieve state-of-the-art precision-recall curves on this data, we had to move beyond single-scale models to multiple spatial scales, taking an approach motivated by the recent work of Felzenszwalb et al. [6], using multi-scale HoG features and a multi-scale extension of the modeling framework we had previously used. We have also been collaborating with Kodak on applying these kinds of flexible template models to recognition in consumer photographs.

Plans for Next Year

We have begun bringing the low-level learning approaches together with the object category recognition techniques. The focus is on learning models for objects composed of HoG-like features that consist of spatially uniform graphical models defined by the image grid, coupled with Pictorial Structure style part-based flexible template models (again represented as graphical models). This combination of low- and high-level recognition techniques is the main planned focus for the second year.

Relevant Papers

[1] P.F. Felzenszwalb and D.P. Huttenlocher. Pictorial Structures for Object Recognition, Intl. Journal of Computer Vision, 61(1), pp. 55-79, January 2005.
[2] D. Crandall and D.P. Huttenlocher. Composite Models of Objects and Scenes for Category Recognition, Proceedings of IEEE CVPR, 2007.

[3] Y. Li and D.P. Huttenlocher.  Learning for Stereo Vision Using the Structured Support Vector Machine, CVPR 2008.
[4] Y. Li and D.P. Huttenlocher.  Sparse Long-Range Random Field and its Application to Image Denoising, to appear in ECCV 2008.
[5] Y. Li and D.P. Huttenlocher. Learning for Optical Flow using Stochastic Optimization, to appear in ECCV 2008.    

Cited References

[6] P.F. Felzenszwalb, D. McAllester and D. Ramanan.  A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR 2008.

 

Last Updated: July, 2008
