Towards Total Scene Understanding:
Classification, Annotation and Segmentation in an Automatic Framework

 

  • Download codes & data
  • Citation: Li-Jia Li, Richard Socher and Li Fei-Fei. Towards Total Scene Understanding:Classification, Annotation and Segmentation in an Automatic Framework. Computer Vision and Pattern Recognition (CVPR) 2009
Abstract
Given an image, we propose a hierarchical generative model that classifies the overall scene, recognizes and segments each object component, as well as annotates the image with a list of tags. To our knowledge, this is the first model that performs all three tasks in one coherent framework. For instance, a scene of a ‘polo game’ consists of several visual objects such as ‘human’, ‘horse’, ‘grass’, etc. In addition, it can be further annotated with a list of more abstract (e.g. ‘dusk’) or visually less salient (e.g. ‘saddle’) tags. Our generative model jointly explains images through a visual model and a textual model. Visually relevant objects are represented by regions and patches, while visually irrelevant textual annotations are influenced directly by the overall scene class. We propose a fully automatic learning framework that is able to learn robust scene models from noisy web data such as images and user tags from Flickr.com. We demonstrate the effectiveness of our framework by automatically classifying, annotating and segmenting images from eight classes depicting sport scenes. In all three tasks, our model significantly outperforms state-of- the-art algorithms.
Coherent Model back to top
Our model can recognize and segment multiple objects as well as classify scenes in one coherent framework.
Automatic Framework back to top
We propose a framework for automatic learning from Internet images and tags (i.e. flickr.com), hence offering a scalable approach with no additional human labor.
Evaluation back to top
Classification back to top
Segmentation & Annotation back to top
Segmentation & Annotation without top-down information
Segmentation & Annotation with top-down information
P(object|scene)

The comparison between the results in the first two columns underscores the effectiveness of the contextual facilitation by the top-down classification on the annotation and segmentation tasks.

Resources back to top

[1] Li-Jia Li, Richard Socher and Li Fei-Fei. Towards Total Scene Understanding:Classification, Annotation and Segmentation in an Automatic Framework. Computer Vision and Pattern Recognition (CVPR) 2009. PDF

Selected References back to top

[2] Li-Jia Li and Li Fei-Fei. What, where and who? Classifying event by scene and object recognition . IEEE Intern. Conf. in Computer Vision (ICCV). 2007. PDF

[3] Li Fei-Fei and Pietro Perona. A Bayesian hierarchy model for learning natural scene categories. Computer Vision and Pattern Recognition (CVPR), 2005.

[4] Liangliang Cao and Li Fei-Fei. Spatially coherent latent topic model for concurrent object segmentation and classification. IEEE Intern. Conf. in Computer Vision (ICCV). 2007.

[5] David Blei and Micheal Jordan. Modeling annotated data. ACM SIGIR Conference on Research and Development in Information Retrieval, 2003.

Last modified April 2009 by Li-Jia Li