Pocket Guide

A VLPR2009 Pocket Guide is available. Please click here to download.


Visa Information

For those who need a visa to come to China, we suggest applying for a tourist visa. The following visa agency is recommended: http://www.visarite.com.

If necessary, you can list Prof. Cheng-Lin Liu as your contact in China.
        Name: Cheng-Lin Liu
     Address: 95 Zhongguancun East Road, Beijing 100190
            Tel: +86-10-6255-8820
        Email: liucl@nlpr.ia.ac.cn


Course Projects

The idea of the course projects is to provide more opportunities for faculty-student interaction as well as more in-depth discussion of specific topics. You are not required to complete a full system or to compete with other groups. The projects will be done in small groups. On the last afternoon of the summer school (Monday, July 27), students will present their course project work. The course projects are listed below.

Project 01

Title: Classify images in a hierarchy with uncertainty
Faculty: Alex Berg

Description: Given an image, determine where it fits in the WordNet hierarchy. The result should specify uncertainty. For instance, a picture of an unknown furry animal might be somewhere under the animal part of the hierarchy, but probably not in the reptile part.

This could be addressed either in terms of concrete computer vision features, or in terms of more abstract features, but there should be some mathematical formalism in the approach.
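
One possible formalism, given here only as a minimal sketch, is to sum leaf-class probabilities up the tree, so that the score of an internal node is the probability that the image belongs anywhere beneath it. The toy hierarchy, the probabilities, and the function names below are invented for illustration and are not taken from WordNet or from the project material.

```python
# Toy hierarchy (hypothetical, not taken from WordNet): child -> parent.
PARENT = {
    "animal": None,
    "mammal": "animal",
    "reptile": "animal",
    "cat": "mammal",
    "raccoon": "mammal",
    "lizard": "reptile",
}


def subtree_probabilities(leaf_probs):
    """Add each leaf probability to the leaf itself and to all its ancestors."""
    totals = {node: 0.0 for node in PARENT}
    for leaf, p in leaf_probs.items():
        node = leaf
        while node is not None:
            totals[node] += p
            node = PARENT[node]
    return totals


def depth(node):
    """Number of edges between a node and the root."""
    d = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        d += 1
    return d


def most_specific_confident_node(leaf_probs, threshold=0.8):
    """Deepest node whose subtree probability reaches the threshold."""
    totals = subtree_probabilities(leaf_probs)
    confident = [n for n, p in totals.items() if p >= threshold]
    return max(confident, key=depth)


# An "unknown furry animal": the classifier cannot decide between cat and raccoon.
leaf_probs = {"cat": 0.45, "raccoon": 0.45, "lizard": 0.10}
totals = subtree_probabilities(leaf_probs)
best = most_specific_confident_node(leaf_probs)
print(best, totals["mammal"], totals["reptile"])
```

In this toy case no single leaf is confident, but the "mammal" node collects most of the probability mass while "reptile" stays low, which is the kind of answer the description asks for.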
 

Project 02

Title: Image re-ranking
Faculty: Tamara Berg

Description: Given a set of web pages returned for an object query, re-rank the images contained on these pages so that images at the top of the ranking depict the object while those at the bottom of the ranking do not. You should perform your re-ranking using information extracted from the images themselves in combination with information extracted from the surrounding text. Explore various methods for combining word and image information, and compare them to restricted versions of your method that use only image or only text information. You might think about using external sources of information such as Wikipedia or WordNet.

One source of web data for this project is from the Visual Geometry Group at Oxford: http://www.robots.ox.ac.uk/~vgg/data/mkdb/index.html
 

Project 03

Title: Classification-based Tracking
Faculty: Robert Collins

Description: In this project we view tracking as a foreground-background classification problem. Given an initial frame of video where an object of interest has been indicated by a bounding box, we sample image patches from the object to form a positive training set, and patches from the background region surrounding the object to form a negative training set. Each sampled patch is described by a set of extracted features, e.g. RGB color histograms, oriented gradient histograms, motion/flow features, etc. The positive and negative examples are used to train a classifier to label patches in a new image as either object or background. Applying this classifier to patches in a new image produces a confidence map that can be used to localize the object in the new image, e.g. by mean-shift. After localization, new samples of object and background can be extracted and added to the positive and negative training sets, a new classifier can be learned for use in the next frame, and so on, resulting in a tracker that automatically adapts to changes in appearance of both the object and the background. Unfortunately, naive implementation of this adaptive scheme inevitably results in tracker drift, so mechanisms for avoiding drift will also be explored.
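
The localization step mentioned above (finding the object in a confidence map by mean-shift) can be sketched as follows. This is only an illustration under assumptions: the confidence map is synthetic, the window is a flat rectangular kernel, and the function name and parameters are hypothetical rather than part of the project material.

```python
import numpy as np


def mean_shift_localize(confidence, start, half_size, n_iter=20):
    """Move a window to the centroid of the confidence mass it covers.

    confidence: 2-D array of non-negative per-pixel object scores.
    start: (row, col) initial window centre, e.g. the previous object position.
    half_size: (half_height, half_width) of the tracking window.
    """
    rows, cols = confidence.shape
    r, c = float(start[0]), float(start[1])
    for _ in range(n_iter):
        r0 = max(int(round(r)) - half_size[0], 0)
        r1 = min(int(round(r)) + half_size[0] + 1, rows)
        c0 = max(int(round(c)) - half_size[1], 0)
        c1 = min(int(round(c)) + half_size[1] + 1, cols)
        window = confidence[r0:r1, c0:c1]
        mass = window.sum()
        if mass <= 0:
            break  # no evidence under the window; keep the last estimate
        rr, cc = np.mgrid[r0:r1, c0:c1]
        new_r = (rr * window).sum() / mass
        new_c = (cc * window).sum() / mass
        if abs(new_r - r) < 1e-3 and abs(new_c - c) < 1e-3:
            break  # converged
        r, c = new_r, new_c
    return r, c


# Synthetic confidence map: a Gaussian blob centred at row 30, column 40.
yy, xx = np.mgrid[0:64, 0:64]
confidence = np.exp(-((yy - 30) ** 2 + (xx - 40) ** 2) / (2 * 5.0 ** 2))
row, col = mean_shift_localize(confidence, start=(24, 33), half_size=(10, 10))
print(row, col)
```

In a real tracker the confidence map would come from the trained classifier, and the start position would be the object location in the previous frame.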
 

Project 04

Title: Putting bounding boxes on objects: a Semi-Supervised Approach

Faculty: Li Fei-Fei

Description: Putting bounding boxes on objects of interest in images is a laborious task. This project explores techniques for doing it in a semi-supervised way. Here is the setting: for each class of objects, we have a set of hundreds of images (e.g. photos of raccoons). In addition, a small subset of these photos (about 50) already has bounding boxes on the object of interest (e.g. bounding boxes on the raccoons in these photos). Can you leverage this information to complete the bounding box annotation for the rest of the photos that contain raccoons?

Three classes of objects and their images are provided. Each class contains 500 images, 50 of which are annotated with bounding boxes. The dataset is available here.
 

Project 05

Title: Clustering
Faculty: Tony Jebara

Description: Clustering is an unsupervised technique that can potentially separate different classes of objects or observations in a dataset without knowing any labels beforehand. Consider using clustering to solve binary classification tasks where all the classification labels are hidden.

Download 5 UCI classification datasets of your choice from: http://archive.ics.uci.edu/ml/.

If these are not binary classification problems, simply form a two-class binary classification problem by only including the largest two classes. You will use clustering to attempt to separate these two classes. Run the spectral clustering algorithm of Ng, Jordan and Weiss using the radial basis function kernel. Evaluate and plot the resulting classification error rate and also plot the ratio cut, sparse cut and normalized cut scores achieved by the algorithm. For each dataset, show these four performance measures while sweeping different values of the bandwidth in the kernel.
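
The experiment described above can be sketched as follows for the two-cluster case. This is a minimal illustration, not a reference implementation: the data are synthetic blobs rather than a UCI dataset, the small 2-means routine and all function names are assumptions made for this sketch, and the sparse cut score is left out because its exact definition is not given here.

```python
import numpy as np


def rbf_affinity(X, bandwidth):
    """A_ij = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)) with a zero diagonal."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    A = np.exp(-d2 / (2.0 * bandwidth ** 2))
    np.fill_diagonal(A, 0.0)
    return A


def two_means(U, n_iter=50):
    """Tiny 2-means on the rows of U, seeded with two far-apart rows."""
    first = U[0]
    second = U[np.argmax(np.sum((U - first) ** 2, axis=1))]
    centers = np.stack([first, second])
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = U[labels == k].mean(axis=0)
    return labels


def njw_two_clusters(X, bandwidth):
    """Ng-Jordan-Weiss spectral clustering specialised to two clusters."""
    A = rbf_affinity(X, bandwidth)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    U = vecs[:, -2:]                            # eigenvectors of the two largest
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return two_means(U), A


def cut_scores(A, labels):
    """Ratio cut and normalized cut of a two-way partition."""
    in_a = labels == 0
    cut = A[in_a][:, ~in_a].sum()
    ratio = cut / in_a.sum() + cut / (~in_a).sum()
    deg = A.sum(axis=1)
    ncut = cut / deg[in_a].sum() + cut / deg[~in_a].sum()
    return ratio, ncut


def error_rate(labels, truth):
    """Classification error under the better of the two label matchings."""
    err = np.mean(labels != truth)
    return min(err, 1.0 - err)


# Synthetic stand-in for a two-class dataset: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
truth = np.repeat([0, 1], 20)
results = []
for bandwidth in (0.5, 1.0, 2.0):               # sweep of the kernel bandwidth
    labels, A = njw_two_clusters(X, bandwidth)
    results.append((bandwidth, error_rate(labels, truth), *cut_scores(A, labels)))
print(results)
```

Each entry of results holds the bandwidth, the error rate, the ratio cut and the normalized cut, which are the quantities to plot against the bandwidth.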
 

Project 06 & 07 & 08

Faculty: Yann LeCun

Background information about these projects can be found in this paper: http://yann.lecun.com/exdb/publis/index.html#jarrett-iccv-09.
Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun: "What is the Best Multi-Stage Architecture for Object Recognition?," Proc. International Conference on Computer Vision (ICCV'09), 2009.

Project 06 Title: Learning features with predictive sparse decomposition


Description: Predictive Sparse Decomposition (PSD) is a method for learning sparse features in an unsupervised manner. The method is described in this paper: http://yann.lecun.com/exdb/publis/pdf/koray-psd-08.pdf. The method minimizes the following energy function:
     L(Y,Z,W) = ||Y-Wd.Z||^2 + k.Sparse(Z) + ||Z-G(We,Y)||^2
where Y is an image patch, Z is the feature vector representing Y, Wd and We are matrices, k is the weight of the sparsity term, and G() is a non-linear function (usually a tanh sigmoid function). For a given Y, we find the corresponding feature vector Z by minimizing L with respect to Z. Then, we can learn Wd and We by performing a gradient step to minimize L. The columns of Wd must be normalized.
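
The energy above and the minimization over Z can be sketched as follows. Several choices here are assumptions made for illustration only: Sparse(Z) is taken to be the L1 norm of Z, G(We,Y) is taken to be tanh(We @ Y), and the minimization uses ISTA (iterative shrinkage-thresholding), whereas the description only says that L is minimized with respect to Z. The learning step for Wd and We is not shown.

```python
import numpy as np


def psd_energy(Y, Z, Wd, We, k):
    """L(Y,Z,W) = ||Y - Wd.Z||^2 + k.Sparse(Z) + ||Z - G(We,Y)||^2.

    Assumptions of this sketch: Sparse(Z) is the L1 norm of Z and
    G(We, Y) = tanh(We @ Y).
    """
    reconstruction = np.sum((Y - Wd @ Z) ** 2)
    sparsity = k * np.sum(np.abs(Z))
    prediction = np.sum((Z - np.tanh(We @ Y)) ** 2)
    return reconstruction + sparsity + prediction


def infer_code(Y, Wd, We, k, n_iter=100):
    """Minimise L over Z with ISTA, starting from the encoder prediction."""
    predicted = np.tanh(We @ Y)
    # Step size 1 / Lipschitz constant of the smooth part of the energy.
    step = 1.0 / (2.0 * np.linalg.norm(Wd, 2) ** 2 + 2.0)
    Z = predicted.copy()
    for _ in range(n_iter):
        grad = -2.0 * Wd.T @ (Y - Wd @ Z) + 2.0 * (Z - predicted)
        V = Z - step * grad
        Z = np.sign(V) * np.maximum(np.abs(V) - step * k, 0.0)  # soft threshold
    return Z


def normalize_columns(Wd):
    """Rescale every column of Wd to unit Euclidean norm."""
    return Wd / np.linalg.norm(Wd, axis=0, keepdims=True)


rng = np.random.default_rng(0)
Y = rng.normal(size=81)                          # a flattened 9x9 patch
Wd = normalize_columns(rng.normal(size=(81, 32)))
We = 0.1 * rng.normal(size=(32, 81))
Z0 = np.tanh(We @ Y)                             # code predicted by the encoder
Z = infer_code(Y, Wd, We, k=0.5)
energy_before = psd_energy(Y, Z0, Wd, We, k=0.5)
energy_after = psd_energy(Y, Z, Wd, We, k=0.5)
print(energy_before, energy_after)
```

Random matrices stand in for learned ones here; the point is only to show how the three terms of the energy are evaluated and how a code Z is obtained for a given patch.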

Implement PSD, and use the system built for project 08 (below) to test the learned features. The image patches should be 9x9 pixels or 16x16 pixels. The images should be pre-processed with a high-pass filter (replace each pixel by itself minus a weighted average of its neighbors) and (optionally) a local contrast normalization (divide each pixel by the weighted standard deviation of its neighbors).

Project 07 Title: Learning features with denoising autoencoders

Description: The denoising autoencoder is a method for training feature extractors. The method is described here:
ICML paper: http://www.iro.umontreal.ca/~vincentp/Publications/vincent_icml_2008.pdf
ICML video: http://videolectures.net/icml08_vincent_ecrf/
More information: http://www.iro.umontreal.ca/~vincentp/publications.html

Train a denoising autoencoder on natural image patches of size 9x9 or 16x16. The images should be pre-processed with a high-pass filter (replace each pixel by itself minus a weighted average of its neighbors) and (optionally) a local contrast normalization (divide each pixel by the weighted standard deviation of its neighbors).
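
The pre-processing described above can be sketched as follows. This is one possible reading, not a prescribed implementation: the weights are assumed to be a normalized 9x9 Gaussian window (which includes the centre pixel), borders are handled by reflection, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gaussian_window(size=9, sigma=2.0):
    """Normalized size x size Gaussian weights (they sum to 1)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-ax ** 2 / (2.0 * sigma ** 2))
    w = np.outer(g, g)
    return w / w.sum()


def weighted_local_mean(image, window):
    """Weighted average over each pixel's neighbourhood (reflect-padded)."""
    pad = window.shape[0] // 2
    padded = np.pad(image, pad, mode="reflect")
    patches = sliding_window_view(padded, window.shape)   # (H, W, size, size)
    return np.einsum("ijab,ab->ij", patches, window)


def high_pass(image, window):
    """Replace each pixel by itself minus a weighted average of its neighbours."""
    return image - weighted_local_mean(image, window)


def local_contrast_normalize(image, window, eps=1e-6):
    """Divide each high-passed pixel by a weighted local standard deviation."""
    centered = high_pass(image, window)
    local_var = weighted_local_mean(centered ** 2, window)
    return centered / np.sqrt(local_var + eps)


rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32))               # stand-in for a natural image
window = gaussian_window()
filtered = high_pass(image, window)
normalized = local_contrast_normalize(image, window)
flat = high_pass(np.full((32, 32), 3.0), window)  # a constant image maps to zero
print(filtered.shape, np.abs(flat).max())
```

Patches of size 9x9 or 16x16 would then be cut from the pre-processed images before training.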

Project 08 Title: Building the simplest object recognizer

Description: As described in "What is the Best Multi-Stage Architecture for Object Recognition?", build a feature extraction system as follows:
   - Use size-normalized object images from one of the standard datasets (e.g. Caltech 101).
   - Preprocess them with a high-pass filter (replace each pixel by itself minus a weighted average of its neighbors) and (optionally) a local contrast normalization (divide each pixel by the weighted standard deviation of its neighbors).
   - Apply 64 random filters of size 9x9 or 16x16 over the entire image.
   - Pass the outputs through an absolute value rectification.
   - Perform high-pass filtering and local contrast normalization on the resulting 64 feature maps.
   - Perform spatial pooling and subsampling on each of the 64 feature maps. The subsampling ratio can be 4x4 or so.
   - Apply 64 random filters of size 9x9 or 16x16 to each of the 64 feature maps. This produces 4096 feature maps. Reduce this to 256 feature maps by adding random subsets of 64 of the 4096 feature maps.
   - Pass the outputs through an absolute value rectification.
   - Perform high-pass filtering and local contrast normalization on the resulting 256 feature maps.
   - Perform spatial pooling and subsampling on each of the 256 feature maps. The subsampling ratio can be 4x4 or so.
   - Feed the resulting features to a multinomial logistic regression classifier or to a linear SVM. Train this classifier in supervised mode.
This should get between 60 and 65% correct on Caltech-101.
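
The core of the first stage in the list above (random filters, absolute value rectification, pooling and subsampling) can be sketched as follows. This is a simplified illustration: the normalization steps and the second stage are omitted, average pooling is an assumption (the list only says "spatial pooling"), and the random input stands in for a preprocessed image.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def filter_bank(image, filters):
    """Valid 2-D correlation of one image with a stack of filters.

    image: (H, W); filters: (n_filters, k, k).
    Returns an array of shape (n_filters, H - k + 1, W - k + 1).
    """
    k = filters.shape[-1]
    patches = sliding_window_view(image, (k, k))     # (H-k+1, W-k+1, k, k)
    return np.einsum("ijab,fab->fij", patches, filters)


def average_pool(maps, ratio):
    """Average over non-overlapping ratio x ratio blocks of each feature map."""
    n, h, w = maps.shape
    h, w = h - h % ratio, w - w % ratio              # crop to a multiple of ratio
    blocks = maps[:, :h, :w].reshape(n, h // ratio, ratio, w // ratio, ratio)
    return blocks.mean(axis=(2, 4))


def first_stage(image, filters, ratio=4):
    """Random filters, then absolute value rectification, then pooling."""
    responses = filter_bank(image, filters)
    rectified = np.abs(responses)
    return average_pool(rectified, ratio)


rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64))          # stand-in for a preprocessed image
filters = rng.normal(size=(64, 9, 9))      # 64 random 9x9 filters
features = first_stage(image, filters)
print(features.shape)
```

With a 64x64 input and 9x9 filters, each of the 64 feature maps is 56x56 before pooling and 14x14 after 4x4 subsampling.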
 

Project 09

Title: Symmetry-based saliency detection from un-segmented images
Faculty: Yanxi Liu

Description: Symmetry detection has been a long-standing research topic in computer vision. This project will help you appreciate the difference between human and machine visual perception of real-world symmetries, and the challenges that symmetry detection and learning, usually considered trivial and instantaneous for humans, pose for computers.

Find 10 images (or take some photos of your own) and ask FIVE people to label all the symmetric parts of these 10 images. Symmetry can include: reflection symmetry with a reflection axis; rotation symmetry with a rotation center and a number of folds (for example, the star on the Chinese flag has 5-fold rotation symmetry); or even translation symmetry, i.e. periodic patterns such as the façade of a building, with two translation generating vectors forming the smallest generating tile of the pattern. Write an algorithm to extract symmetries; you can focus on a SINGLE type of symmetry (reflection, rotation or translation). Finally, compare the computer output with the labels from the FIVE human observers.

References and Tools:
  - Data sources on line:
    o Tools for human labeling (both images and interfaces):
       http://vision.cse.psu.edu/SymEva_files/Page406.htm
    o Various types of real images demonstrating real-world regularities can
       be found here; you are also encouraged to contribute to this database
       (!): http://vivid.cse.psu.edu/texturedb/gallery/
    o You are also encouraged to use images from publicly available object
       categorization and recognition data sets, e.g. from VOC2009
       http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2009/
  - A survey paper by Professor Liu et al. will be distributed before Thursday.
  - Performance Evaluation of State-of-the-Art Discrete Symmetry Detection
    Algorithms. (CVPR 2008) Minwoo Park, Seungkyu Lee, James Hays,
    Po-Chun Chen, Somesh Kashyap, Asad Butt and Yanxi Liu.
    http://vision.cse.psu.edu/evaluation.html
  - Curved Glide-Reflection Symmetry Detection (CVPR 2009) Seungkyu
    Lee and Yanxi Liu
 

Project 10

Title: Face image alignment
Faculty: Yi Ma

Description: We will provide you with a set of well-aligned face images of a person, taken under different lighting conditions. Given a new image of the person, roughly cropped from a picture, write an algorithm to automatically align the new face image with the rest. Your algorithm should work for input images whose rotation or scale differs from that of the well-aligned ones.
 

Project 11

Title: Discovering and matching planar structures from images
Faculty: Silvio Savarese

Description: This project aims at automatically discovering and matching local planar regions in complex scenes observed from different viewpoints. Learning techniques based on random sample consensus (RANSAC) are explored for estimating the geometrical transformation connecting observed planar surfaces across views, thus enabling robust matching procedures. The project will also address the issue of detecting and matching multiple planar regions by using iterative techniques such as sequential RANSAC or J-Linkage. For this project we will provide images extracted from a video sequence portraying a complex urban scene comprising different semi-local planar structures such as building facades, sidewalks, and/or advertisement panels.
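
For a single plane, the RANSAC estimation mentioned above can be sketched as follows. The sketch assumes that point correspondences between two views are already available and that the transformation is a homography fitted with an unnormalized direct linear transform; the data are synthetic and all function names are hypothetical. Sequential RANSAC would remove the inliers found and repeat on the remaining points.

```python
import numpy as np


def fit_homography(src, dst):
    """Direct linear transform: H such that dst ~ H @ src in homogeneous form."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)            # null vector of the constraint matrix


def transfer_error(H, src, dst):
    """Euclidean distance between the H-mapped src points and the dst points."""
    pts = np.column_stack([src, np.ones(len(src))]) @ H.T
    proj = pts[:, :2] / pts[:, 2:3]
    return np.linalg.norm(proj - dst, axis=1)


def ransac_homography(src, dst, n_iter=500, tol=1.0, seed=0):
    """Keep the 4-point hypothesis with the most inliers, then refit on them."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        sample = rng.choice(len(src), size=4, replace=False)
        H = fit_homography(src[sample], dst[sample])
        with np.errstate(divide="ignore", invalid="ignore"):
            inliers = transfer_error(H, src, dst) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers


# Synthetic correspondences: 30 points on one plane plus 10 gross outliers.
rng = np.random.default_rng(1)
H_true = np.array([[1.0, 0.1, 5.0], [-0.05, 0.9, -3.0], [1e-4, 2e-4, 1.0]])
src = rng.uniform(0, 100, size=(40, 2))
dst = np.column_stack([src, np.ones(40)]) @ H_true.T
dst = dst[:, :2] / dst[:, 2:3]
dst[30:] = rng.uniform(0, 100, size=(10, 2))
H_est, inliers = ransac_homography(src, dst)
print(inliers.sum(), transfer_error(H_est, src[:30], dst[:30]).max())
```

With real images the correspondences would come from a feature matcher, and the tolerance would have to be tuned to the noise level of the detected points.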
 

Project 12

Title: Name 10 objects in the picture
Faculty: Jianbo Shi

Description: The names should be category-based. We can predefine a (long) name list, or students can come up with their own. The goal is to discover objects, rather than to look for a few specific objects.
   version 1. Just put down names.
   version 2. Segment + Name.
 

Project 13

Title: Object counting
Faculty: Jianbo Shi

Description: Given an image in which objects appear multiple times, count them. For example: 4 (chair-like objects), 2 (table-like objects). Note that we do not need to name the objects; we just need to be able to discover repeated 'things'.
 

Project 14

Title: Shape from texture
Faculty: Sinisa Todorovic

Description: This project will address unsupervised estimation of the 3D shape and orientation of textured surfaces depicted in an image. Shape from texture is an important step toward higher-level image understanding, and thus one of the fundamental problems in computer vision. If a surface is textured, i.e., characterized by a spatial repetition of primitive texture elements (or texels), its 3D properties can be estimated by analyzing the texel shape, size, and placement properties in the image. For example, texels lying along parts of the surface that are far away from the camera will appear smaller in the image than the texels that are closer to the camera. Since the texels are statistically similar to each other, identifying the dominant trend in variations of texel properties (e.g., the gradient of foreshortening) can directly be used for 3D shape estimation. We will focus on images containing multiple textured surfaces.
 

Project 15

Title: Choices of features, classifiers, and representations for non-rigid object detection

Faculty: Zhuowen Tu

Description: The performance of object detection systems is determined by several key factors: (1) the learning algorithm; (2) the feature set; and (3) the underlying representation. Considerable recent progress has been made on these three aspects. In this project, we will use the INRIA dataset as a test-bed (http://pascal.inrialpes.fr/data/human/) on the pedestrian detection problem.

To test the effectiveness of different classifiers, students can compare different discriminative classifiers. A paper with empirical studies on machine learning datasets is at http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf. Typical choices include (but are not limited to):
(1) Support Vector Machine (SVM) (a well-documented implementation can be found at http://www.csie.ntu.edu.tw/~cjlin/libsvm/).
(2) Boosting (based on decision-stump and decision-tree weak classifiers)
(3) K-nearest neighbors (one can use a k-d tree for fast retrieval)
(4) Random forest classifier

To test the effectiveness of different features, one can try to use:
(1) the HOG features (http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf)
(2) Haar features (Paul A. Viola, Michael J. Jones: Robust Real-Time Face Detection. IJCV 57(2): 137-154 (2004))
(3) a large number of features including HOG, Haar, and features from different channels

To test the effectiveness of different representations, one can choose to learn object parts for detection:
(1) a latent SVM implementation (http://people.cs.uchicago.edu/~pff/latent/)
(2) a multiple component learning algorithm (http://vision.ucsd.edu/~pdollar/research/papers/DollarEtAlECCV08mcl.pdf)
 

Project 16

Title: Statistics of SIFT/HOG representation for objects
Faculty: Zhuowen Tu

Description: SIFT/HOG features have been widely used in the computer vision literature. Different objects (or different parts of the same object) may exhibit different texture patterns. It is important to know when and how to use these features as a basic object representation.

In this project, students can choose a set of object classes from e.g. the PASCAL, LHI or LabelMe dataset, and empirically study the manifold of SIFT, as well as its behavior for different object categories in classification. If successfully carried out, this project will provide some useful empirical guidelines for the use of SIFT/HOG features, which are somewhat lacking in the literature.
 

Project 17

Title: Object recognition
Faculty: Eric Xing

Description: The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, varying lighting conditions, and with different backgrounds. http://www.vision.caltech.edu/Image_Datasets/Caltech256/.

You can try to create an object recognition system which can identify which object category is the best match for a given test image. Apply clustering to learn object categories without supervision. Here are four ideas you can possibly work on:

1) The "codebook" used in the original CVPR05 paper by Fei-Fei was generated using a data preprocessing procedure, i.e., clustering the visual elements and picking the centroids as "codewords". When applied to a generative model, each observed visual element in a given image is matched to a "codeword" through a hard assignment. You may want to improve the flexibility of the model by allowing a soft assignment, so that a given visual element in the data can be matched to multiple different "codewords" with uncertainty through a noisy channel. You may also want to eliminate the preprocessing step by learning the codebook jointly. In this project you are asked to design (with help from the instructor) and implement such a model.

2) In the mixture of topic model used in Fei-Fei's CVPR 05 paper, the building block is an LDA model. Now you are asked to change it to a mixture of Logistic-Normal topic model. This change not only enriches the model so that it can capture correlations between topics, but also allows a direct upgrade of the original model to a dynamic model that evolves over time, and can therefore be applied to tasks such as object tracking and trajectory modeling in video data.

3) If you have a solid understanding of the above two steps and of nonparametric Bayesian models, you can extend the above models into a semi-parametric model by including a Dirichlet process, HDP, or Dynamic CRP prior, so that you can model your data in image/video datasets without pre-specifying the number of image classes, number of codewords, number of objects, number and duration of trajectories, etc. Many of these are open topics in vision and ML in general.

4) You can also try to create a discriminative max-margin topic model for images, based on "J. Zhu, A. Ahmed and E. P. Xing, MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification, The 26th International Conference on Machine Learning (ICML 2009)", to boost classification performance and to learn truly discriminative topics. Again, no earlier attempt has been made along this direction.
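
The difference between the hard and soft codeword assignment discussed in idea 1) can be illustrated with a small sketch. It assumes an isotropic Gaussian noise channel with a hypothetical width sigma and a fixed toy codebook; it is not the joint model the project asks you to design.

```python
import numpy as np


def squared_distances(descriptors, codebook):
    """Pairwise squared Euclidean distances, shape (n_descriptors, n_codewords)."""
    diff = descriptors[:, None, :] - codebook[None, :, :]
    return (diff ** 2).sum(axis=2)


def hard_assign(descriptors, codebook):
    """Index of the nearest codeword for every descriptor."""
    return squared_distances(descriptors, codebook).argmin(axis=1)


def soft_assign(descriptors, codebook, sigma=1.0):
    """Posterior over codewords under an isotropic Gaussian noise channel."""
    logits = -squared_distances(descriptors, codebook) / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Toy 2-D codebook and three visual elements (all values are made up).
codebook = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
descriptors = np.array([[0.1, 0.2], [2.0, 0.0], [0.5, 3.5]])
hard = hard_assign(descriptors, codebook)
soft = soft_assign(descriptors, codebook, sigma=1.0)
histogram = soft.sum(axis=0)          # soft bag-of-words counts for the image
print(hard, soft.round(3), histogram.round(3))
```

The second descriptor lies exactly halfway between two codewords: the hard assignment has to pick one of them, while the soft assignment splits its weight almost equally between the two.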
 

Project 18 & 19

Faculty: Song-Chun Zhu

In computer vision and pattern recognition, a visual concept, such as a texture pattern, a shape, or an object category, is often defined by a set of instances which could be governed by a probability model. It is therefore of significant interest to study algorithms for (1) simulating the model by drawing typical examples, and (2) estimating certain quantities about this model, such as the cardinality of the set or the expectations of certain statistics. The following exercises are two classical problems that should be solved through Markov chain Monte Carlo and importance sampling.

Project 18 Title: Simulating a model

Description: In an NxN lattice with torus boundary condition and 4-nearest neighborhood, for any site s=(i,j), a label x(s) is defined in a set {1, 2, 3, ..., K}. Let's start with K=2 and then generalize to arbitrary K later. This labeling forms a random field X={ x(s) }. For a pair of neighboring sites (s,t), we define an indicator function 1(x(s)=x(t)), which is equal to 1 if x(s)=x(t) and 0 otherwise.

Now we define a simple pattern as a set, C = { X: E[ 1(x(s) = x(t)) ] = h, for all (s,t) }, where h is a constant in [0,1] that measures how likely two neighboring sites are to have the same label.

Questions:
1. Derive a probability model for this set C.
2. Design an algorithm that can draw fair samples from this set.
3. How can you diagnose whether your sample is a fair sample? This is called exact sampling.
4. Adjust h to locate the critical temperature where the sampler slows down.
5. Develop a cluster sampling algorithm which can draw samples in polynomial time.
6. Plot curves for the convergence, and show typical images at various steps.
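
As a starting point for the questions above, here is a minimal sketch of a single-site Gibbs sampler. It assumes that one candidate model for C is a Potts-type distribution p(X) proportional to exp(beta * sum of 1(x(s)=x(t)) over neighboring pairs), with a parameter beta that would have to be tuned so that the expected agreement equals h. That modeling choice and all names are assumptions of the sketch, and it addresses neither exact sampling nor cluster sampling.

```python
import numpy as np


def gibbs_sweep(X, beta, K, rng):
    """One in-place Gibbs sweep over an N x N torus with 4-nearest neighbours."""
    N = X.shape[0]
    for i in range(N):
        for j in range(N):
            neighbours = (X[(i - 1) % N, j], X[(i + 1) % N, j],
                          X[i, (j - 1) % N], X[i, (j + 1) % N])
            # counts[k] = number of neighbours carrying label k + 1
            counts = np.bincount(np.array(neighbours) - 1, minlength=K)
            p = np.exp(beta * counts)
            X[i, j] = rng.choice(K, p=p / p.sum()) + 1
    return X


def agreement(X):
    """Fraction of neighbouring pairs with equal labels, an estimate of h."""
    same_down = X == np.roll(X, 1, axis=0)
    same_right = X == np.roll(X, 1, axis=1)
    return 0.5 * (same_down.mean() + same_right.mean())


rng = np.random.default_rng(0)
N, K = 16, 2
X = rng.integers(1, K + 1, size=(N, N))      # random initial labelling
trace = []
for _ in range(50):
    gibbs_sweep(X, beta=1.5, K=K, rng=rng)
    trace.append(agreement(X))               # convergence curve for plotting
h_estimate = trace[-1]
print(h_estimate)
```

Plotting trace against the sweep index for several values of beta gives a first impression of how quickly the sampler settles and where it slows down.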

Project 19 Title: Estimating the size of a set

Description: Consider an N x N grid. A self-avoiding walk (SAW) is a path that starts from site [0,0], i.e. the bottom-left corner, and at each step moves to one of its immediate nearest neighbors (4-nearest neighborhood), provided that neighbor has not been visited before. It stops when all the surrounding neighbors have been visited.

We want to know how many distinct SAW paths exist in a 7x7 or 10x10 grid. Note that this number is very large, so you will not be able to enumerate the paths.
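
One way to approach this with importance sampling is sequential importance sampling: grow a walk by picking uniformly among the unvisited neighbors, and weight the finished walk by the product of the numbers of choices available at each step. The sketch below follows that idea under the assumption that "N x N grid" means N x N sites; the function names are hypothetical, and no attempt is made to reduce the variance of the estimate.

```python
import random


def sample_saw_weight(n, rng):
    """Grow one self-avoiding walk from (0, 0) on an n x n grid of sites.

    Returns the product of the numbers of free neighbours met along the way,
    which is 1 / (probability of having generated this particular walk).
    """
    x, y = 0, 0
    visited = {(x, y)}
    weight = 1
    while True:
        free = [(x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < n and 0 <= y + dy < n
                and (x + dx, y + dy) not in visited]
        if not free:
            return weight          # all neighbours visited: the walk stops
        weight *= len(free)
        x, y = rng.choice(free)
        visited.add((x, y))


def estimate_saw_count(n, n_samples, seed=0):
    """Average importance weight, an unbiased estimate of the number of walks."""
    rng = random.Random(seed)
    total = sum(sample_saw_weight(n, rng) for _ in range(n_samples))
    return total / n_samples


count_2x2 = estimate_saw_count(2, 100)       # every 2 x 2 walk has weight 2
estimate_7x7 = estimate_saw_count(7, 10000)
print(count_2x2, estimate_7x7)
```

On a 2 x 2 grid there are exactly two such walks and every sample has weight 2, which gives a quick sanity check. For 7x7 or 10x10 the weights vary enormously, so many samples (or a better proposal) are needed for a stable estimate.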