


General Policy:

  • We provide skeleton code in Matlab for each project. You can start your implementation from the code we provide.
  • You may be able to find code online implemented by the original authors. It is definitely OK, and encouraged, to read this code. But it is an honor code violation to directly copy code from existing implementations. Note that your code will be read by the course instructors.
  • We allow both 1-person and 2-person teams for the projects. Each team hands in one copy of the code and write-up. Grading will be fair regardless of team size.
  • Your project will be evaluated based on many factors, including your understanding of the algorithm, the performance of your method, the quality of your code, your creativity, your write-up, and your presentation.
  • Each project is estimated to take 20-30 hours, depending on your familiarity with the algorithm and how far you want to go. The third project might take about 10 hours more.
  • You are not expected to implement every detail of the algorithm or to achieve state-of-the-art performance. But if your algorithm works very well, you will get extra credit.

Grading Policy per Project:

  • Technical approach and code: 35% (including the correctness and completeness of your method, your innovation, etc.)
  • Experimental evaluation: 35% (including the performance and results of your method, the thoroughness of your experiments, and the insights and analysis you draw from your results)
  • Write-up quality: 20% (please read the write-up sample carefully)
  • Project presentation: 10% (clarity and content of your project presentation)

Late Days:

We allow seven late days in total across all three projects. Once you have used up these late days, a project turned in late will be penalized 20% per late day.




Project 1: Pedestrian Detection with the Deformable Part Model

Project Description:

Implement a pedestrian detection algorithm using the HOG (Histogram of Oriented Gradients) feature representation and the Deformable Part Model.

  • Refer to [Viola & Jones, 2001] for a general idea about object detection systems.
  • Refer to [Dalal & Triggs, 2005] for the HOG (Histogram of Oriented Gradients) feature representation.
  • Refer to [Felzenszwalb et al, 2008] for the Deformable Part Model for object detection. Felzenszwalb et al (PAMI 2010) is a more detailed version of the paper.
  • Refer to [Felzenszwalb et al, 2010] for how to make your algorithm faster. (We encourage you to read this paper to get more ideas about how to improve a detection system. But you are not required to implement this paper.)

Dataset and Project Setup:

In this project, we will use the PASCAL VOC 2007 person detection dataset to train and test your program. Here is a brief introduction to the dataset (you only need to look at the "person" category). The performance of your method will be evaluated using the precision-recall curve and average precision (AP). Here is the criterion used to decide whether a specific detection is correct.
We will provide code to evaluate the performance of your algorithm as part of the starter code.
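
For reference, the standard PASCAL criterion counts a detection as correct if the intersection-over-union (IoU) overlap between the detected bounding box and a ground-truth box exceeds 0.5. Below is a minimal Matlab sketch of this check; the [x1 y1 x2 y2] box format and the function name are illustrative assumptions, and the provided evaluation code may use a different convention.

    % IoU overlap check for one detection against one ground-truth box.
    % Boxes are assumed to be [x1 y1 x2 y2]; adapt to the starter code's format.
    function correct = is_correct_detection(det, gt)
      iw = min(det(3), gt(3)) - max(det(1), gt(1)) + 1;   % intersection width
      ih = min(det(4), gt(4)) - max(det(2), gt(2)) + 1;   % intersection height
      if iw > 0 && ih > 0
        inter   = iw * ih;
        uni     = (det(3)-det(1)+1) * (det(4)-det(2)+1) + ...
                  (gt(3)-gt(1)+1) * (gt(4)-gt(2)+1) - inter;
        overlap = inter / uni;
      else
        overlap = 0;
      end
      correct = overlap > 0.5;                            % PASCAL VOC threshold
    end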

Please follow these steps to start your project:

  • Download the starter code here (7MB). Extract the file.
  • Download the complete VOC2007 dataset here (878MB) and extract it into the 'VOCdevkit' folder. We also provide a smaller subset of the VOC dataset for some quick prototyping (DO NOT REPORT ANY RESULT ON THIS SUBSET!).
  • The starter code includes a fast HOG feature implementation and learning and inference code for the root filter (including SVM training) in Matlab. Run "compile.m" from within Matlab, and run make from your favorite terminal, to compile the HOG and SVM code.
  • Run "pascal('person',2)" from within Matlab to train and evaluate the detector.
  • The starter code should give you an AP of 0.126, which serves as a baseline (the reference implementation has an AP of 0.362).

What is included in the starter code:

  • A fast implementation of HOG.
  • Training code for the root filter including the SVM ("Root Filter Initialization" in [Felzenszwalb et al, 2008]).
  • Detection code for the root filter.
  • Evaluation code for the VOC 2007 dataset.

What you need to implement:

  • The deformable parts of [Felzenszwalb et al, 2008] on top of the provided root-filter pipeline; see "More Guides" below for what is and is not required.

More Guides:

  • The authors of the deformable part model have their code online: http://people.cs.uchicago.edu/~pff/latent/. You can read the code before you start your own implementation. But the authors' code contains many tricks that are not fully covered by their paper, so do not worry if you cannot fully understand it.
  • Implement your method based on [Felzenszwalb et al, 2008], but you do not need to implement the mixture model or the dynamic programming for updating the deformable parts. You can refer to [Felzenszwalb et al, 2010] to get a better understanding of the method, but you do not need to implement the additional details in that paper. A rough sketch of the part scoring is given after this list.
  • Directly copying the authors' code without mentioning it in your write-up is an honor code violation. If you do have trouble implementing a specific function, you may refer to existing code, but mention this clearly in your write-up. Note that implementing a function yourself, even with poor performance, is more desirable than using existing code.
  • There might be some very time-consuming parts of the method. You may use MEX-files, which allow you to call C functions from Matlab, to accelerate your algorithm.
  • If your machine has limited memory or your code is extremely slow, you do not need to use all the training and testing images. You can use a subset of the VOC dataset (with a minimum of 1,000 training and 1,000 test images), but you are encouraged to use all training (2,501) and testing (4,952) images in your experiments.
  • If you have any questions or run into trouble, feel free to ask on Piazza or email the course staff at "cs231b-spr1213-staff [at] lists [dot] stanford [dot] edu".
  • Although you are basically implementing an existing algorithm, the project is very open-ended and you can do anything you can imagine to achieve good performance. Do not worry too much if your algorithm is not doing a perfect job, but we do encourage you to start early so that you have more time to experiment with your algorithm.
  • Your project will not be evaluated based only on the performance of your algorithm. Show us that you have a good understanding of the problem and the algorithm, and try to draw deep insights from your experimental results.
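
As mentioned in the guides above, the core quantity you add on top of the root filter is the part score: each part contributes its best filter response minus a quadratic deformation cost, searched in a window around its anchor position. The following Matlab sketch computes this with a brute-force search over displacements (the generalized distance transform in [Felzenszwalb et al, 2010] makes this much faster, but is not required). The inputs part_resp, anchor, and def are illustrative assumptions, not names from the starter code.

    % Score one root location (rx, ry) with parts, brute-force over displacements.
    % root_score  : root filter response at (rx, ry)
    % part_resp{i}: response map of part filter i at twice the root resolution
    % anchor(i,:) : ideal part position [ax ay] relative to 2*[rx ry]
    % def(i,:)    : deformation weights for the quadratic cost d.(dx, dx^2, dy, dy^2)
    function s = score_with_parts(root_score, part_resp, anchor, def, rx, ry, radius)
      s = root_score;
      for i = 1:numel(part_resp)
        [H, W] = size(part_resp{i});
        px0 = 2 * rx + anchor(i, 1);
        py0 = 2 * ry + anchor(i, 2);
        best = -inf;
        for dx = -radius:radius
          for dy = -radius:radius
            px = px0 + dx;  py = py0 + dy;
            if px < 1 || py < 1 || px > W || py > H, continue; end
            cost = def(i,1)*dx + def(i,2)*dx^2 + def(i,3)*dy + def(i,4)*dy^2;
            best = max(best, part_resp{i}(py, px) - cost);
          end
        end
        s = s + best;   % add the best placement of part i
      end
    end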

References:

  • P.Viola and M.Jones. Rapid Object Detection using a Boosted Cascade of Simple Features. CVPR 2001.
  • N.Dalal and B.Triggs. Histograms of Oriented Gradients for Human Detection. CVPR 2005.
  • P.Felzenszwalb, D.McAllester, and D.Ramanan. A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR 2008.
  • P.Felzenszwalb, R.Girshick, D.McAllester, and D.Ramanan. Object Detection with Discriminatively Trained Part-Based Models. PAMI 2010.
  • P.Felzenszwalb, R.Girshick, and D.McAllester. Cascade Object Detection with Deformable Part Models. CVPR 2010.

Important Dates:

  • A short presentation of your work: Tue, Apr 23 (in class)
  • Deadline for submitting the code and write-up: Wed, Apr 24 (5:00pm); instructions for how to submit your documents will be posted later

You need to have your implementation and evaluation ready for your presentation (Apr 23). You have an additional day to update your write-up based on the feedback you get in class.




Project 2: Interactive Image Segmentation with GrabCut

Project Description:

  • Implement GrabCut, an image segmentation algorithm that is used in Office 2010. Refer to Rother et al for a detailed description of the algorithm.
  • Refer to Sec.2 of the Rother paper and Boykov & Jolly for details of energy minimization based image segmentation.
  • What you need to implement: You only need to implement the iterative procedure described in Sec.3.1 and Sec.3.2 of the Rother paper to get a binary segmentation of the image. The user interaction interface in Sec.3.3 is optional. You do not need to implement border matting or foreground estimation in Sec.4.
  • Initialization code: You can start your implementation from the code here. Using this code, you can initialize the segmentation by drawing a bounding box that contains the foreground.
  • Max-flow/min-cut code: You can download the max-flow/min-cut energy minimization code from here. The code is written in C instead of Matlab, so for this project you have two options: implement your code in C, or implement it in Matlab and use a MEX-file to call the max-flow/min-cut function. Here is a brief introduction to MEX-files.
  • Please go through the simple example in the README file before you use the code.
  • "g->add_node()" adds a node to the graph. For an image, each pixel is a node. You need to add all the pixels to the graph.
  • "g->add_tweights(n, w1, w2)" is the "unary" term of the energy. It represents that if we classify node n as the 1st class, the unary energy for this node will be w1; otherwise the energy will be w2. We need to add this term for each node (pixel). In GrabCut, w1 and w2 are obtained from Mixture of Gaussian.
  • "g->add_edge(n1, n2, w1, w2)" is the "pairwise" term of the energy. It represents if there is an edge from n1 to n2, the energy for this edge will be w1; if the edge is from n2 to n1, the energy will be w2. Note that if n1 and n2 are disconnected, the energy will be 0. The edges in our graph is undirected, thus we always have w1=w2. In GrabuCut, we need to add this term for all the pairs of nodes (pixels) that are in the 4 or 8-neighborhood of each other.
  • In the GrabCut, you need to build an algorithm which iteratively updates the mixture of Gaussian distribution for unary terms and use the max-flow/min-cut algorithm for learning segmentation results.
  • You can read Szeliski et al for more about energy minimization methods in computer vision if you are interested.
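
As referenced above, here is a minimal Matlab sketch of the main GrabCut loop, assuming the Statistics Toolbox (fitgmdist/pdf) and a hypothetical MEX wrapper maxflow_mex around the max-flow/min-cut library; the pairwise weights use the contrast-sensitive form gamma*exp(-beta*||z_m - z_n||^2) from the Rother paper. It illustrates the structure only, not the provided code, and it omits details such as hard background constraints outside the bounding box.

    % Minimal GrabCut iteration sketch (helper names and values are illustrative).
    img = im2double(imread('input.jpg'));          % assume a color image
    [h, w, ~] = size(img);
    pix = reshape(img, [], 3);
    box = [60 40 220 200];                         % example user box: [x1 y1 x2 y2]
    alpha = false(h, w);
    alpha(box(2):box(4), box(1):box(3)) = true;    % initialize: inside box = foreground

    % 4-connected neighbor pairs (right and down) and their color differences.
    idx   = reshape(1:h*w, h, w);
    pairs = [reshape(idx(:,1:end-1),[],1), reshape(idx(:,2:end),[],1); ...
             reshape(idx(1:end-1,:),[],1), reshape(idx(2:end,:),[],1)];
    sqd   = sum((pix(pairs(:,1),:) - pix(pairs(:,2),:)).^2, 2);
    beta  = 1 / (2 * mean(sqd));
    gamma = 50;                                    % value suggested in the Rother paper
    pairwise = gamma * exp(-beta * sqd);           % contrast-sensitive smoothness

    for iter = 1:10
        % Step 1: fit foreground/background GMMs on the current labeling.
        fg_gmm = fitgmdist(pix(alpha(:), :), 5, 'RegularizationValue', 1e-3);
        bg_gmm = fitgmdist(pix(~alpha(:), :), 5, 'RegularizationValue', 1e-3);

        % Step 2: unary costs = negative log-likelihood under each GMM.
        % U_fg(n): cost of labeling pixel n foreground; U_bg(n): cost of background.
        U_fg = -log(pdf(fg_gmm, pix) + eps);
        U_bg = -log(pdf(bg_gmm, pix) + eps);

        % Step 3: min-cut gives the new labeling; maxflow_mex is a hypothetical
        % wrapper you would write around g->add_node / add_tweights / add_edge.
        labels = maxflow_mex(U_fg, U_bg, pairs, pairwise);
        alpha  = reshape(labels == 1, h, w);       % assume label 1 = foreground
    end
    imshow(img .* repmat(double(alpha), [1 1 3])); % show the segmented foreground
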
Dataset and Experiment:

  • Images: Download the images that you need to segment from here. Each image contains one foreground object.
  • Ground Truth: Download the segmentation ground truth images from here. White and black colors indicate foreground and background respectively.
  • Optional Bounding Boxes: We provide a set of bounding boxes from the authors of "Image Segmentation with a Bounding Box Prior" here. You can also use your own, if you choose to do so.
  • Evaluation metric: the number of pixels that are correctly labeled divided by the total number of pixels. You need to compute this value on six images: four images on which your algorithm performs very well and two images on which it does not perform well (if there are any). A small Matlab sketch of this metric is given after this list.
  • Analytical study (optional but good to have some): You can change a number of parameters and compare the results. For example:
  • The number of iterations of GMM updating and energy minimization.
  • The number of mixture models in your GMM.
  • 4-neighborhood or 8-neighborhood in your pairwise term.
  • A tight initial bounding box or a loose bounding box.
  • For this project, we expect state-of-the-art results (at least on some images).
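
As referenced above, a minimal Matlab sketch of the pixel accuracy metric, assuming seg is the logical foreground mask produced by your algorithm and the ground-truth file name is a placeholder:

    % Pixel accuracy: fraction of pixels whose label matches the ground truth.
    gt = imread('ground_truth.png');        % placeholder file name
    gt = gt(:, :, 1) > 128;                 % white = foreground, black = background
    accuracy = mean(seg(:) == gt(:));       % seg: your logical foreground mask
    fprintf('Pixel accuracy: %.3f\n', accuracy);
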
References:

  • Y.Boykov and M.Jolly. Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images. ICCV 2001.
  • C.Rother, V.Kolmogorov, and A.Blake. "GrabCut" - Interactive Foreground Extraction using Iterated Graph Cuts. SIGGRAPH 2004.
  • R.Szeliski, R.Zabih, D.Scharstein, O.Veksler, V.Kolmogorov, A.Agarwala, M.Tappen, and C.Rother. A Comparative Study of Energy Minimization Methods for Markov Random Fields with Smoothness-Based Priors. PAMI 2008.
  • V.Lempitsky, P.Kohli, C.Rother, T.Sharp. Image segmentation with a Bounding Box Prior. ICCV 2009.
Important Dates:

  • A short presentation of your work: Tue, May 14 (in class)
  • Deadline of submitting the code and write-up: Wed, May 15 (5:00pm)



Project 3: Locality-constrained Linear Coding for Scene and Action Classification

Project Description:

  • Implement the LLC (Locality-constrained Linear Coding) method for image classification. Apply your method to a natural scene classification dataset.
  • Download the 15-class natural scene dataset from here.
  • Download the spatial pyramid code from here. Run this code on the scene classification dataset to extract spatial pyramid features, and then use an SVM for image classification. You should test the algorithm with and without the histogram intersection kernel and compare the results. It is highly encouraged to use LIBLINEAR for SVM classification because it gives you much higher speed and better performance. This paper contains the results of using the spatial pyramid on the scene dataset.
  • Implement the LLC method by modifying the spatial pyramid code. There are two things that you need to change: (1) the original spatial pyramid method uses hard codeword assignment, while LLC uses locality-constrained linear coding (Section 3 of the paper); (2) the original spatial pyramid uses sum pooling to form the histogram feature, while LLC uses max pooling (first few lines on page 5 of the paper). A sketch of these two changes is given after this list.
  • You need to modify the BuildHistograms.m and CompilePyramid.m functions of the code. Note that a large codebook (e.g. 1024 or 2048 codewords) is helpful for LLC. You only need to use a linear kernel for LLC. Compare your results with the spatial pyramid results.
  • Based on the LLC method that you have implemented, change the original feature setting to a multi-scale, densely sampled SIFT feature representation and apply it to an action classification dataset.
  • Download the PPMI (people-playing music instrument) dataset from here. You need to use the 12-class dataset of people playing different musical instruments. You only need to work on the normalized images. The training and testing images have been specified in the dataset.
  • Run the LLC method that you have implemented on this dataset.
  • In the current method, only a single SIFT scale and sampling spacing is used. Change it to a densely sampled SIFT feature representation. You can find details of the dense SIFT representation in the "image representation" paragraph on page 3 of the Delaitre et al paper. Try different parameter settings of the dense SIFT representation and compare your results with your previous experiment.
  • (Optional) Combining LLC and object bank for image classification.
  • Download the object bank code from here and use it for classifying the action images.
  • Combine the object bank method and LLC method for better classification results. Note that this is a relatively open problem and you can try any approach to combine them. For example, each method will give you a K-dimensional confidence score where K is the number of classes. You can simply concatenate the two K-dimensional vectors and train a new SVM on top of them.
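
As referenced above, here is a minimal Matlab sketch of the two LLC changes, using the approximated LLC coding from Section 3 of the Wang et al paper (k nearest codewords, a small regularized least-squares solve, and a sum-to-one constraint) followed by max pooling. The function name and inputs are illustrative; in practice you would adapt this inside BuildHistograms.m / CompilePyramid.m.

    % Approximated LLC coding + max pooling for one spatial bin.
    % X: N x D matrix of SIFT descriptors; B: M x D codebook (M codewords).
    function pooled = llc_max_pool(X, B, knn, lambda)
      if nargin < 3, knn = 5; end
      if nargin < 4, lambda = 1e-4; end
      N = size(X, 1);
      M = size(B, 1);
      codes = zeros(N, M);

      % squared distances from each descriptor to each codeword
      D2 = bsxfun(@plus, sum(X.^2, 2), sum(B.^2, 2)') - 2 * (X * B');
      [~, nn_idx] = sort(D2, 2);

      for i = 1:N
        idx = nn_idx(i, 1:knn);                       % k nearest codewords
        z   = B(idx, :) - repmat(X(i, :), knn, 1);    % shift codewords to the descriptor
        C   = z * z';                                 % local covariance
        C   = C + lambda * trace(C) * eye(knn);       % regularization
        c   = C \ ones(knn, 1);                       % solve the small linear system
        codes(i, idx) = (c / sum(c))';                % enforce the sum-to-one constraint
      end

      % Max pooling over descriptors (the original spatial pyramid uses sum pooling).
      pooled = max(codes, [], 1);
    end
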
Dataset, Resources, and Experiment:

  • Datasets: 15-class natural scene dataset, PPMI dataset
  • Useful code: Spatial pyramid, LIBLINEAR, Object bank
  • Evaluation metric: Evaluate your method using the mean accuracy over all classes: compute the accuracy of each class, then average the per-class accuracy values. Note that this is different from the overall accuracy that LIBLINEAR reports. A small sketch of this metric is given after this list.
  • A confusion matrix is a good way to understand your results. Please refer to the Wikipedia page for a definition of the confusion matrix.
  • For any experiment, you are encouraged to change the parameters and compare the results.
  • You may only use training images in all stages of training, e.g. building the codebook, training the SVM, etc. You should assume that no test image is available during training.
  • You are encouraged to modify the provided SIFT or SVM code for your project, if it is necessary.
  • You might be able to find the LLC code online. It is definitely not acceptable to use that code.
  • In this project, you will get state-of-the-art results. If you combine LLC and object bank, your results might beat the state of the art!
  • It is important to try different parameters and experimental settings, and draw insights from the results you observe.
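
As referenced above, a minimal Matlab sketch of the mean per-class accuracy and the confusion matrix, assuming ytrue and ypred are equally sized vectors of true and predicted labels for the test images (names are illustrative):

    % Mean per-class accuracy (different from the overall accuracy LIBLINEAR reports).
    classes = unique(ytrue);
    K = numel(classes);
    conf = zeros(K);
    for i = 1:K
        for j = 1:K
            conf(i, j) = sum(ytrue == classes(i) & ypred == classes(j));
        end
    end
    conf = bsxfun(@rdivide, conf, sum(conf, 2));  % row-normalize: rows sum to 1
    mean_accuracy = mean(diag(conf));             % average of the per-class accuracies
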
References:

  • S.Lazebnik, C.Schmid, and J.Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.
  • J.Wang, J.Yang, K.Yu, F.Lv, T.Huang, and Y.Gong. Locality-constrained Linear Coding for Image Classification. CVPR 2010.
  • V. Delaitre, I. Laptev, and J. Sivic. Recognizing Human Actions in Still Images: A Study of Bag-of-Features and Part-Based Representation. BMVC 2010.
  • J.Li, H.Su, E.Xing, and L.Fei-Fei. Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification. NIPS 2010.
  • B.Yao and L.Fei-Fei. Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions. CVPR 2010.
Important Dates:

  • A short presentation of your work: Tue, Jun 04 (in class)
  • Deadline of submitting the code and write-up: Wed, Jun 05 (5:00pm)