Detecting and Registering Objects of Variable Structure
We have proposed the first method in computer vision for detecting and registering objects and shapes of variable structure in complicated, real-world images that may include large amounts of noise and clutter. Here we provide a very brief and high-level introduction to our method; more details can be found in our related publications.
What Do We Mean by "Variable Structure"?
Object and shape classes of variable structure are simply object and shape classes that do not have a fixed number and type of parts. Three examples can be seen in Figure 1 below: branches of leaves have an unknown number of leaves, combs have an unknown number of teeth, and handshapes have an unknown number of visible fingers. Obviously, each particular instance of such an object or shape has a specific number of parts. However, we cannot know this number of parts unless we see the object. This is unlike objects of fixed structure such as faces, for example, where the number and type of parts is known (two eyes, one nose, one mouth...).
Figure 1. Three shape classes that exhibit variable structure: branches with leaves, hair combs, and hand contours. Our goal is to automatically detect such shapes in cluttered images.
We should note that "variable structure" is a different term than "deformation", as illustrated in Figure 2. While many methods address detection and registration of deformable objects, those methods still cannot handle variable structure.
Figure 2. Left: four deformations of a branch shape. The structure of the shape is the same in all four deformations. Right: three different structures of a branch shape, containing different numbers of leaves.
Detection and Registration
"Detecting" an object means finding where that object is in an image. "Registering" an object means finding finding where each part of the object is located in the image. Detection and registration are particularly challenging in images with clutter, i.e., images containing other objects that we are not interested in detecting. Figures 3 and 4 display a few examples of cluttered images containing objects of variable structure, and results of our method.
Figure 3. Top row: images showing branches of leaves and additional clutter. Middle row: binary edge images (extracted automatically by a Canny edge detector). These images are given as input to our algorithm. Bottom row: output of our algorithm, showing the contour of the detected branches.
Figure 4. Top row: images showing hands and additional clutter. Middle row: binary edge images (extracted automatically by a Canny edge detector). These images are given as input to our algorithm. Bottom row: output of our algorithm, showing the contour of the detected hands.
Our method performs detection and registration by first extracting features from the image (in Figures 3 and 4 the features are simply edge pixels) and then matching those features with a graphical model that we call a Hidden State Shape Model, or HSSM in short. HSSMs are a generalization of Hidden Markov Models (HMMs). The key difference is that HMMs have been designed for matching sequences of features, for example in speech recognition and gesture recognition, where features are ordered based on the time they are observed. In our problem, all image features are observed simultaneously, and therefore they are unordered. Matching between an HSSM and a set of features is done using dynamic programming, and in particular using a variant of the Viterbi algorithm. To determine the optimal length of the matching (i.e., the optimal number of features to register with the model), the matching cost includes terms both for features matched with the model and for features matched with clutter.
More details can be found in our publications. The ECCV 2006 paper introduces HSSMs, and the PAMI paper extends our method to automatically detect the optimal scale of the object and optimal length of the matching.
Extension to Tracking
Significant speedups are obtained by applying HSSMs within a tracking framework, where the detection/registration results from the previous frame are used to restrict the search for the optimal registration in the current frame. This way, unconstrained search needs to be performed only for the first frame of the video sequence. We have implemented a hand tracker along these lines. The tracker initializes automatically and tolerates significant amounts of clutter and occlusion.
Here are five video sequences showing tracking results. Click on each image to play the entire video sequence. The left part shows the original image, the middle part shows the result of edge detection (where we combine a Canny edge detector and a binary classifier trained to recognize foreground edges based on skin color), and the right part shows the detection/registration result, where different colors are used for features matched to different states.
Detecting Instances of Shape Classes That Exhibit Variable Structure.
Vassilis Athitsos, Jingbin Wang, Stan Sclaroff, and Margrit Betke.
European Conference on Computer Vision (ECCV), May 2006.
Jingbin Wang, Vassilis Athitsos, Stan Sclaroff, and Margrit Betke.
Detecting Objects of Variable Shape Structure With Hidden State Shape Models.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(3), pages 477-492, March 2008.
Zheng Wu, Margrit Betke, Jingbin Wang, Vassilis Athitsos, and Stan Sclaroff.
Tracking with Dynamic Hidden State Shape Models.
European Conference on Computer Vision (ECCV), pages 643-656, October 2008.