Vassilis Athitsos VLM Lab

3D Hand Pose Estimation

The 3D pose of a hand is defined by the joint angles and the orientation of the hand. Different configurations of joint angles lead to different hand shapes. The same hand shape can look very different depending on its 3D hand orientation. The joint angles in a hand can be specified using 20 parameters. Three additional parameters are needed for the 3D orientation of the hand. Therefore, 3D hand pose is specified using a total of 23 parameters.

Figure 1: Three different hand shapes. Each hand shape corresponds to a different configuration of joint angles.

Figure 2: Three different views of a single hand shape. Each view corresponds to a different 3D hand orientation.

In this project, 3D hand pose estimation is formulated as an image database indexing problem. A large database of synthetic hand images is created using computer graphics; it contains images of various hand shapes under various 3D orientations. For each synthetic image, the system knows the hand shape and 3D orientation that were used to create it. To estimate the 3D hand pose in an input image, the system retrieves the most similar images in the database. The hand pose parameters associated with those images are used as estimates for the hand pose in the input image.
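The retrieval formulation above can be illustrated with a minimal sketch. It is not the project's implementation: the DatabaseEntry class, the two-dimensional "features" descriptor, and the use of Euclidean distance are all assumptions made for illustration. The only point it shows is that each database entry carries the pose labels it was rendered with, so retrieving the nearest entries yields pose estimates directly.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class DatabaseEntry:
    """One synthetic image plus the pose parameters used to render it."""
    features: Tuple[float, ...]               # hypothetical image descriptor
    shape_id: int                             # index of the rendered hand shape
    orientation: Tuple[float, float, float]   # 3D orientation used for rendering


def estimate_pose(
    query_features: Sequence[float],
    database: Sequence[DatabaseEntry],
    k: int = 1,
) -> List[Tuple[int, Tuple[float, float, float]]]:
    """Return the (shape_id, orientation) labels of the k nearest entries."""
    # Brute-force ranking: compare the query against every database entry.
    ranked = sorted(database, key=lambda e: math.dist(query_features, e.features))
    return [(e.shape_id, e.orientation) for e in ranked[:k]]


# Toy database: three entries with made-up two-dimensional descriptors.
database = [
    DatabaseEntry((0.0, 0.0), shape_id=0, orientation=(0.0, 0.0, 0.0)),
    DatabaseEntry((1.0, 0.0), shape_id=0, orientation=(0.0, 90.0, 0.0)),
    DatabaseEntry((0.0, 1.0), shape_id=1, orientation=(0.0, 0.0, 0.0)),
]
print(estimate_pose((0.9, 0.1), database))
```

Returning the k best matches rather than a single one reflects the text above, which uses the parameters of several retrieved images as candidate estimates.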

Figure 3: System input and output. Given the input image, the system goes through the database of synthetic images in order to identify the ones that are the most similar to the input image. Eight examples of database images are shown here, and the most similar one is enclosed in a red square. The database currently used contains more than 100,000 images.

Main Challenges

Building a Large Database

In order to correctly estimate the hand pose in an image, the database must contain a synthetic image with similar hand pose. A question that arises is: how many database images do we need in order to guarantee that for every possible hand pose there will be a similar database image? We do not have an answer to this question at this point. The answer partly depends on a definition of when two images are "similar"; the more stringent our criteria for similarity are, the more database images we need.

In our current implementation, the database includes images of 26 hand shapes. Each hand shape is rendered from 4128 different 3D hand orientations. Overall, the database contains 107,328 images. Our sampling of 26 hand shapes is definitely inadequate to capture the entire range of possible hand shapes. On the other hand, our sampling of 3D hand orientations is an approximately uniform and dense sampling of the space of 3D orientations.

Figure 4: The 26 hand shapes used to generate the more than 100,000 database images.

Recognizing a fixed number of hand shapes is not as general as recognizing arbitrary 3D hand pose. However, it can be sufficient for many gesture recognition applications. For example, fewer than 100 basic hand shapes are used in American Sign Language (ASL), which means that our framework is applicable to recognizing hand pose in the ASL context.

Defining Accurate Similarity Measures

Given an input image, the system must retrieve the most similar database images. Therefore, we need to address the question: how do we define similarity? We need to identify similarity measures that return high similarity values for images of similar hand pose, and low similarity values for images of different hand poses.

We have experimented with several different similarity measures: the chamfer distance, geometric moments, edge orientation histograms, finger matching, and line matching. Our experiments show that the chamfer distance is the most accurate similarity measure, but also the most computationally expensive one. Current work focuses on developing similarity measures that can improve the accuracy and efficiency of the system. We are particularly interested in similarity measures that are robust to noise, clutter, and segmentation errors. Our line matching method, described in our CVPR 2003 publication, is a first result in that direction.
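For readers unfamiliar with the chamfer distance, here is a minimal sketch of one common formulation, applied to two sets of 2D edge pixel coordinates. The function names are hypothetical, and the choice of summing the two directed distances is an assumption; the cited papers give the exact definition used in the project. The nested loop over both point sets also hints at why this measure is expensive when evaluated against every database image.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def directed_chamfer(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Average distance from each point in a to its nearest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def chamfer(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric version: sum of the two directed distances."""
    return directed_chamfer(a, b) + directed_chamfer(b, a)


# Two toy "edge images": horizontal segments one pixel apart.
edges_a = [(0.0, 0.0), (1.0, 0.0)]
edges_b = [(0.0, 1.0), (1.0, 1.0)]
print(chamfer(edges_a, edges_b))
```

Practical implementations often precompute a distance transform of one edge image so that each nearest-point lookup becomes a single array access; the brute-force form above is kept only for clarity.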

Achieving Efficient Retrieval

For every input image, the system has to evaluate its similarity to all database images. That task can be very time-consuming, given the size of the database. On the other hand, for the system to be useful in human-computer interaction applications, retrieval must run at interactive speeds.

We have found that retrieval efficiency is greatly improved if we use multi-step retrieval: first use computationally cheap similarity measures (like finger matching and geometric moments) to reject the bulk of database images, and then use more accurate similarity measures (like the chamfer distance) to evaluate the remaining database images. Retrieval accuracy, meanwhile, remains close to the accuracy attained by applying all similarity measures to all images.
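The multi-step scheme just described can be sketched generically as a filter step followed by a refine step. This is an illustration under stated assumptions, not the project's code: the function name filter_and_refine, the parameters keep and k, and the toy integer "images" are all hypothetical, and the cheap and exact arguments stand in for measures such as geometric moments and the chamfer distance.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def filter_and_refine(
    query: T,
    database: Sequence[T],
    cheap: Callable[[T, T], float],
    exact: Callable[[T, T], float],
    keep: int = 100,
    k: int = 1,
) -> List[T]:
    """Rank everything with the cheap measure, then re-rank the survivors."""
    # Step 1: the cheap measure is evaluated on every database item.
    candidates = sorted(database, key=lambda item: cheap(query, item))[:keep]
    # Step 2: the expensive measure is evaluated on only `keep` items.
    return sorted(candidates, key=lambda item: exact(query, item))[:k]


# Toy example: items are integers, the cheap measure compares only the
# tens digit, and the exact measure is the absolute difference.
toy_database = list(range(100))
best = filter_and_refine(
    42,
    toy_database,
    cheap=lambda q, x: abs(q // 10 - x // 10),
    exact=lambda q, x: abs(q - x),
    keep=10,
)
print(best)
```

The value of keep controls the trade-off: a smaller value saves more exact evaluations but raises the risk that the true best match is rejected in the first step.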

Another direction that we have explored for retrieval efficiency is deriving efficient approximations of computationally expensive similarity measures, like the chamfer distance. In our GW 2003 and CVPR 2003 publications we define approximations of the chamfer distance using a technique called Lipschitz embeddings. These approximations are well suited to the first retrieval steps, which identify the most likely matches. The exact measures can then be applied only to the selected matches, so that the overall efficiency is improved. Our CVPR 2004 paper introduces the BoostMap method, a general method for constructing embeddings. Applied to hand images, BoostMap greatly improves the approximation of the chamfer distance.
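As a rough illustration of the embedding idea, the sketch below maps each object to the vector of its exact distances to a few reference objects, then compares those vectors with a cheap L1 distance. Several things here are assumptions rather than details from the cited papers: the function names, the use of single reference objects per coordinate, the L1 comparison, and the use of 2D points with Euclidean distance as a stand-in for hand images and the chamfer distance.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def lipschitz_embed(
    obj: T, references: Sequence[T], distance: Callable[[T, T], float]
) -> List[float]:
    """Map obj to the vector of its exact distances to each reference object."""
    return [distance(obj, r) for r in references]


def l1_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cheap vector distance used in place of the expensive original measure."""
    return sum(abs(a - b) for a, b in zip(u, v))


def rank_by_embedding(
    query_vec: Sequence[float], database_vecs: Sequence[Sequence[float]]
) -> List[int]:
    """Database indices sorted by L1 distance between embedded vectors."""
    return sorted(
        range(len(database_vecs)),
        key=lambda i: l1_distance(query_vec, database_vecs[i]),
    )


# Stand-in data: 2D points with Euclidean distance as the "expensive" measure.
points = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0), (5.0, 5.0), (1.0, 1.0)]
references = [(0.0, 0.0), (5.0, 5.0)]

# Database embeddings can be computed once, offline.
database_vecs = [lipschitz_embed(p, references, math.dist) for p in points]

# At query time only len(references) exact distances are needed.
query_vec = lipschitz_embed((0.9, 1.2), references, math.dist)
print(rank_by_embedding(query_vec, database_vecs))
```

The ranking produced this way is only approximate, which is why the text above applies the exact measure afterwards to the top-ranked candidates.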

References

The papers in Cues in Communication 2001 and Face and Gesture 2002 describe the overall framework, the assumptions and the goals of the project. The papers in CVPR 2003 and the Gesture Workshop 2003 focus on methods to improve efficiency and accuracy in the presence of clutter and segmentation errors. The paper in CVPR 2004 introduces a general embedding method that can be used to efficiently approximate the chamfer distance.

3D Hand Pose Estimation by Finding Appearance-Based Matches in a Large Database of Training Views.
Vassilis Athitsos and Stan Sclaroff.
IEEE Workshop on Cues in Communication, December 2001.
[Postscript 2.5MB] [Compressed Postscript 597KB] [PDF 367KB]
Extended version (Technical Report BUCS-2001-021):
[Postscript 4.0MB] [Compressed Postscript 777KB] [PDF 470KB]

An Appearance-Based Framework for 3D Hand Shape Classification and Camera Viewpoint Estimation.
Vassilis Athitsos and Stan Sclaroff.
IEEE Conference on Automatic Face and Gesture Recognition, pages 45-52, May 2002.
[Postscript 1.7MB] [Compressed Postscript 682KB] [PDF 549KB]
Extended version (Technical Report BUCS-2001-022):
[Postscript 6.3MB] [Compressed Postscript 1.3MB] [PDF 716KB]

Database Indexing Methods for 3D Hand Pose Estimation.
Vassilis Athitsos and Stan Sclaroff.
Gesture Workshop, pages 288-299, April 2003.
[Postscript 3.7MB] [Compressed Postscript 587KB] [PDF 267KB]

Estimating 3D Hand Pose from a Cluttered Image.
Vassilis Athitsos and Stan Sclaroff.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 432-439, June 2003.
[Postscript 7.4MB] [Compressed Postscript 1.7MB] [PDF 594KB]

BoostMap: A Method for Efficient Approximate Similarity Rankings.
Vassilis Athitsos, Jonathan Alon, Stan Sclaroff, and George Kollios.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 268-275, June 2004.
[Postscript 2.6MB] [Compressed Postscript 588KB] [PDF 229KB]

BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval.
Vassilis Athitsos, Jonathan Alon, Stan Sclaroff, and George Kollios.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(1), pages 89-104, January 2008.
[Postscript 51MB] [PDF 3.2MB] [Pre-print in PDF with color images 632KB]

Nearest Neighbor Search Methods for Handshape Recognition.
Michalis Potamias and Vassilis Athitsos.
Conference on Pervasive Technologies Related to Assistive Environments (PETRA), July 2008.
[Postscript 10.4MB] [PDF 259KB]
