[0001] This application is related to and claims priority from U.S. provisional application 60/371,178, filed Apr. 9, 2002, entitled “Dynamic Gesture Recognition from Stereo Sequences”; and PCT application RU01/00296, international filing date Jul. 18, 2001, entitled “Dynamic Gesture Recognition from Stereo Sequences”.
[0002] This invention relates to system interfaces in general, and more specifically to dynamic gesture recognition from stereo sequences.
[0003] The field of gesture recognition for computer systems has been developing in recent years. In general, a gesture recognition system will recognize physical gestures made by an individual and respond according to an interpretation of the gestures. Gesture recognition may be used in a computer interface, for interpreting sign language, in industrial control, in entertainment applications, or for numerous other purposes. The challenge in gesture recognition systems is to provide a simple, easy-to-use system that is also highly accurate in its interpretation of gestures.
[0004] In conventional gesture recognition systems, the process may proceed as shown in
[0005] If the video frame is not the first frame in the sequence, process block
[0006] A conventional gesture recognition system is limited in a number of ways. The use of two-dimensional images may provide insufficient depth of field information to properly detect upper body positions, which may lead to misinterpretation of gestures. The need to initialize a gesture recognition system with specified gestures or the use of special devices creates additional difficulty in using a system and may discourage users from attempting to access a system that incorporates gesture recognition.
[0007] The appended claims set forth the features of the invention with particularity. The invention, together with its advantages, may be best understood from the following detailed descriptions taken in conjunction with the accompanying drawings, of which:
[0008]
[0009]
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019] A method and apparatus are described for dynamic gesture recognition from stereo sequences.
[0020] In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form.
[0021] The present invention includes various processes, which will be described below. The processes of the present invention may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor or logic circuits programmed with the instructions to perform the processes. Alternatively, the processes may be performed by a combination of hardware and software.
[0022]
[0023] If the video frame is the first frame in the sequence, process block
[0024] Upon initializing the system or tracking the upper body of the subject to its new position, three-dimensional feature extraction, process block
[0025] Hidden Markov models are well-known processing systems and thus will not be explained in detail. A hidden Markov model is a finite set of states, with each of the states having a probability distribution. Transitions between the states in the model are governed by a set of probabilities that are referred to as transition probabilities. While in a particular state of the model, an observation can be made, but the actual state is not observable. For this reason, the states are referred to as hidden. In a particular embodiment, a continuous five-state, left-to-right hidden Markov model is utilized. In this embodiment, no skip states are allowed and each state is modeled by a mixture of three Gaussian density functions. The model is illustrated in
[0026] An equipment arrangement for a particular embodiment is shown in
[0027] In an embodiment, a statistical framework for upper body segmentation is used. An embodiment includes tracking of the upper body from stereo images, and uses the trajectories of the hands of the subject as observations for HMM-based three-dimensional gesture recognition. Dense disparity maps, generated from the stereo images, are used in the system. The system provides accurate gesture recognition when encountering varying illumination conditions, partial occlusions, and self-occlusions. Unlike conventional gesture recognition systems that require a user-guided initialization, the approach to upper body segmentation under an embodiment makes use of a minimal set of assumptions regarding the relative position of the subject to the image capturing device for initialization. Following the initialization, the model parameters are tracked over consecutive frames and the new values of the parameters are updated, or re-initialized, using an expectation maximization algorithm. The three-dimensional positions of the hands of the subject are used as observation vectors for the gesture recognition system.
[0028] According to an embodiment, the video sequence is a novel stereo image of the subject. According to one embodiment, a depth disparity map is generated from the stereo image. According to another embodiment, the stereo image is obtained from a stereo camera that generates the needed depth information without the need for additional depth disparity map generation. The use of such a stereo camera allows the operation of the system without the need for the large number of computations that are required to generate a depth disparity map.
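As an illustration of how depth can be recovered from a disparity map, the following is a minimal sketch that assumes a rectified stereo pair and the standard pinhole relation Z = f·B/d. The function name, the parameter names, and the convention that non-positive disparity marks an invalid pixel are assumptions made for this example and are not taken from the specification.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) to depth (metres) for a rectified stereo pair.

    Assumption: pixels with non-positive disparity carry no valid depth
    and are returned as NaN.
    """
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > 0
    # Standard pinhole stereo relation: Z = f * B / d
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Hypothetical camera values chosen only for illustration.
depth = disparity_to_depth([[8.0, 0.0], [16.0, 4.0]],
                           focal_length_px=400.0, baseline_m=0.1)
```

Larger disparities map to smaller depths, so nearby body parts such as the hands produce the largest disparity values.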
[0029] Additional details regarding the gesture recognition system are provided as follows:
[0030] Image Model and Upper Body Model—The statistical model for the upper body consists of a set of three planar components, describing the torso and the arms of the subject, and a set of three Gaussian blob components, representing the head and hands of the subject. For the purposes of this description, the parameters of each planar component (the mth planar component) are referred to as π
[0031] In an image of the subject, an observation vector O
[0032] and of the color of the pixel in the image space O
[0033] If it is assumed that all of the observation vectors are independent, then the probability of a particular observation sequence
[0034] where P(O
[0035] where u
[0036] After the initialization of the upper body model, the values of the a priori probabilities are estimated from the corresponding parameters of the model states. Probabilities P(O
[0037] where μ is the mean vector and C is the covariance of the Gaussian probability density function. For the purposes of an embodiment of the gesture recognition system, the parameters of the Gaussian components are designated as β=(μ,C). Because the color distribution and the three-dimensional position can be considered to be independent random variables, the probability of the observation vectors O
[0038] In equation [5], P(O
[0039] From equation [6], it then can be discerned that the planar probability density function describes a Gaussian distribution with mean μ=ax
[0040] Upper Body Segmentation—Model Initialization—The optimal set of parameters for the upper body model is obtained through an expectation maximization (EM) algorithm by maximizing P(
[0041] The initialization process is in essence a sequence of two-class classification problems that are repeated for each component of the model. In each of these problems, the data is assigned either to one component of the upper body or to a “residual” class of the remaining unassigned data. The data assigned to the residual class in the first classification problem becomes the input to the second classification process, where it either is re-assigned to the next body component or becomes a part of the new residual class. This process is repeated until all of the data is classified or until all of the upper body components are initialized. The remaining residual class is modeled by a uniform distribution. Note that the embodiment described herein utilizes a particular order of segmentation, but those in the field will recognize that other segmentation orders are possible and embodiments are not limited to the description provided herein.
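The cascade of two-class problems described above can be sketched as follows. This is only an illustration of the control flow: the function name and the simple predicates are hypothetical stand-ins, whereas in the described embodiment each two-class decision is made with the statistical model for that body component.

```python
import numpy as np

def sequential_segmentation(points, classifiers):
    """Assign each point to the first component that claims it.

    classifiers: ordered list of (label, predicate) pairs, where predicate
    maps an (N, D) array to a boolean mask of length N (hypothetical
    stand-in for a per-component two-class classifier).
    Points claimed by no component keep the label "residual".
    """
    points = np.asarray(points, dtype=float)
    labels = np.full(len(points), "residual", dtype=object)
    remaining = np.arange(len(points))  # indices still in the residual class
    for label, belongs in classifiers:
        if remaining.size == 0:
            break  # all of the data has been classified
        mask = np.asarray(belongs(points[remaining]), dtype=bool)
        labels[remaining[mask]] = label
        remaining = remaining[~mask]  # residual feeds the next problem
    return labels

# Toy one-dimensional data and thresholds, for illustration only.
points = np.array([[0.5], [1.5], [2.5], [0.2]])
labels = sequential_segmentation(points, [
    ("torso", lambda p: p[:, 0] < 1.0),
    ("head", lambda p: p[:, 0] < 2.0),
])
```

Because each stage sees only the residual of the previous stage, a point can be assigned to at most one component, and the order of the list determines the segmentation order.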
[0042] A block diagram of an initialization process is shown in
[0043] Embodiments of the initialization segmentation processes are described in more depth as follows:
[0044] Background-Foreground Segmentation—The first process of the model initialization is the background segmentation. All of the pixels in the image that are farther away from the camera than a predetermined threshold, or for which there is not valid depth information, are assigned to the background. The remaining pixels are assigned to the upper body. If a stationary background is assumed, then the use of colors may improve the segmentation results. However, a stationary background is often a difficult condition to maintain, and making the wrong assumption on the background statistics can dramatically decrease the accuracy of the segmentation results. For this reason, in a particular embodiment the depth information alone is used for background-foreground segmentation.
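A depth-only foreground mask of the kind described above might be sketched as follows. The function name, the encoding of invalid depth as NaN or a non-positive value, and the threshold value are assumptions made for this example.

```python
import numpy as np

def segment_foreground(depth, max_depth):
    """Return a boolean mask that is True for foreground (upper body) pixels.

    A pixel is background if it is farther than max_depth or has no valid
    depth. Assumption: invalid depth is encoded as NaN, inf, or a
    non-positive value.
    """
    depth = np.asarray(depth, dtype=float)
    # Map non-finite values to +inf so they fail the threshold test below.
    d = np.where(np.isfinite(depth), depth, np.inf)
    return (d > 0) & (d <= max_depth)

# Toy depth map in metres; the 2.0 m threshold is illustrative only.
depth_map = np.array([[1.2, 3.5],
                      [np.nan, 0.0]])
foreground = segment_foreground(depth_map, max_depth=2.0)
```

Only the pixel at 1.2 m is kept here; the far pixel, the NaN pixel, and the zero-depth pixel all fall into the background.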
[0045]
[0046] Torso Segmentation—Any pixels assigned to the foreground are either generated by the torso plane or by the residual class of uniform distribution. Assuming that all observation vectors are independent random variables, the probability of observation vectors O
[0047] where u
[0048]
[0049] With the covariance matrix C being:
[0050] From this, the a posteriori probability γ
[0051] The EM algorithm is repeated until convergence is reached, which is when P(
[0052]
[0053] Head Segmentation—The initial position of the head is determined by searching the area above the torso. However, it is possible that the head was included within the torso plane and the area above the torso contains a small number of noisy points. In this case, the system looks for the head in the upper region of the torso. Exploiting the depth information further, the apparent size of the head in image plane can be obtained from the distance and orientation of the torso plane from the camera. The probability of the observation sequence O
[0054] In equation [16], u
[0055] The pixels for each P(O
[0056] Arms Segmentation—The arms are modeled by planar density functions. The planar density model does not restrict the natural degrees of freedom in the arm motion and provides a good description of the data available for the arms in stereo images. The parameters of the planes corresponding to the left and right arms are obtained using the same equations used for the torso plane. The regions of search for the left and right arms consist of the pixels on the left and right side of the torso center that were not previously assigned to the torso or the head.
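To illustrate how the parameters of a planar component could be estimated from pixels that are softly assigned to it, the sketch below fits a plane by weighted least squares. It is a generic technique and not a reconstruction of the equations referenced in this specification; the parameterization z = a·x + b·y + c, the function name, and the use of per-pixel weights as membership probabilities are all assumptions.

```python
import numpy as np

def fit_plane_weighted(x, y, z, weights):
    """Weighted least-squares fit of z = a*x + b*y + c.

    weights: non-negative per-pixel weights (assumed here to be the
    probability that each pixel belongs to this plane).
    Returns (a, b, c) and the weighted residual variance.
    """
    x, y, z, w = (np.asarray(v, dtype=float) for v in (x, y, z, weights))
    A = np.column_stack([x, y, np.ones_like(x)])
    sw = np.sqrt(w)  # scaling rows by sqrt(w) gives the weighted problem
    coeffs, *_ = np.linalg.lstsq(A * sw[:, None], z * sw, rcond=None)
    residuals = z - A @ coeffs
    variance = np.sum(w * residuals ** 2) / np.sum(w)
    return coeffs, variance

# Synthetic points lying exactly on z = 2x - y + 3, for illustration.
xs = np.array([0.0, 1.0, 0.0, 1.0, 2.0])
ys = np.array([0.0, 0.0, 1.0, 1.0, 3.0])
zs = 2.0 * xs - ys + 3.0
plane, var = fit_plane_weighted(xs, ys, zs, np.ones_like(xs))
```

In an EM-style loop, the weights would be recomputed from the fitted plane and the fit repeated until the parameters stop changing.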
[0057]
[0058] Hands Segmentation—The hands are modeled using Gaussian density functions. Similar to the modeling of the head of the subject, the observations for the hand models consist of the three-dimensional position and the hue value of the pixels. Several conventional approaches to gesture recognition use a priori information about skin color to detect the hands and/or the face in an image. However, these approaches often fail in environments characterized by strong variations in illumination. Instead, an embodiment initializes the position of the hands by finding the regions of the arm planes that have a color similar to the hue value obtained from the head segmentation. The parameters of the hand Gaussian blobs are then determined using the same EM algorithm for the Gaussian density functions used to estimate the parameters of the head blob.
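The idea of seeding the hand positions from pixels on the arm planes whose hue resembles that of the head can be sketched as follows. The function name, the hue range of [0, 1), the fixed tolerance, and the use of a circular hue distance are assumptions made for this example.

```python
import numpy as np

def hue_similarity_mask(hue, reference_hue, tolerance, region_mask):
    """Select pixels inside region_mask whose hue is close to reference_hue.

    Assumptions: hue values lie in [0, 1) and wrap around, so the
    distance is measured on a circle; reference_hue stands for the hue
    obtained from the head segmentation.
    """
    hue = np.asarray(hue, dtype=float)
    diff = np.abs(hue - reference_hue) % 1.0
    circular = np.minimum(diff, 1.0 - diff)  # shortest way around the circle
    return np.asarray(region_mask, dtype=bool) & (circular <= tolerance)

# Toy hue image and arm-plane mask, for illustration only.
hue_image = np.array([[0.02, 0.50],
                      [0.97, 0.05]])
arm_region = np.array([[True, True],
                       [True, False]])
hand_seed = hue_similarity_mask(hue_image, reference_hue=0.0,
                                tolerance=0.04, region_mask=arm_region)
```

The pixel with hue 0.97 is accepted because hue wraps around, while the pixel with hue 0.05 is rejected because it lies outside the arm region.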
[0059]
[0060] Tracking the Upper Body Model—The initial parameters obtained individually for the torso, head, arms, and hands are refined by estimating them simultaneously. The optimal set of parameters for the upper body model are obtained through the EM algorithm by setting the derivatives of E{P(
[0061] In the E (estimation) process, the new set of the plane parameters are re-estimated according to the equations [8] through [11] and the Gaussian blob parameters are re-estimated using equations [17] and [18]. The pixels for which
[0062] are assigned to plane π
[0063] are assigned to Gaussian blob β
[0064] Gesture Recognition—Hidden Markov models (HMM) are a popular tool for the classification of dynamic gestures because of the flexibility of such models in the modeling of signals while preserving the essential structure of the hand gestures. An embodiment herein provides an HMM-based recognition system for gesture recognition that uses the trajectory of the hands of the subject in three-dimensional space as observation vectors. Although the hand trajectories in the image plane are conventional features for gesture recognition, the trajectories in a two-dimensional image plane cannot unambiguously describe the motion of the hands in a plane perpendicular to the image plane. The use of disparity maps enables the trajectory of the hands in three-dimensional space to be obtained, and these trajectories are used as observation vectors in an embodiment. Further, the use of disparity maps in combination with color information results in a robust segmentation of the upper body that is largely independent of illumination conditions or changes in the background scene.
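The structure of a five-state, left-to-right hidden Markov model with no skip transitions, as mentioned earlier in this description, can be illustrated by its transition matrix. The sketch below is a minimal example; the function name and the initial self-transition probability of 0.5 are assumptions, and in practice the non-zero entries would be learned from training sequences.

```python
import numpy as np

def left_to_right_transitions(n_states=5, stay_prob=0.5):
    """Build a left-to-right transition matrix with no skip transitions.

    Each state may only remain where it is or advance to the next state;
    the final state is absorbing. stay_prob is an assumed starting value.
    """
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = stay_prob            # remain in the current state
        A[i, i + 1] = 1.0 - stay_prob  # advance to the next state
    A[-1, -1] = 1.0                    # last state has nowhere to go
    return A

A = left_to_right_transitions()
```

The zero entries encode the model topology: training may change the non-zero probabilities, but a transition that starts at zero remains zero under standard Baum-Welch re-estimation, so the left-to-right structure is preserved.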
[0065] The use of dense disparity maps for gesture recognition is helpful because stereo is considerably more robust than color alone to variations in illumination conditions, and because depth disparity maps reduce the inherent depth ambiguity present in two-dimensional images and therefore enable more accurate segmentation of images under partial occlusions and self-occlusions.
[0066] The use of depth disparity maps adds some complications to the gesture recognition process. Stereo algorithms are often difficult and laborious to develop and are computationally expensive. Correspondence-based stereo algorithms may produce noisy disparity maps. However, consumer stereo cameras have become more available and the performance of personal computers has increased such that stereo computation can be done at reasonable frame rates. An example of a camera that is used in an embodiment is the Digiclops Stereo Vision System of Point Grey Research, Inc. of Vancouver, British Columbia. Since the performance of a dynamic gesture recognition system greatly depends on the quality of the observation vector sequences, the use of stereo images in a system requires extra care. The use of depth maps instead of color information to describe the upper body model is one very important element in building a system that provides robust performance under varying illumination conditions, shadow effects, non-stationary background scenes, and occlusions and self-occlusions of the upper body.
[0067] In the foregoing specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.