[0001] This application claims priority from U.S. Provisional Application No. 60/220,223, filed Jul. 24, 2000, and titled VIDEO-BASED IMAGE CONTROL SYSTEM, which is incorporated by reference.
[0002] This invention relates to an image processing system, and more particularly to a video-based image control system for processing stereo image data.
[0003] A variety of operating systems are currently available for interacting with and controlling a computer system. Many of these operating systems use standardized interfaces based on commonly accepted graphical user interface (GUI) functions and control techniques. As a result, different computer platforms and user applications can be easily controlled by a user who is relatively unfamiliar with the platform and/or application, as the functions and control techniques are generally common from one GUI to another.
[0004] One commonly accepted control technique is the use of a mouse or trackball style pointing device to move a cursor over screen objects. An action, such as clicking (single or double) on the object, executes a GUI function. However, for someone who is unfamiliar with operating a computer mouse, selecting GUI functions may present a challenge that prevents them from interfacing with the computer system. There also exist situations where it becomes impractical to provide access to a computer mouse or trackball, such as in front of a department store display window on a city street, or where the user is physically challenged.
[0005] In one general aspect, a method of using stereo vision to interface with a computer is disclosed. The method includes capturing a stereo image and processing the stereo image to determine position information of an object in the stereo image. The object may be controlled by a user. The method further includes using the position information to allow the user to interact with a computer application.
[0006] The step of capturing the stereo image may include capturing the stereo image using a stereo camera. The method also may include recognizing a gesture associated with the object by analyzing changes in the position information of the object, and controlling the computer application based on the recognized gesture. The method also may include determining an application state of the computer application, and using the application state in recognizing the gesture. The object may be the user. In another instance, the object is a part of the user. The method may include providing feedback to the user relative to the computer application.
[0007] In the above implementation, processing the stereo image to determine position information of the object may include mapping the position information from position coordinates associated with the object to screen coordinates associated with the computer application. Processing the stereo image also may include processing the stereo image to identify feature information and produce a scene description from the feature information.
[0008] Processing the stereo image also may include analyzing the scene description to identify a change in position of the object and mapping the change in position of the object. Processing the stereo image to produce the scene description also may include processing the stereo image to identify matching pairs of features in the stereo image, and calculating a disparity and a position for each matching feature pair to create a scene description.
[0009] The method may include analyzing the scene description in a scene analysis process to determine position information of the object.
[0010] Capturing the stereo image may include capturing a reference image from a reference camera and a comparison image from a comparison camera, and processing the stereo image also may include processing the reference image and the comparison image to create pairs of features.
[0011] Processing the stereo image to identify matching pairs of features in the stereo image also may include identifying features in the reference image, generating for each feature in the reference image a set of candidate matching features in the comparison image, and producing a feature pair by selecting a best matching feature from the set of candidate matching features for each feature in the reference image. Processing the stereo image also may include filtering the reference image and the comparison image.
[0012] Producing the feature pair may include calculating a match score and rank for each of the candidate matching features, and selecting the candidate matching feature with the highest match score to produce the feature pair.
[0013] Generating, for each feature in the reference image, a set of candidate matching features may include selecting candidate matching features from a predefined range in the comparison image.
[0014] Feature pairs may be eliminated based upon the match score of the candidate matching feature. Feature pairs also may be eliminated if the match score of the top ranking candidate matching feature is below a predefined threshold. The feature pair may be eliminated if the match score of the top ranking candidate matching feature is within a predefined threshold of the match score of a lower ranking candidate matching feature.
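The selection and elimination rules of the two preceding paragraphs can be sketched as follows. This is a minimal illustration only: the function name `select_match`, the score scale, and the default values of `min_score` and `ambiguity_margin` are hypothetical and are not taken from the disclosure.

```python
from typing import Optional, Sequence, Tuple


def select_match(
    candidates: Sequence[Tuple[int, float]],
    min_score: float = 0.6,         # hypothetical "predefined threshold"
    ambiguity_margin: float = 0.05  # hypothetical margin to the runner-up
) -> Optional[int]:
    """Return the id of the best candidate, or None if the pair is eliminated.

    candidates holds (candidate_id, match_score) tuples; a higher score
    is assumed to mean a better match.
    """
    if not candidates:
        return None
    # Rank the candidates from highest to lowest match score.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_id, best_score = ranked[0]
    # Eliminate the pair if the top-ranking score is too low.
    if best_score < min_score:
        return None
    # Eliminate the pair if a lower-ranking candidate scores nearly as well.
    if len(ranked) > 1 and best_score - ranked[1][1] < ambiguity_margin:
        return None
    return best_id
```

In this sketch an ambiguous feature is discarded rather than guessed, which trades density of the scene description for fewer false matches.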
[0015] Calculating the match score may include identifying those feature pairs that are neighboring, adjusting the match score of feature pairs in proportion to the match score of neighboring candidate matching features at similar disparity, and selecting the candidate matching feature with the highest adjusted match score to create the feature pair.
[0016] Feature pairs may be eliminated by applying the comparison image as the reference image and the reference image as the comparison image to produce a second set of feature pairs, and eliminating those feature pairs in the original set of feature pairs which do not have a corresponding feature pair in the second set of feature pairs.
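The two-pass elimination described above may be illustrated with a short sketch. It assumes, hypothetically, that each matching pass is represented as a dictionary from a feature identifier in one image to the identifier of its selected match in the other image; the name `cross_check` is illustrative.

```python
from typing import Dict, Hashable


def cross_check(
    forward: Dict[Hashable, Hashable],
    backward: Dict[Hashable, Hashable],
) -> Dict[Hashable, Hashable]:
    """Keep only pairs that are confirmed by the reverse matching pass.

    forward maps reference feature -> comparison feature (original pass).
    backward maps comparison feature -> reference feature (roles swapped).
    """
    return {
        ref: comp
        for ref, comp in forward.items()
        if backward.get(comp) == ref
    }


# Usage: only "a" is matched consistently in both directions.
pairs = cross_check({"a": "x", "b": "y", "c": "z"}, {"x": "a", "y": "c"})
```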
[0017] The method may include, for each feature pair in the scene description, calculating real world coordinates by transforming the disparity and position of each feature pair relative to the real world coordinates of the stereo image. Selecting features may include dividing the reference image and the comparison image of the stereo image into blocks. The feature may be described by a pattern of luminance of the pixels contained within the blocks. Dividing also may include dividing the images into pixel blocks having a fixed size. The pixel blocks may be 8×8 pixel blocks.
[0018] Analyzing the scene description to determine the position information of the object also may include cropping the scene description to exclude feature information lying outside of a region of interest in a field of view. Cropping may include establishing a boundary of the region of interest.
[0019] Analyzing the scene description to determine the position information of the object also may include clustering the feature information in a region of interest into clusters having a collection of features by comparison to neighboring feature information within a predefined range, and calculating a position for each of the clusters. Analyzing the scene description also may include eliminating those clusters having fewer than a predefined threshold number of features.
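The clustering and cluster elimination described above can be sketched as a simple neighbor-growing procedure. This is one possible reading, not the disclosed implementation: the Euclidean distance test, the use of a centroid as the cluster position, and the names `cluster_features`, `centroid`, `max_gap`, and `min_size` are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def cluster_features(
    points: Sequence[Point], max_gap: float, min_size: int
) -> List[List[Point]]:
    """Group features lying within max_gap of a neighbor; drop small clusters."""
    unvisited = set(range(len(points)))
    clusters: List[List[Point]] = []
    while unvisited:
        seed = unvisited.pop()
        members = [seed]
        frontier = [seed]
        while frontier:
            i = frontier.pop()
            # Collect unvisited features within the predefined range of i.
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= max_gap]
            for j in near:
                unvisited.remove(j)
            members.extend(near)
            frontier.extend(near)
        # Eliminate clusters with fewer than min_size features.
        if len(members) >= min_size:
            clusters.append([points[k] for k in sorted(members)])
    return clusters


def centroid(cluster: Sequence[Point]) -> Point:
    """Mean position of a non-empty cluster, used here as its position."""
    n = len(cluster)
    return tuple(sum(p[axis] for p in cluster) / n for axis in range(3))
```

The pairwise search is quadratic in the number of features; with block-based features this may be acceptable, but a spatial grid would be a natural refinement.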
[0020] Analyzing the scene description also may include selecting the position of the clusters that match predefined criteria, recording the position of the clusters that match the predefined criteria as object position coordinates, and outputting the object position coordinates. The method also may include determining the presence of a user from the clusters by checking features within a presence detection region. Calculating the position for each of the clusters may exclude those features in the clusters that are outside of an object detection region.
[0021] The method may include defining a dynamic object detection region based on the object position coordinates. Additionally, the dynamic object detection region may be defined relative to a user's body.
[0022] The method may include defining a body position detection region based on the object position coordinates. Defining the body position detection region also may include detecting a head position of the user. The method also may include smoothing the motion of the object position coordinates to eliminate jitter between consecutive image frames.
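Smoothing of the object position coordinates between consecutive frames could, for example, take the form of an exponential moving average. The filter type is an assumption made for illustration; the class name `PositionSmoother` and the parameter `alpha` are hypothetical.

```python
from typing import Optional, Sequence, Tuple


class PositionSmoother:
    """Exponential moving average over successive position samples."""

    def __init__(self, alpha: float = 0.5) -> None:
        # alpha near 1 follows the input closely; near 0 smooths heavily.
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self._state: Optional[Tuple[float, ...]] = None

    def update(self, position: Sequence[float]) -> Tuple[float, ...]:
        """Blend the new sample with the previous smoothed position."""
        if self._state is None:
            # The first frame initializes the filter without smoothing.
            self._state = tuple(float(p) for p in position)
        else:
            self._state = tuple(
                self.alpha * p + (1.0 - self.alpha) * s
                for p, s in zip(position, self._state)
            )
        return self._state
```

A fixed `alpha` trades jitter suppression against lag; a value that varies with the speed of motion would be one way to reduce lag during fast movements.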
[0023] The method may include calculating hand orientation information from the object position coordinates. Outputting the object position coordinates may include outputting the hand orientation information. Calculating hand orientation information also may include smoothing the changes in the hand orientation information.
[0024] Defining the dynamic object detection region also may include identifying a position of a torso-divisioning plane from the collection of features, and determining the position of a hand detection region relative to the torso-divisioning plane in the axis perpendicular to the torso-divisioning plane.
[0025] Defining the dynamic object detection region may include identifying a body center position and a body boundary position from the collection of features, identifying a position indicating part of an arm of the user from the collection of features using the intersection of the feature pair cluster with the torso-divisioning plane, and identifying the arm as either a left arm or a right arm using the arm position relative to the body position.
[0026] This method also may include establishing a shoulder position from the body center position, the body boundary position, the torso-divisioning plane, and the left arm or the right arm identification. Defining the dynamic object detection region may include determining position data for the hand detection region relative to the shoulder position.
[0027] This technique may include smoothing the position data for the hand detection region. Additionally, this technique may include determining the position of the dynamic object detection region relative to the torso-divisioning plane in the axis perpendicular to the torso-divisioning plane, determining the position of the dynamic object detection region in the horizontal axis relative to the shoulder position, and determining the position of the dynamic object detection region in the vertical axis relative to an overall height of the user using the body boundary position.
[0028] Defining the dynamic object detection region may include establishing the position of a top of the user's head using topmost feature pairs of the collection of features unless the topmost feature pairs are at the boundary, and determining the position of a hand detection region relative to the top of the user's head.
[0029] In another aspect, a method of using stereo vision to interface with a computer is disclosed. The method includes capturing a stereo image using a stereo camera, and processing the stereo image to determine position information of an object in the stereo image, wherein the object is controlled by a user. The method further includes processing the stereo image to identify feature information, to produce a scene description from the feature information, and to identify matching pairs of features in the stereo image. The method also includes calculating a disparity and a position for each matching feature pair to create the scene description, and analyzing the scene description in a scene analysis process to determine position information of the object. The method may include clustering the feature information in a region of interest into clusters having a collection of features by comparison to neighboring feature information within a predefined range, calculating a position for each of the clusters, and using the position information to allow the user to interact with a computer application.
[0030] Additionally, this technique may include mapping the position of the object from the feature information from camera coordinates to screen coordinates associated with the computer application, and using the mapped position to interface with the computer application.
[0031] The method may include recognizing a gesture associated with the object by analyzing changes in the position information of the object in the scene description, and combining the position information and the gesture to interface with the computer application. The step of capturing the stereo image may include capturing the stereo image using a stereo camera.
[0032] In another aspect, a stereo vision system for interfacing with an application program running on a computer is disclosed. The stereo vision system includes first and second video cameras arranged in an adjacent configuration and operable to produce a series of stereo video images. A processor is operable to receive the series of stereo video images and detect objects appearing in an intersecting field of view of the cameras. The processor executes a process to define an object detection region in three-dimensional coordinates relative to a position of the first and second video cameras, select a control object appearing within the object detection region, and map position coordinates of the control object to a position indicator associated with the application program as the control object moves within the object detection region.
[0033] The process may select as a control object a detected object appearing closest to the video cameras and within the object detection region. The control object may be a human hand.
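Selection of the control object may be sketched as follows. The sketch assumes, for illustration only, that each detected object is reduced to a single (x, y, z) point with z measuring distance from the video cameras, and that the object detection region is an axis-aligned box; the name `select_control_object` is hypothetical.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Tuple[float, float], Tuple[float, float], Tuple[float, float]]


def select_control_object(
    objects: Sequence[Point], region: Box
) -> Optional[Point]:
    """Return the object inside the region that is closest to the cameras.

    region gives (low, high) bounds per axis; z is assumed to be the
    distance from the cameras, so the smallest z is the closest object.
    """
    inside = [
        obj for obj in objects
        if all(lo <= c <= hi for c, (lo, hi) in zip(obj, region))
    ]
    return min(inside, key=lambda obj: obj[2], default=None)
```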
[0034] A horizontal position of the control object relative to the video cameras may be mapped to an x-axis screen coordinate of the position indicator. A vertical position of the control object relative to the video cameras may be mapped to a y-axis screen coordinate of the position indicator.
[0035] The processor may be configured to map a horizontal position of the control object relative to the video cameras to an x-axis screen coordinate of the position indicator, map a vertical position of the control object relative to the video cameras to a y-axis screen coordinate of the position indicator, and emulate a mouse function using the combined x-axis and y-axis screen coordinates provided to the application program.
[0036] The processor may be configured to emulate buttons of a mouse using gestures derived from the motion of the object position. The processor may be configured to emulate buttons of a mouse based upon a sustained position of the control object in any position within the object detection region for a predetermined time period. In other instances, the processor may be configured to emulate buttons of a mouse based upon a position of the position indicator being sustained within the bounds of an interactive display region for a predetermined time period. The processor may be configured to map a z-axis depth position of the control object relative to the video cameras to a virtual z-axis screen coordinate of the position indicator.
[0037] The processor may be configured to map an x-axis position of the control object relative to the video cameras to an x-axis screen coordinate of the position indicator, map a y-axis position of the control object relative to the video cameras to a y-axis screen coordinate of the position indicator, and map a z-axis depth position of the control object relative to the video cameras to a virtual z-axis screen coordinate of the position indicator.
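A mapping of this kind might be implemented as a clamped linear interpolation, as in the sketch below. The linear form, the inversion of the vertical axis (screen y is assumed to grow downward while camera y grows upward), the normalized range of the virtual z coordinate, and the name `map_to_screen` are assumptions made for illustration.

```python
from typing import Tuple

Range = Tuple[float, float]


def _normalize(value: float, lo: float, hi: float) -> float:
    """Scale value from [lo, hi] to [0, 1], clamping outside values."""
    t = (value - lo) / (hi - lo)
    return min(1.0, max(0.0, t))


def map_to_screen(
    pos: Tuple[float, float, float],
    region: Tuple[Range, Range, Range],
    screen_size: Tuple[int, int],
) -> Tuple[int, int, float]:
    """Map a control object position to (screen_x, screen_y, virtual_z)."""
    x, y, z = pos
    x_range, y_range, z_range = region
    width, height = screen_size
    sx = round(_normalize(x, *x_range) * (width - 1))
    # Invert y: the top of the region maps to the top row of the screen.
    sy = round((1.0 - _normalize(y, *y_range)) * (height - 1))
    sz = _normalize(z, *z_range)  # virtual depth in [0, 1]
    return sx, sy, sz
```

Clamping keeps the position indicator on screen when the control object drifts slightly outside the object detection region.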
[0038] A position of the position indicator being within the bounds of an interactive display region may trigger an action within the application program. Movement of the control object along a z-axis depth position that covers a predetermined distance within a predetermined time period may trigger a selection action within the application program.
[0039] A position of the control object being sustained in any position within the object detection region for a predetermined time period may trigger part of a selection action within the application program.
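Triggering on a sustained position can be sketched with a small dwell timer. The tolerance used to decide that the control object has stayed in place, the one-shot behavior, and the names `DwellDetector`, `hold_seconds`, and `tolerance` are hypothetical choices not specified in the disclosure.

```python
import math
from typing import Optional, Sequence, Tuple


class DwellDetector:
    """Fires once when a position is held within a tolerance for a set time."""

    def __init__(self, hold_seconds: float = 1.0, tolerance: float = 0.05) -> None:
        self.hold_seconds = hold_seconds  # the "predetermined time period"
        self.tolerance = tolerance        # allowed drift while still "sustained"
        self._anchor: Optional[Tuple[float, ...]] = None
        self._since = 0.0
        self._fired = False

    def update(self, position: Sequence[float], timestamp: float) -> bool:
        """Feed one frame; return True on the frame the dwell completes."""
        moved = (
            self._anchor is None
            or math.dist(position, self._anchor) > self.tolerance
        )
        if moved:
            # Restart the timer at the new position.
            self._anchor = tuple(position)
            self._since = timestamp
            self._fired = False
            return False
        if not self._fired and timestamp - self._since >= self.hold_seconds:
            self._fired = True  # fire once per sustained position
            return True
        return False
```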
[0040] In another aspect, a stereo vision system for interfacing with an application program running on a computer is disclosed. The stereo vision system includes first and second video cameras arranged in an adjacent configuration and operable to produce a series of stereo video images. A processor is operable to receive the series of stereo video images and detect objects appearing in the intersecting field of view of the cameras. The processor executes a process to define an object detection region in three-dimensional coordinates relative to a position of the first and second video cameras, select as a control object a detected object appearing closest to the video cameras and within the object detection region, define sub regions within the object detection region, identify a sub region occupied by the control object, associate with that sub region an action that is activated when the control object occupies that sub region, and apply the action to interface with a computer application.
[0041] The action associated with the sub region is further defined to be an emulation of the activation of keys associated with a computer keyboard. A position of the control object being sustained in any sub region for a predetermined time period may trigger the action.
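The association of sub regions with keyboard keys may be illustrated by dividing the object detection region into a regular grid. The regular grid, the use of only the x and y axes, and the name `key_for_position` are assumptions; the disclosure does not limit how sub regions are shaped.

```python
from typing import Optional, Sequence, Tuple


def key_for_position(
    x: float, y: float,
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    key_grid: Sequence[Sequence[str]],
) -> Optional[str]:
    """Return the key assigned to the sub region containing (x, y).

    key_grid[row][col] names the key for each grid cell; None is
    returned when the position is outside the detection region.
    """
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    if not (x_lo <= x < x_hi and y_lo <= y < y_hi):
        return None
    rows, cols = len(key_grid), len(key_grid[0])
    col = int((x - x_lo) / (x_hi - x_lo) * cols)
    row = int((y - y_lo) / (y_hi - y_lo) * rows)
    return key_grid[row][col]


# Usage: two side-by-side sub regions emulating the left and right arrow keys.
key = key_for_position(0.75, 0.5, (0.0, 1.0), (0.0, 1.0), [["LEFT", "RIGHT"]])
```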
[0042] In yet another aspect, a stereo vision system for interfacing with an application program running on a computer is disclosed. First and second video cameras are arranged in an adjacent configuration and are operable to produce a series of stereo video images. A processor is operable to receive the series of stereo video images and detect objects appearing in an intersecting field of view of the cameras. The processor executes a process to identify an object perceived as the largest object appearing in the intersecting field of view of the cameras and positioned at a predetermined depth range, select the object as an object of interest, determine a position coordinate representing a position of the object of interest, and use the position coordinate as an object control point to control the application program.
[0043] The process also may cause the processor to determine and store a neutral control point position, map a coordinate of the object control point relative to the neutral control point position, and use the mapped object control point coordinate to control the application program.
[0044] The process may cause the processor to define a region having a position based upon the position of the neutral control point position, map the object control point relative to its position within the region, and use the mapped object control point coordinate to control the application program. The process also may cause the processor to transform the mapped object control point to a velocity function, determine a viewpoint associated with a virtual environment of the application program, and use the velocity function to move the viewpoint within the virtual environment.
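The transformation of a mapped object control point to a velocity function might look like the following sketch. The dead zone around the neutral control point position, the linear gain, the speed limit, and the name `velocity_from_offset` are illustrative assumptions rather than features of the disclosed process.

```python
import math
from typing import Sequence, Tuple


def velocity_from_offset(
    control: Sequence[float],
    neutral: Sequence[float],
    dead_zone: float = 0.05,  # hypothetical radius of zero velocity per axis
    gain: float = 2.0,        # hypothetical speed per unit of offset
    max_speed: float = 1.0,   # hypothetical speed limit
) -> Tuple[float, ...]:
    """Convert the control point's offset from neutral into a velocity."""
    velocity = []
    for c, n in zip(control, neutral):
        offset = c - n
        if abs(offset) <= dead_zone:
            # Small offsets near the neutral position produce no motion.
            velocity.append(0.0)
        else:
            speed = min(max_speed, gain * (abs(offset) - dead_zone))
            velocity.append(math.copysign(speed, offset))
    return tuple(velocity)
```

Applying the returned velocity to the viewpoint each frame moves it through the virtual environment for as long as the user stays displaced from the neutral position.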
[0045] The process may cause the processor to map a coordinate of the object control point to control a position of an indicator within the application program. In this implementation the indicator may be an avatar.
[0046] The process may cause the processor to map a coordinate of the object control point to control an appearance of an indicator within the application program. In this implementation the indicator may be an avatar. The object of interest may be a human appearing within the intersecting field of view.
[0047] In another aspect, a stereo vision system for interfacing with an application program running on a computer is disclosed. The stereo vision system includes first and second video cameras arranged in an adjacent configuration and operable to produce a series of stereo video images. A processor is operable to receive the series of stereo video images and detect objects appearing in an intersecting field of view of the cameras. The processor executes a process to identify an object perceived as the largest object appearing in the intersecting field of view of the cameras and positioned at a predetermined depth range, select the object as an object of interest, define a control region between the cameras and the object of interest, the control region being positioned at a predetermined location and having a predetermined size relative to a size and a location of the object of interest, search the control region for a point associated with the object of interest that is closest to the cameras and within the control region, select the point associated with the object of interest as a control point if the point associated with the object of interest is within the control region, and map position coordinates of the control point, as the control point moves within the control region, to a position indicator associated with the application program.
[0048] The processor may be operable to map a horizontal position of the control point relative to the video cameras to an x-axis screen coordinate of the position indicator, map a vertical position of the control point relative to the video cameras to a y-axis screen coordinate of the position indicator, and emulate a mouse function using a combination of the x-axis and the y-axis screen coordinates.
[0049] Alternatively, the processor also may be operable to map an x-axis position of the control point relative to the video cameras to an x-axis screen coordinate of the position indicator, map a y-axis position of the control point relative to the video cameras to a y-axis screen coordinate of the position indicator, and map a z-axis depth position of the control point relative to the video cameras to a virtual z-axis screen coordinate of the position indicator.
[0050] In the stereo vision system, the object of interest may be a human appearing within the intersecting field of view. Additionally, the control point may be associated with a human hand appearing within the control region.
[0051] In yet another aspect, a stereo vision system for interfacing with an application program running on a computer is disclosed. First and second video cameras are arranged in an adjacent configuration and are operable to produce a series of stereo video images. A processor is operable to receive the series of stereo video images and detect objects appearing in an intersecting field of view of the cameras. The processor executes a process to define an object detection region in three-dimensional coordinates relative to a position of the first and second video cameras, select up to two hand objects from the objects appearing in the intersecting field of view that are within the object detection region, and map position coordinates of the hand objects, as the hand objects move within the object detection region, to positions of virtual hands associated with an avatar rendered by the application program.
[0052] The process may select the up to two hand objects from the objects appearing in the intersecting field of view that are closest to the video cameras and within the object detection region. The avatar may take the form of a human-like body. Additionally, the avatar may be rendered in and interact with a virtual environment forming part of the application program. The processor may execute a process to compare the positions of the virtual hands associated with the avatar to positions of virtual objects within the virtual environment to enable a user to interact with the virtual objects within the virtual environment.
[0053] The processor also may execute a process to detect position coordinates of a user within the intersecting field of view, and map the position coordinates of the user to a virtual torso of the avatar rendered by the application program. The process may move at least one of the virtual hands associated with the avatar to a neutral position if a corresponding hand object is not selected.
[0054] The processor also may execute a process to detect position coordinates of a user within the intersecting field of view, and map the position coordinates of the user to a velocity function that is applied to the avatar to enable the avatar to roam through a virtual environment rendered by the application program. The velocity function may include a neutral position denoting zero velocity of the avatar. The processor also may execute a process to map the position coordinates of the user relative to the neutral position into torso coordinates associated with the avatar so that the avatar appears to lean.
[0055] The processor also may execute a process to compare the position of the virtual hands associated with the avatar to positions of virtual objects within the virtual environment to enable the user to interact with the virtual objects while roaming through the virtual environment.
[0056] As part of the stereo vision system, a virtual knee position associated with the avatar may be derived by the application program and used to refine an appearance of the avatar. Additionally, a virtual elbow position associated with the avatar may be derived by the application program and used to refine an appearance of the avatar.
[0057] The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features and advantages will be apparent from the description and drawings, and from the claims.
[0058]
[0059]
[0060]
[0061]
[0062]
[0063]
[0064]
[0065]
[0066]
[0067] FIGS.
[0068]
[0069]
[0070]
[0071]
[0072]
[0073]
[0074]
[0075]
[0076]
[0077]
[0078] Like reference symbols in the various drawings indicate like elements.
[0079]
[0080] As will be described in greater detail below, the computing apparatus
[0081]
[0082] Certain motions made by the hand(s), which are detected as changes in the position of the hand(s) and/or other features represented as the hand/object position information
[0083] The detection of gestures may be context sensitive, in which case an application state
[0084] The image detector
[0085] Also, few limitations are imposed on the appearance of the user and hand, as it is the general three-dimensional shape of the person and arm that is used to identify the hand. The user
[0086] Typically, the scene description information
[0087] An implementation of a stereo camera image detector
[0088] Turning to
[0089] With respect to
[0090] The epipolar line pairs are dependent on the distortion in the cameras' images and the geometric relationship between the cameras
[0091] One implementation of a stereo analysis process
[0092] A matching process
[0093] Due to occlusion, a reference feature (produced by process
[0094] The result of the above procedure is a depth description map
[0095] This transformed depth description map produced by transformation process
[0096]
[0097] Typically, the region of interest
[0098] Often, parts of the background can be detected to be within the region of interest
[0099] After feature cropping, the next step is to cluster the remaining features into collections of one or more features by way of a feature clustering process
[0100] Continuing with this implementation, clusters are filtered using a cluster filtering process
[0101] The presence or absence of a person is determined by a presence detection module
[0102] In the described implementation of system
[0103] The hand detection region
[0104] In those scenarios where the orientation of the cameras is such that the person's arm is detectable, the orientation is represented as hand orientation coordinates
[0105] An alternative method that also yields good results, in particular when the features are not evenly distributed, is as follows. The position where the arm enters the hand detection region
[0106] A dynamic smoothing process
[0107] The described method by which the hand detection region
[0108] The simplest hand detection region
[0109] In some scenarios, the hand detection region
[0110]
[0111] Using the cluster data
[0112] Process block
[0113] If the user's head is entirely within the region of interest
[0114] In many scenarios, it can be determined whether the user's left or right arm is associated with each hand that is detected in the position calculation block
[0115] The position of the middle of the user's body and the bounds of the user's body are also found in process block
[0116] In process block
[0117] If one hand is identified by process block
[0118] Process blocks
[0119] The smoothed position information output from the dynamic smoothing process
[0120] In summary, the above implementation described by
[0121] Presence/absence or count of users
[0122] For each present user:
[0123] Left/Right bounds of the body or torso
[0124] Center point of the body or torso
[0125] Top of the head (if the head is within the region of interest)
[0126] For each present hand:
[0127] The hand detection region
[0128] A label of “Left”, “Right” (if detectable)
[0129] The position of the tip of the hand
[0130] The orientation of the hand or forearm
[0131] Given improvements in the resolution of the scene description
[0132] This hand/object position information
[0133] Through processing the above information, a variety of human gestures can be detected that are independent of the application
[0134] A large subset of these gestures may be detected using heuristic techniques. The detection process