[0001] This application claims priority from U.S. Provisional Application No. 60/237,187, filed Oct. 3, 2000, and titled DUAL CAMERA CONTROL SYSTEM, which is incorporated by reference.
[0002] This invention relates to an object tracking system, and more particularly to a video camera based object tracking and interface control system.
[0003] A variety of operating systems are currently available for interacting with and controlling a computer system. Many of these operating systems use standardized interface functions based on commonly accepted graphical user interface (GUI) functions and control techniques. As a result, different computer platforms and user applications can be easily controlled by a user who is relatively unfamiliar with the platform and/or application, as the functions and control techniques are generally common from one GUI to another.
[0004] One commonly accepted control technique is the use of a mouse or trackball style pointing device to move a cursor over screen objects. An action, such as clicking (single or double) on the object, executes a GUI function. However, for someone who is unfamiliar with operating a computer mouse, selecting GUI functions may present a challenge that prevents them from interfacing with the computer system. There also exist situations where it becomes impractical to provide access to a computer mouse or trackball, such as in front of a department store display window on a city street, or while standing in front of a large presentation screen to lecture before a group of people.
[0005] In one general aspect, a method of tracking an object of interest is disclosed. The method includes acquiring a first image and a second image representing different viewpoints of the object of interest, and processing the first image into a first image data set and the second image into a second image data set. The method further includes processing the first image data set and the second image data set to generate a background data set associated with a background, and generating a first difference map by determining differences between the first image data set and the background data set, and a second difference map by determining differences between the second image data set and the background data set. The method also includes detecting a first relative position of the object of interest in the first difference map and a second relative position of the object of interest in the second difference map, and producing an absolute position of the object of interest from the first and second relative positions of the object of interest.
[0006] The step of processing the first image into the first image data set and the second image into the second image data set may include determining an active image region for each of the first and second images, and extracting an active image data set from the first and second images contained within the active image region. The step of extracting the active image data set may include one or more techniques of cropping the first and second images, rotating the first and second images, or shearing the first and second images.
[0007] In one implementation, the step of extracting the active image data set may include arranging the active image data set into an image pixel array having rows and columns. The step of extracting further may include identifying the maximum pixel value within each column of the image pixel array, and generating data sets having one row wherein the identified maximum pixel value for each column represents that column.
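As an illustration only, the column-wise reduction described above might be sketched as follows. The function name `reduce_to_max_row` and the sample pixel values are hypothetical and are not taken from the specification.

```python
def reduce_to_max_row(pixels):
    """Collapse a rows-by-columns pixel array into a single row.

    Each output element is the maximum pixel value found in the
    corresponding column of the input array.
    """
    if not pixels:
        return []
    # zip(*pixels) iterates over the array column by column.
    return [max(column) for column in zip(*pixels)]


# Hypothetical 3x3 active image region (rows of pixel intensities).
active_region = [
    [10, 40, 5],
    [12, 35, 90],
    [11, 80, 7],
]
single_row = reduce_to_max_row(active_region)
```

Reducing each image to one row in this way keeps the later background comparison and cluster search one-dimensional.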
[0008] Processing the first image into a first image data set and the second image into a second image data set also may include filtering the first and second images. Filtering may include extracting the edges in the first and second images. Filtering further may include processing the first image data set and the second image data set to emphasize differences between the first image data set and the background data set, and to emphasize differences between the second image data set and the background data set.
[0009] Processing the first image data set and the second image data set to generate the background data set may include generating a first set of one or more background data sets associated with the first image data set, and generating a second set of one or more background data sets associated with the second image data set.
[0010] Generating the first set of one or more background data sets may include generating a first background set representing a maximum value of data within the first image data set representative of the background, and generating the second set of one or more background data sets may include generating a second background set representing a maximum value of data within the second image data set representative of the background. Generating further may include, for the first and second background sets representing the maximum value of data representative of the background, increasing the values contained within the first and second background sets by a predetermined value.
[0011] Generating the first set of one or more background data sets may include generating a first background set representing a minimum value of data within the first image data set representative of the background, and generating the second set of one or more background data sets may include generating a second background set representing a minimum value of data within the second image data set representative of the background. Generating further may include, for the first and second background sets representing the minimum value of data representative of the background, decreasing the values contained within the first and second background sets by a predetermined value.
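A minimal sketch of the maximum and minimum background sets follows, assuming each sample is a single-row data set captured while only the background is visible. The names `background_bounds` and `margin` (standing in for the predetermined value) are assumptions made for illustration.

```python
def background_bounds(samples, margin):
    """Return (lower, upper) background sets from several sampled rows.

    upper holds the per-element maximum raised by `margin`;
    lower holds the per-element minimum reduced by `margin`.
    """
    upper = [max(column) + margin for column in zip(*samples)]
    lower = [min(column) - margin for column in zip(*samples)]
    return lower, upper


# Hypothetical background samples, two elements per row.
samples = [
    [100, 50],
    [104, 48],
    [98, 52],
]
lower, upper = background_bounds(samples, margin=5)
```

Widening the range by a margin makes the later comparison tolerant of small fluctuations, such as sensor noise, that are still consistent with the background.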
[0012] Generating the first set of background data sets may include sampling the first image data set, and generating the second set of background data sets may include sampling the second image data set. Sampling may occur automatically at predefined time intervals, where each sample may include data that is not associated with the background.
[0013] Generating the first set of one or more background data sets may include maintaining multiple samples of the first image data set within each background data set, and generating the second set of one or more background data sets may include maintaining multiple samples of the second image data set within each background data set.
[0014] Generating each first background data set may include selecting from the multiple samples one value that is representative of the background for each element within the first image data set, and generating each second background data set may include selecting from the multiple samples one value that is representative of the background for each element within the second image data set. Selecting may include selecting the median value from all sample values in each of the background data sets.
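The median selection can be illustrated with a short sketch. It assumes each background data set retains several sampled rows, some of which may contain data not associated with the background. The function name `median_background` and the sample values are hypothetical.

```python
from statistics import median


def median_background(samples):
    """Select, for each element, the median of all retained samples."""
    return [median(column) for column in zip(*samples)]


# Five hypothetical samples; two contain outliers (250 and 255) caused
# by an object passing in front of the background during sampling.
samples = [
    [100, 50, 20],
    [101, 51, 250],
    [99, 49, 21],
    [100, 50, 19],
    [102, 255, 20],
]
background = median_background(samples)
```

Because the median ignores extreme values, an occasional sample that includes the object of interest does not corrupt the selected background value, provided the background is visible in more than half of the samples.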
[0015] In other implementations, generating may include comparing the first image data set to a subset of the background data set, and comparing the second image data set to a subset of the background data set.
[0016] In other implementations generating a first difference map further may include representing each element in the first image data set as one of two states, and generating a second difference map further may include representing each element in the second image data set as one of two states, where the two states represent whether the value is consistent with the background.
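One way to picture the two-state difference map is sketched below. It assumes the background is represented by lower and upper bounds per element, which is only one of the background representations described above, and the name `difference_map` is hypothetical.

```python
def difference_map(image_row, lower, upper):
    """Tag each element True if it is inconsistent with the background.

    An element is treated as consistent when its value lies within the
    inclusive range [lower, upper] for that element.
    """
    return [
        not (low <= value <= high)
        for value, low, high in zip(image_row, lower, upper)
    ]


# Hypothetical image row and background bounds.
tagged = difference_map(
    image_row=[100, 30, 55],
    lower=[93, 43, 40],
    upper=[109, 57, 60],
)
```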
[0017] In still other implementations, detecting may include identifying a cluster in each of the first and second difference maps, where each cluster has elements whose state within its associated difference map indicates that the elements are inconsistent with the background.
[0018] Identifying the cluster further may include reducing the difference map to one row by counting the elements within a column that are inconsistent with the background. Identifying the cluster further may include identifying the column as being within the cluster and classifying nearby columns as being within the cluster. Identifying the column as being within the cluster also may include identifying the median column.
[0019] Identifying the cluster further may include identifying a position associated with the cluster. Identifying the position associated with the cluster may include calculating the weighted mean of elements within the cluster.
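The cluster identification steps above can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: the median column is taken as the column at which the running count of tagged elements reaches half the total, "nearby" columns are those reachable without crossing more than `max_gap` consecutive empty columns, and all function names are hypothetical.

```python
def column_counts(diff_map):
    """Reduce a 2-D difference map to one row of per-column tag counts."""
    return [sum(1 for cell in column if cell) for column in zip(*diff_map)]


def median_column(counts):
    """Return the column where the running count reaches half the total."""
    total = sum(counts)
    if total == 0:
        return None
    running = 0
    for x, n in enumerate(counts):
        running += n
        if 2 * running >= total:
            return x


def _extend(counts, seed, step, max_gap):
    """Walk from seed in direction step, tolerating short empty gaps."""
    edge, gap, x = seed, 0, seed + step
    while 0 <= x < len(counts):
        if counts[x] > 0:
            edge, gap = x, 0
        else:
            gap += 1
            if gap > max_gap:
                break
        x += step
    return edge


def find_cluster(counts, max_gap=1):
    """Return (left, right, weighted_mean) for the cluster, or None."""
    seed = median_column(counts)
    if seed is None:
        return None
    left = _extend(counts, seed, -1, max_gap)
    right = _extend(counts, seed, +1, max_gap)
    weight = sum(counts[left:right + 1])
    mean = sum(x * counts[x] for x in range(left, right + 1)) / weight
    return left, right, mean


# Hypothetical difference map: 3 rows by 8 columns, 1 = inconsistent.
diff_map = [
    [0, 0, 0, 1, 1, 0, 0, 1],
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]
counts = column_counts(diff_map)
left, right, mean = find_cluster(counts)
```

In this example the isolated tagged element in the last column is separated from the cluster by two empty columns and is therefore excluded from the cluster and from its weighted mean.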
[0020] Detecting further may include classifying the cluster as the object of interest. Classifying the cluster further may include counting the elements within the cluster and classifying the cluster as the object of interest only if that count exceeds a predefined threshold. Classifying the cluster further may include counting the elements within the cluster and counting a total number of elements classified as inconsistent with the background within the difference map, and classifying the cluster as the object of interest only if the ratio of the count of elements within the cluster over the total number of elements exceeds a predefined threshold.
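The two classification criteria can be sketched in a few lines. Applying both criteria together is an assumption made for illustration, since the text presents each as an optional step, and the names `is_object_of_interest`, `min_count`, and `min_ratio` are hypothetical.

```python
def is_object_of_interest(cluster_count, total_count, min_count, min_ratio):
    """Classify a cluster using a count threshold and a ratio threshold.

    cluster_count: tagged elements inside the cluster.
    total_count:   all tagged elements in the difference map.
    """
    if total_count == 0:
        return False
    if cluster_count <= min_count:
        return False
    return cluster_count / total_count > min_ratio


# A cluster holding 6 of 7 tagged elements passes both hypothetical tests.
accepted = is_object_of_interest(6, 7, min_count=4, min_ratio=0.75)
# A cluster holding 6 of 20 tagged elements fails the ratio test.
rejected = is_object_of_interest(6, 20, min_count=4, min_ratio=0.75)
```

The ratio test rejects frames in which tagged elements are scattered across the difference map, as might happen when the lighting changes, rather than concentrated in one cluster.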
[0021] The step of detecting further may include identifying a sub-cluster within the cluster that represents a pointing end of the object of interest and identifying a position of the sub-cluster.
[0022] In the above implementations, the object of interest may be a user's hand, and the method may include controlling an application program using the absolute position of the object of interest.
[0023] The above implementations further may include acquiring a third image and a fourth image representing different viewpoints of the object of interest, processing the third image into a third image data set and the fourth image into a fourth image data set, and processing the third image data set and the fourth image data set to generate the background data set associated with the background. The method also may include generating a third difference map by determining differences between the third image data set and the background data set, and a fourth difference map by determining differences between the fourth image data set and the background data set, and detecting a third relative position of the object of interest in the third difference map and a fourth relative position of the object of interest in the fourth difference map. The absolute position of the object of interest may be produced from the first, second, third and fourth relative positions of the object of interest.
[0024] As part of this implementation, the object of interest may be a user's hand, and the method also may include controlling an application program using the absolute position of the object of interest.
[0025] In another aspect, a method of tracking an object of interest controlled by a user to interface with a computer is disclosed. The method includes acquiring images from at least two viewpoints, processing the acquired images to produce an image data set for each acquired image, and comparing each image data set to one or more background data sets to produce a difference map for each acquired image. The method also includes detecting a relative position of an object of interest within each difference map, producing an absolute position of the object of interest from the relative positions of the object of interest, and using the absolute position to allow the user to interact with a computer application.
[0026] Additionally, this method may include mapping the absolute position of the object of interest to screen coordinates associated with the computer application, and using the mapped position to interface with the computer application. This method also may include recognizing a gesture associated with the object of interest by analyzing changes in the absolute position of the object of interest, and combining the absolute position and the gesture to interface with the computer application.
[0027] In another aspect, a multiple camera tracking system for interfacing with an application program running on a computer is disclosed. The multiple camera tracking system includes two or more video cameras that are arranged to provide different viewpoints of a region of interest and are operable to produce a series of video images. A processor is operable to receive the series of video images and detect objects appearing in the region of interest. The processor executes a process to generate a background data set from the video images, generate an image data set for each received video image, compare each image data set to the background data set to produce a difference map for each image data set, detect a relative position of an object of interest within each difference map, produce an absolute position of the object of interest from the relative positions of the object of interest, and map the absolute position to a position indicator associated with the application program.
[0028] In the above implementation, the object of interest may be a human hand. Additionally, the region of interest may be defined to be in front of a video display associated with the computer. The processor may be operable to map the absolute position of the object of interest to the position indicator such that the location of the position indicator on the video display is aligned with the object of interest.
[0029] The region of interest may be defined to be any distance in front of a video display associated with the computer, and the processor may be operable to map the absolute position of the object of interest to the position indicator such that the location of the position indicator on the video display is aligned to a position pointed to by the object of interest. Alternatively, the region of interest may be defined to be any distance in front of a video display associated with the computer, and the processor may be operable to map the absolute position of the object of interest to the position indicator such that movements of the object of interest are scaled to larger movements of the location of the position indicator on the video display.
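As a rough sketch of the scaled mapping, one axis of the absolute position might be converted to a screen coordinate as shown below. The linear mapping, the clamping to the screen edges, and the names `map_to_screen` and `gain` are assumptions made for illustration. A `gain` above one scales small hand movements to larger indicator movements.

```python
def map_to_screen(position, region_min, region_max, screen_size, gain=1.0):
    """Map one axis of an absolute position to a pixel coordinate.

    position, region_min and region_max share the same physical unit.
    The centre of the region maps to the centre of the screen.
    """
    center = (region_min + region_max) / 2
    normalized = 0.5 + gain * (position - center) / (region_max - region_min)
    # Keep the indicator on screen when the hand leaves the scaled range.
    normalized = min(1.0, max(0.0, normalized))
    return round(normalized * (screen_size - 1))


# Hypothetical region one metre wide in front of a 1921-pixel-wide display.
centered = map_to_screen(0.0, -0.5, 0.5, 1921)
scaled = map_to_screen(0.1, -0.5, 0.5, 1921, gain=2.0)
clamped = map_to_screen(0.5, -0.5, 0.5, 1921, gain=2.0)
```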
[0030] The processor may be configured to emulate a computer mouse function. This may include configuring the processor to emulate controlling buttons of a computer mouse using gestures derived from the motion of the object of interest. A sustained position of the object of interest for a predetermined time period may trigger a selection action within the application program.
[0031] The processor may be configured to emulate controlling buttons of a computer mouse based on a sustained position of the object of interest for a predetermined time period. Sustaining a position of the object of interest within the bounds of an interactive display region for a predetermined time period may trigger a selection action within the application program.
[0032] The processor may be configured to emulate controlling buttons of a computer mouse based on a sustained position of the position indicator within the bounds of an interactive display region for a predetermined time period.
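The sustained-position selection can be sketched as a small state machine. The class name `DwellDetector`, the circular tolerance `radius`, and the rule that a selection fires only once until the position moves away are assumptions made for illustration.

```python
import math


class DwellDetector:
    """Report a selection when a position is held for a set duration."""

    def __init__(self, hold_seconds, radius):
        self.hold_seconds = hold_seconds
        self.radius = radius
        self._anchor = None   # position where the current dwell began
        self._start = None    # time at which the current dwell began
        self._fired = False   # True once this dwell has triggered

    def update(self, x, y, t):
        """Feed one position sample; return True when a selection fires."""
        moved = (
            self._anchor is None
            or math.hypot(x - self._anchor[0], y - self._anchor[1]) > self.radius
        )
        if moved:
            # Start timing a new dwell at the new position.
            self._anchor, self._start, self._fired = (x, y), t, False
            return False
        if not self._fired and t - self._start >= self.hold_seconds:
            self._fired = True
            return True
        return False


detector = DwellDetector(hold_seconds=1.0, radius=10)
events = [
    detector.update(100, 100, 0.0),
    detector.update(102, 101, 0.5),
    detector.update(101, 99, 1.0),   # held for one second: fires
    detector.update(101, 99, 1.5),   # already fired: no repeat
    detector.update(300, 300, 2.0),  # moved away: timing restarts
]
```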
[0033] In the above aspects, the background data set may include data points representing at least a portion of a stationary structure. In this implementation, at least a portion of the stationary structure may include a patterned surface that is visible to the video cameras. The stationary structure may be a window frame. Alternatively, the stationary structure may include a strip of light.
[0034] In another aspect, a multiple camera tracking system for interfacing with an application program running on a computer is disclosed. The system includes two or more video cameras that are arranged to provide different viewpoints of a region of interest and are operable to produce a series of video images. A processor is operable to receive the series of video images and detect objects appearing in the region of interest. The processor executes a process to generate a background data set from the video images, generate an image data set for each received video image, compare each image data set to the background data set to produce a difference map for each image data set, detect a relative position of an object of interest within each difference map, produce an absolute position of the object of interest from the relative positions of the object of interest, define sub regions within the region of interest, identify a sub region occupied by the object of interest, associate an action with the identified sub region that is activated when the object of interest occupies the identified sub region, and apply the action to interface with the application program.
[0035] In the above implementation, the object of interest may be a human hand. Additionally, the action associated with the identified sub region may emulate the activation of keys of a keyboard associated with the application program. In a related implementation, sustaining a position of the object of interest in any sub region for a predetermined time period may trigger the action.
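A simple sketch of associating sub regions with keyboard actions follows. The rectangular sub regions, the key names, and the function `sub_region_action` are hypothetical and serve only to illustrate the lookup.

```python
def sub_region_action(x, y, regions):
    """Return the action of the first sub region containing (x, y).

    regions is a list of ((x0, y0, x1, y1), action) pairs, where each
    rectangle includes its lower bounds and excludes its upper bounds.
    """
    for (x0, y0, x1, y1), action in regions:
        if x0 <= x < x1 and y0 <= y < y1:
            return action
    return None


# Hypothetical layout: left and right halves emulate two arrow keys.
regions = [
    ((0.0, 0.0, 0.5, 1.0), "KEY_LEFT"),
    ((0.5, 0.0, 1.0, 1.0), "KEY_RIGHT"),
]
left_action = sub_region_action(0.2, 0.4, regions)
right_action = sub_region_action(0.5, 0.4, regions)
no_action = sub_region_action(1.2, 0.4, regions)
```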
[0036] The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features and advantages will be apparent from the description and drawings, and from the claims.
[0062] Like reference symbols in the various drawings indicate like elements.
[0064] The series of video images acquired from the cameras
[0066] In summary, the object of interest
[0067] The processes performed by the image processor
[0068] Image detection modules
[0069] In block
[0070] Rotation causes the contents of an image to appear as if the image has been rotated. Rotation reorders the position of pixels from (x,y) to (x′,y′) according to the following equation:
[0071] where θ is the angle that the image is to be rotated.
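For reference, a rotation of pixel coordinates by an angle θ about the origin is conventionally written as shown below. The counterclockwise sign convention is an assumption. The opposite convention swaps the signs of the two sine terms.

```latex
\begin{bmatrix} x' \\ y' \end{bmatrix}
=
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
```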
[0072] If the cameras
[0073] where sh
[0074] An implementation of the multicamera control system
[0075] In some implementations of system
[0076] The particular implementation of the scaling module
[0077] An alternative implementation applies in scenarios where the controlled background
[0078] The preceding implementation allows the use of many existing surfaces, walls or window frames, for example, as the controlled background
[0079] The difference map
[0080] An implementation of the detection module
[0081] Continuing with this implementation of the detection module
[0082] In this implementation, a set of criteria is received by a cluster classification process
[0083] If the cluster
[0084] where:
[0085] x̄ is the mean
[0086] c is the number of columns
[0087] C[x] is the count of tagged pixels in column x.
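Given these definitions, a weighted mean of the tagged columns would take the following form. This is a sketch under the assumption that columns are indexed from 0 to c − 1 and that each column is weighted by its count of tagged pixels.

```latex
\bar{x} = \frac{\sum_{x=0}^{c-1} x \, C[x]}{\sum_{x=0}^{c-1} C[x]}
```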
[0088] The cluster's bounds
[0089] In addition to the middle and bound coordinates, the orientation of the object of interest
[0090] In some applications of the system
[0091] Continuing with this implementation, a process implemented by a smoothing module
[0092] A process used in the center of gravity process
[0093] In Eq. 1, the smoothed value at time t, s(t), is the sum of two terms. The first term is the smoothed value at the previous time step, s(t−1), multiplied by one minus a scalar (a). The second term is the raw value at time t, r(t), multiplied by the scalar (a), where a is between zero and one.
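The relation described for Eq. 1, s(t) = (1 − a)·s(t−1) + a·r(t), can be sketched as follows. Seeding the smoothed value with the first raw value is an assumption, as is the function name `smooth`.

```python
def smooth(raw_values, a):
    """Apply s(t) = (1 - a) * s(t-1) + a * r(t) to a sequence.

    a is the scalar between zero and one; larger values follow the raw
    data more closely, smaller values smooth more heavily.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be between zero and one")
    smoothed = []
    s = None
    for r in raw_values:
        # Assumption: the first smoothed value equals the first raw value.
        s = r if s is None else (1 - a) * s + a * r
        smoothed.append(s)
    return smoothed


result = smooth([0, 10, 10], a=0.5)
```

With a = 0.5 the smoothed value moves halfway toward each new raw value, so a sudden jump in the raw position is approached gradually over several frames.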
[0094] Referring to
[0095] Input data
[0096] In this implementation of the background model component
[0097] In this implementation, a median process block
[0098] The background model component
[0099] The duration that any one pixel of the controlled background
[0100] The preceding discussion presents one implementation of obtaining the position of the object of interest
[0101] Turning to
[0102] Eq. 3, as shown in
[0103] To approximate the angle beta (β), the inverse tangent is applied to the quantity of the focal length (f) divided by the position p on the image plane projected onto the intersection of the reference plane and the image plane.
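The angle computation described above, β = arctan(f / p), can be sketched as shown below. Using `math.atan2` instead of a plain division is an assumption made so that p = 0 yields 90 degrees instead of a division error. For positive p the result equals arctan(f / p), while for negative p it returns an obtuse angle rather than a negative one. The function name `angle_beta` is hypothetical.

```python
import math


def angle_beta(f, p):
    """Approximate beta in radians from focal length f and position p.

    f and p must be expressed in the same unit (for example, pixels).
    """
    return math.atan2(f, p)


# Hypothetical values: focal length 500, image-plane position 500.
beta = angle_beta(500.0, 500.0)
beta_on_axis = angle_beta(500.0, 0.0)
```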
[0104] For maximum precision, the intrinsic camera parameters (location of the principal point and scale of image) and radial distortion caused by the lens should be corrected for by converting the distorted position (as represented by the relative position information
[0105] Continuing with the description of combination module
[0106] A formula for measurement of the angles is shown in Eq. 4:
[0107] Measurement of the angle alpha (α) is equal to the angle beta_not (β
[0108] Eq. 4 is applied to measure the angles
[0109] Eq. 5 calculates the offset of the object of interest (y) by the formula:
[0110] The offset (y) is equal to the reciprocal of the tangent of the angle (α
[0111] Eq. 6 calculates the offset of the object of interest (x
[0112] In Eq. 6, the offset (x
[0113] The position of the object
[0114] In Eq. 7, the position (z) is calculated as the position (p) on the image plane projected onto the vector of the image plane perpendicular to that used in Eq. 3 divided by the focal length (f) multiplied by the distance of the object of interest
[0115] These relations provide a coordinate of the object of interest
[0116] Smoothing may optionally be applied to these coordinates in refinement module
[0117] In Eq. 8, s(t) is the smoothed value at time t, r(t) is the raw value at time t, D
[0118] Two distance thresholds, D
[0119] In Eq. 9, the scalar (a) is bounded such that it is greater than or equal to zero and less than or equal to one, the dampening value of S is found by Eq. 8, and e is the elapsed time since the previous frame.
[0120] These coordinates
[0121] In a typical implementation of the system, the application program
[0122] In one variation of this form of user interface, the indicator, such as a mouse pointer, is shown in front of other graphics, and its movements are mapped to the two dimensional space defined by the surface of the screen. This form of control is analogous to that provided by a computer mouse, such as that used with the Microsoft® Windows® operating system. An example feedback image of an application that uses this style of control is shown as
[0123] Referring to
[0124] In Eq. 10, x