[0001] This invention relates generally to machine vision systems, and more particularly, to visual recognition of objects under partial occlusion, clutter, or non-linear contrast changes.
[0002] In object recognition, and in particular in many machine vision tasks, one is interested in recognizing a user-defined model object in an image. The object in the image may have undergone arbitrary transformations of a certain class of geometric transformations. If the class of transformations is the class of translations, one is interested in obtaining the position of the model in the image. The class of translations is typically used if it can be ensured that the model always occurs in the same rotation and size in the image, e.g., because it is mounted at a fixed angle on an x-y stage and the camera is mounted in a fixed position perpendicular to the stage. If the class of transformations is the class of rigid transformations, additionally the rotation of the object in the image is desired. This class of transformations can, for example, be used if the camera is mounted perpendicular to the stage, but the angle of the object cannot be kept fixed. If the class of transformations is the class of similarity transformations, additionally the size of the object in the image may vary. This class of transformations can occur, for example, if the distance between the camera and the object cannot be kept fixed or if the object itself may undergo size changes. If neither the position nor the 3D rotation of the camera with respect to the object can be kept fixed, the object will undergo a general perspective transformation in the image. If the interior orientation of the camera is unknown, a perspective projection between two planes (i.e., the surface of the object and the image plane) can be described by a 3×3 matrix in homogeneous coordinates:
[0003] The matrix and vectors are only determined up to an overall scale factor (see Hartley and Zisserman (2000) [Richard Hartley and Andrew Zisserman: Multiple View Geometry in Computer Vision. Cambridge University Press, 2000], chapters 1.1-1.4). Hence, the matrix, which determines the pose of the object, has eight degrees of freedom. If the interior orientation of the camera is known, these eight degrees of freedom reduce to the six degrees of freedom of the pose of the object with respect to the camera (three for translation and three for rotation).
[0004] Often, this type of transformation is approximated by a general 2D affine transformation, i.e., a transformation where the output points (x′,y′,)
[0005] General affine transformations can, for example, be decomposed into the following, geometrically intuitive, transformations: A scaling of the original x and y axes by different scaling factors s
[0007] Several methods have been proposed in the art to recognize objects in images. Most of them suffer from the restriction that the model will not be found in the image if it is occluded or degraded by additional clutter objects. Furthermore, most of the existing methods will not detect the model if the image exhibits non-linear contrast changes, e.g., due to illumination changes.
[0008] All of the known object recognition methods generate an internal representation of the model in memory at the time the model is generated. To recognize the model in the image, in most methods the model is systematically compared to the image using all allowable degrees of freedom of the chosen class of transformations for the pose of the object (see, e.g., Borgefors (1988) [Gunilla Borgefors. Hierarchical chamfer matching: A parametric edge matching algorithm.
[0009] The simplest class of object recognition methods is based on the gray values of the model and image itself and uses normalized cross correlation as a match metric (see U.S. Pat. Nos. 4,972,359, 5,222,155, 5,583,954, 5,943,442, 6,088,483, and Brown (1992), for example). Normalized cross correlation has the advantage that it is invariant to linear brightness changes, i.e., the object can be recognized if it has undergone linear illumination changes. However, normalized cross correlation has several distinct disadvantages. First, it is very expensive to compute, making the methods based on this metric very slow. This leads to the fact that the class of transformations is usually chosen as the class of translations only because otherwise the search would take too much time for real-time applications, even if image pyramids are used. Second, the metric is not robust to occlusions of the object, i.e., the object will usually not be found even if only small parts of it are occluded in the image. Third, the metric is not robust to clutter, i.e., the object will usually not be found if there are disturbances on or close to the object.
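The two properties of normalized cross correlation named above can be illustrated with a minimal sketch. The function name `ncc` and the sample gray values are hypothetical and chosen only for illustration; the code is not taken from any of the cited methods.

```python
import math

def ncc(model, window):
    """Normalized cross correlation of two equally sized gray-value lists."""
    n = len(model)
    mean_m = sum(model) / n
    mean_w = sum(window) / n
    cov = sum((m - mean_m) * (w - mean_w) for m, w in zip(model, window))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_w = sum((w - mean_w) ** 2 for w in window)
    if var_m == 0 or var_w == 0:
        return 0.0  # flat patch: correlation is undefined, report no match
    return cov / math.sqrt(var_m * var_w)

# Hypothetical gray values of a small model patch.
model = [10, 20, 30, 80, 90, 40]
brighter = [2.0 * g + 15 for g in model]  # linear brightness change
occluded = model[:3] + [0, 0, 0]          # half of the pattern covered

score_linear = ncc(model, brighter)
score_occluded = ncc(model, occluded)
```

Under these assumptions the linearly brightened patch still correlates perfectly with the model, whereas covering half of the patch lowers the score far below any useful acceptance threshold.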
[0010] Another class of algorithms is also based on the gray values of the model and image itself, but uses either the sum of the squared gray value differences or the sum of the absolute value of the gray value differences as the match metric (see U.S. Pat. No. 5,548,326 and Brown (1992), for example). This metric can be made invariant to linear brightness changes (Lai and Fang (1999) [Shang-Hong Lai and Ming Fang. Accurate and fast pattern localization algorithm for automated visual inspection.
[0011] A more complex class of object recognition methods does not use the gray values of the model or object itself, but uses the edges of the object for matching. During the creation of the model, edge extraction is performed on the model image and its derived image pyramid (see, e.g., Borgefors (1988), Rucklidge (1997), and U.S. Pat. No. 6,005,978). Edge extraction is the process of converting a gray level image into a binary image in which only the points corresponding to an edge are set to the value 1, while all other pixels receive the value 0, i.e., the image is actually segmented into an edge region. Of course, the segmented edge region need not be stored as a binary image, but can also be stored by other means, e.g., runlength encoding. Usually, the edge pixels are defined as the pixels in the image where the magnitude of the gradient is maximum in the direction of the gradient. Edge extraction is also performed on the image in which the model is to be recognized and its derived image pyramid. Various match metrics can then be used to compare the model to the image. One class of match metrics is based on measuring the distance of the model edges to the image edges under the pose under consideration. To facilitate the computation of the distances of the edges, a distance transform is computed on the image pyramid. The match metric in Borgefors (1988) computes the average distance of the model edges and the image edges. Obviously, this match metric is robust to clutter edges since they do not occur in the model and hence can only decrease the average distance from the model to the image edges. The disadvantage of this match metric is that it is not robust to occlusions because the distance to the nearest edge increases significantly if some of the edges of the model are missing in the image. The match metric in Rucklidge (1997) tries to remedy this shortcoming by calculating the k-th largest distance of the model edges to the image edges. 
If the model contains n points, the metric hence is robust to 100*k/n% occlusion. Another class of match metrics is based on simple binary correlation, i.e., the match metric is the average of all points in which the model and the image under the current pose both have an edge pixel set (see U.S. Pat. Nos. 6,005,978 and 6,111,984, for example). To speed up the search for potential instances of the model, in U.S. Pat. No. 6,005,978 the generalized Hough transform (Ballard (1981) [D. H. Ballard. Generalizing the Hough transform to detect arbitrary shapes.
[0012] Evidently, the state-of-the-art methods for object recognition possess several shortcomings. None of the approaches is robust against occlusion, clutter, and non-linear contrast changes at the same time. Furthermore, often computationally expensive preprocessing operations, e.g., distance transforms or generalized Hough transforms, need to be performed to facilitate the object recognition. In many applications it is necessary that the object recognition step is robust to the types of changes mentioned above. For example, in print quality inspection, the model image is the ideal print, e.g., of a logo. In the inspection, one is interested in determining whether the current print deviates from the ideal print. To do so, the print in the image must be aligned with the model (usually by a rigid transformation). Obviously the object recognition (i.e., the determination of the pose of the print) must be robust to missing characters or parts thereof (occlusion) and to extra ink in the print (clutter). If the illumination cannot be kept constant across the entire field of view, the object recognition obviously must also be robust to non-linear illumination changes. Hence, it is an object of the present invention to provide an improved visual recognition system and method for occlusion- and clutter-invariant object recognition. It is a further object to provide a visual recognition system and method for occlusion-, clutter-, and illumination-invariant object recognition.
[0013] These objects are achieved with the features of the claims.
[0014] This invention provides a system and method for object recognition that is robust to occlusion, clutter, and non-linear contrast changes.
[0015] The model of the object to be recognized consists of a plurality of points with a corresponding directional vector, which can be obtained by standard image preprocessing algorithms, e.g., line or edge detection methods. At the time of creation of the model, the model is stored in memory by transforming the model image by a plurality of transformations from the class of geometric transformations by which the model may be distorted, e.g., rigid transformations. To recognize the model in the image, the same preprocessing operations that were used at the time of creation of the model are applied to the image in which the model should be found. Hence, for example if line detection was used to construct the model, line filtering is used on the image under consideration. The object is recognized by computing a match metric for all possible transformations of the model in the image. The match metric takes into account the geometrical information of the model and the image, i.e., the positions and directions of the points in the model and in the image. The match metric can be, for example, the sum of the dot product of one of the (precomputed) transformed models and the preprocessed image, or—in an alternative embodiment—the sum of the normalized dot product of one of the (precomputed) transformed models and the preprocessed image. Since the unnormalized dot product only relies on geometric information, it is not necessary to segment (binarize) the model image and the image in which the model is to be found. This makes the method robust against occlusion and clutter. If the normalized dot product is used as the match metric, it is preferred to segment the model image to obtain those points where the direction information is reliable. Again, the image in which the model is to be found is not segmented, leading to true robustness against arbitrary illumination changes, as well as occlusion and clutter. 
The location of a model in the image is given by the set of transformations (poses) where the match metric is higher than a certain, user-selectable threshold, and is locally maximal within the class of selected transformations.
[0016] To speed up the object recognition process, preferably the space of allowable transformations is searched using a recursive coarse-to-fine strategy.
[0017] The parameters of the found instances of the model in the image, e.g., the translation and rotation, can be used to control a robot or any other device that uses such geometric information.
[0018] The following description is presented to enable any person skilled in the art to make and use the invention. Descriptions of specific applications are provided only as examples.
[0027] The present invention provides a method for object recognition that is robust to occlusion, clutter, and non-linear contrast changes.
[0028] Match Metric
[0029] The model of an object consists of a plurality of points with a corresponding directional vector. Typically, the model is generated from an image of the object, where an arbitrary region of interest (ROI) specifies that part of the image in which the object is located. The ROI can, for example, be specified interactively by the user of the system. Alternatively, the ROI could, for example, be generated by the use of machine vision methods, e.g., various segmentation operations like thresholding, morphology, etc. Details of the model generation method will be discussed below.
[0031] In light of the foregoing, the model consists of points p
[0032] As discussed above, the match metric by which the transformed model is compared to the image content must be robust to occlusions, clutter, and lighting changes. One possible metric which achieves this goal according to one embodiment of the present invention is to sum the (unnormalized) dot product of the direction vectors of the transformed model and the image over all points of the model to compute a matching score at a particular point (x,y)
[0033] The advantage of this match metric is that neither the model image nor the image in which the model should be recognized need to be segmented (binarized), i.e., it suffices to use a filtering operation that only returns direction vectors instead of an extraction operation which also segments the image. Therefore, if the model is generated by edge or line filtering, and the image is preprocessed in the same manner, this match metric fulfills the requirements of robustness to occlusion and clutter. If parts of the object are missing in the image, there are no lines or edges at the corresponding positions of the model in the image, i.e., the direction vectors e
[0034] However, with the above match metric, if the image brightness is changed, e.g., by a constant factor, the match metric changes by the same factor. Therefore, it is preferred to modify the match metric by calculating the sum of the normalized dot product of the direction vectors of the transformed model and the image over all points of the model, i.e.:
[0035] Because of the normalization of the direction vectors, this match metric is additionally invariant to arbitrary illumination changes. In this preferred embodiment all vectors are scaled to a length of 1, and what makes this metric robust against occlusion and clutter is the fact that if an edge or line is missing, either in the model or in the image, noise will lead to random direction vectors, which, on average, will contribute nothing to the sum.
[0036] The above match metric will return a high score if all the direction vectors of the model and the image align, i.e., point in the same direction. If edges are used to generate the model and image vectors, this means that the model and image must have the same contrast direction for each edge. This metric would, for example, be able to recognize only crosses that are darker than the background if the model is generated from a cross that is darker than the background. Sometimes it is desirable to be able to detect the object even if its contrast is reversed. To make the match metric robust against such global changes of contrast, the absolute value of the sum of the normalized dot products can be used according to a further preferred embodiment of the present invention, i.e., the match metric becomes:
[0037] This match metric means geometrically that all direction vectors in the image (as a whole) must either point in the same direction or in the opposite direction as the direction vectors in the model.
[0038] In rare circumstances, it might be necessary to ignore even local contrast changes, e.g., if the objects to be recognized consist of a medium gray body, which can have either a darker or lighter print on it. In this case, according to a further preferred embodiment, the match metric can be modified to be the sum of the absolute values of the normalized dot products:
[0039] Geometrically, this match metric means that each direction vector in the image individually must either point in the same or opposite direction as the corresponding direction vector in the model.
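The three normalized variants described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the names `match_scores` and `_normalized_dot`, the dictionary used to hold image direction vectors, and the `min_length` guard against division by zero are all hypothetical. The sums are divided by the number of model points n, which is assumed here so that a perfect match yields a score of 1.

```python
import math

def _normalized_dot(d, e, min_length=1e-9):
    """Dot product of d and e after scaling both to length 1.

    Returns 0 if either vector is too short to carry a direction.
    """
    len_d = math.hypot(d[0], d[1])
    len_e = math.hypot(e[0], e[1])
    if len_d < min_length or len_e < min_length:
        return 0.0
    return (d[0] * e[0] + d[1] * e[1]) / (len_d * len_e)

def match_scores(model, image_dirs, x, y):
    """Return (plain, global_abs, local_abs) for the model placed at (x, y).

    model      -- list of ((px, py), (dx, dy)): point offset, direction vector
    image_dirs -- dict mapping (col, row) to the image direction vector
    """
    terms = []
    for (px, py), d in model:
        e = image_dirs.get((x + px, y + py), (0.0, 0.0))
        terms.append(_normalized_dot(d, e))
    n = len(model)
    plain = sum(terms) / n                       # same contrast direction only
    global_abs = abs(sum(terms)) / n             # tolerates global reversal
    local_abs = sum(abs(t) for t in terms) / n   # tolerates local reversal
    return plain, global_abs, local_abs

# Hypothetical four-point model and direction fields for an instance at (5, 7).
model = [((0, 0), (1.0, 0.0)), ((1, 0), (0.0, 1.0)),
         ((0, 1), (-1.0, 0.0)), ((1, 1), (0.0, -1.0))]
same = {(5 + px, 7 + py): (3.0 * dx, 3.0 * dy) for (px, py), (dx, dy) in model}
flipped = {pos: (-ex, -ey) for pos, (ex, ey) in same.items()}

scores_same = match_scores(model, same, 5, 7)
scores_flipped = match_scores(model, flipped, 5, 7)
```

With these inputs the image vectors are three times longer than the model vectors, yet all three scores equal 1 because only directions enter the sum. With globally reversed contrast the plain score becomes -1 while both absolute-value variants remain 1.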
[0040] The above three normalized match metrics are robust to occlusion in the sense that the object will be found even if it is occluded. As mentioned above, this results from the fact that the missing object points in the instance of the model in the image will on average contribute nothing to the sum. For any particular instance of the model in the image, this may not be true, e.g., because the noise in the image is not uncorrelated. As an undesired consequence, the instance of the model may be found in different poses in different images, even if the model does not move in the images, because in a particular image of the model the random direction vectors will contribute slightly different amounts to the sum, and hence the maximum of the match metric will change randomly. To make the localization of the model more precise, it is useful to set the contribution of the missing model points in the image to zero. The easiest way to do this is to set all inverse lengths 1/∥e
[0041] All three normalized match metrics have the property that they return a number no larger than 1 as the score of a potential match. In all cases, a score of 1 indicates a perfect match between the model and the image. Furthermore, the score roughly corresponds to the portion of the model that is visible in the image. For example, if the object is 50% occluded, the score cannot exceed 0.5. This is a highly desirable property because it gives the user the means to select an intuitive threshold for when an object should be considered as recognized.
[0042] Since the dot product of the direction vectors is related to the angle the direction vectors enclose by the arc cosine function, other match metrics could be defined that also capture the geometrical meaning of the above match metrics. One such metric is to sum up the absolute values of the angles that the direction vectors in the model and the direction vectors in the image enclose. Such a match metric returns values greater than or equal to zero, with a value of zero indicating a perfect match. Consequently, the pose of the model must be determined from the minimum of the match metric.
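A minimal sketch of such an angle-based metric is given below. The function names are hypothetical, and `math.atan2` of the cross and dot products is used in place of the arc cosine purely for numerical stability; both give the same enclosed angle for nonzero vectors.

```python
import math

def angle_between(d, e):
    """Angle in radians (0..pi) enclosed by two nonzero 2D vectors."""
    cross = d[0] * e[1] - d[1] * e[0]
    dot = d[0] * e[0] + d[1] * e[1]
    return abs(math.atan2(cross, dot))

def angle_metric(model_dirs, image_dirs):
    """Sum of absolute enclosed angles; 0 means every pair is aligned."""
    return sum(angle_between(d, e) for d, e in zip(model_dirs, image_dirs))

aligned = angle_metric([(1, 0), (0, 1)], [(2, 0), (0, 5)])
one_off = angle_metric([(1, 0), (0, 1)], [(2, 0), (1, 0)])
```

Here `aligned` is zero because both pairs point in the same direction, while `one_off` equals a quarter turn because the second pair is perpendicular. A search based on this metric would look for minima instead of maxima.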
[0043] Object Recognition Method
[0044] To find the object in the image, the a-priori unbounded search space needs to be bounded. This is achieved by having the user set thresholds for the parameters of the search space. For example, in the case of affine transformations the user specifies thresholds for the two scaling factors, the skew angle, and the rotation angle:
[0045] The bounds for the translation parameters could also be specified by two thresholds each, but this would limit the space of translations to rectangles. Therefore, according to the method of the invention the bounds for the translation parameters are more conveniently specified as an arbitrary region of interest in the image in which the model should be recognized.
[0046] The simplest form of matching is to discretize the bounded search space, e.g., as described below, to transform the model by all combinations of transformation parameters thus obtained, and to compute the match metric for all resulting transformed models. This results in a score for all possible parameter combinations. After this, all valid object instances can be selected by requiring that the score at the respective set of parameters is above a minimum threshold selected by the user, i.e., m≧m
[0047] If the object to be recognized possesses symmetries, e.g., the rotation and reflection symmetry of the cross in
[0048] In the usual mode of operation, all instances that fulfill the score and overlap criterion are returned by the method of the invention. Sometimes it is known a-priori how many instances of the model need to be found in the image. Therefore, the user may specify the number o of instances of the model to be found in the image. In this case, only the o best instances after the removal of overlapping instances are returned.
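The selection of instances by score, overlap, and maximum count can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and axis-aligned rectangles stand in for the smallest enclosing rectangles of arbitrary orientation, which keeps the intersection computation short.

```python
def rect_area(r):
    """Area of an axis-aligned rectangle (x0, y0, x1, y1); 0 if degenerate."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def overlap_fraction(a, b):
    """Intersection area divided by the area of the smaller rectangle."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    smaller = min(rect_area(a), rect_area(b))
    return rect_area(inter) / smaller if smaller > 0 else 0.0

def select_instances(candidates, min_score, max_overlap, max_instances=None):
    """candidates: list of (score, rect). Returns kept instances, best first."""
    kept = []
    for score, rect in sorted(candidates, key=lambda c: c[0], reverse=True):
        if score < min_score:
            break  # list is sorted, so all remaining scores are too low
        # Keep an instance only if it does not overlap a better one too much.
        if all(overlap_fraction(rect, k[1]) <= max_overlap for k in kept):
            kept.append((score, rect))
    return kept if max_instances is None else kept[:max_instances]

# Hypothetical candidate list: two strongly overlapping hits, one separate
# hit, and one hit below the score threshold.
candidates = [(0.90, (1, 1, 11, 11)), (0.95, (0, 0, 10, 10)),
              (0.80, (20, 20, 30, 30)), (0.40, (50, 50, 60, 60))]
all_found = select_instances(candidates, min_score=0.5, max_overlap=0.5)
best_one = select_instances(candidates, 0.5, 0.5, max_instances=1)
```

In this example the candidate with score 0.90 is discarded because it overlaps the better candidate by 81%, and the candidate with score 0.40 is rejected by the score threshold.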
[0049] Discretization of the Search Space
[0050] All of the match metrics disclosed in this invention require a large overlap of the transformed model and the object in the image for the correct score to be computed. The degree of overlap is influenced by the preprocessing steps taken to obtain the direction information. If line or edge filtering methods are used to obtain the direction information, the degree of overlap of the model and the instance in the image directly depends on the degree of smoothing that was used in the line or edge filter. If a filter with little smoothing is used, e.g., a Sobel edge filter, all points of the transformed model must lie within approximately one pixel of the instance in the image (depending on how blurred the edges of the object appear in the image by the optics of the camera) so that the correct score is obtained. If the image is smoothed before the edge or line extraction is performed, this distance becomes larger in proportion to the amount of smoothing that is applied because the edges or lines are broadened by the smoothing operation. For example, if a mean filter of size k×k is applied before or in the feature extraction (e.g., line or edge filtering), the model points must lie within k pixels of the instance. Similar results hold for other smoothing filters, e.g., the Gaussian filter used in the Canny edge extractor and the Steger line detector.
[0051] The transformation space needs to be discretized in a manner that the above requirement of all model points lying at most k pixels from the instance in the image can be ensured.
[0052] Speed-Up of the Recognition Method
[0053] The exhaustive search algorithm described above will find the object in the image if it is present, but the runtime of the method will be fairly high. Various methods are used according to preferred embodiments of the present invention to speed up the recognition process.
[0054] First, the sum of the dot products is preferably not computed completely, since the score must be above the minimum score m
[0055] All of the remaining n−j terms of the sum are smaller than or equal to one. Therefore, the partial score m
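The early termination idea can be sketched as follows, assuming the score is the sum of the per-point terms divided by n and that each term is at most one. After j terms, the best achievable final score is then the partial score plus (n−j)/n, so the evaluation can stop as soon as this bound falls below the minimum score. The function name and the return convention are hypothetical, and the terms are passed as a ready-made list only to keep the example short; in practice they would be computed one at a time.

```python
def score_with_early_stop(terms, min_score):
    """Accumulate terms / n, stopping once min_score is out of reach.

    terms -- per-point normalized dot products, each assumed to be <= 1
    Returns (score or None, number of terms evaluated).
    """
    n = len(terms)
    partial = 0.0
    for j, t in enumerate(terms, start=1):
        partial += t / n
        # Best case: every remaining term equals 1 and adds (n - j) / n.
        if partial + (n - j) / n < min_score:
            return None, j
    return partial, n

poor = [0.1, 0.0, 0.2, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
rejected = score_with_early_stop(poor, min_score=0.8)
accepted = score_with_early_stop([1.0] * 10, min_score=0.8)
```

For the `poor` list the evaluation stops after three of ten terms, because even a perfect remainder could only raise the score to 0.73.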
[0056] Second, the normalization of the gradient lengths for the model points is obviously best done at the time the model is created, i.e., the model points are stored with direction vectors of length 1. The inverse gradient lengths 1/∥e
[0057] Another large speed-up is obtained by searching the space of allowable transformations using a recursive coarse-to-fine strategy. The object recognition method described above corresponds to a search in a search space with the finest possible discretization where the step lengths are set to the values that are obtained for k=1. This results in a relatively slow exhaustive search. To speed up the search, the exhaustive search can be carried out in a search space where the step lengths are multiples of the step length at the finest level. The matches found in this coarsely discretized search space must then be tracked through progressively more finely discretized search spaces until the match is found in the finest level of discretization. In all levels of discretization, the note on smoothing in the section on the discretization of the search space must be observed, i.e., temporary images for each level of discretization must be created where each image must be smoothed by an amount that corresponds to the step length of the corresponding discretization level in order to ensure that the object is always found. A good choice for the step length parameters for the different discretization levels is to set them to the values that are obtained if k=2
l    Δt_x   Δt_y   ΔΨ
0    1      1      0.573
1    2      2      1.146
2    4      4      2.292
3    8      8      4.585
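The tabulated step lengths can be reproduced with a short calculation. The sketch below rests on two assumptions inferred from the numbers rather than stated in the surviving text: the factor is k = 2^l at level l, and the rotation step is the angle (in degrees) by which a point at a distance of 100 pixels from the rotation center moves by k pixels. The function name and the parameter `d_max` are hypothetical.

```python
import math

def step_lengths(level, d_max=100.0):
    """Translation and rotation step lengths for one discretization level.

    Assumes k = 2**level and a rotation step whose chord length at
    distance d_max from the rotation center is k pixels.
    """
    k = 2 ** level
    d_phi = math.degrees(2.0 * math.asin(k / (2.0 * d_max)))
    return k, k, d_phi

rows = [(level,) + step_lengths(level) for level in range(4)]
for level, dt_x, dt_y, d_phi in rows:
    print(level, dt_x, dt_y, round(d_phi, 3))
```

Rounded to three decimals, the rotation steps computed this way agree with the four values in the table.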
[0058] For a further speed-up, the number of points in the model is preferably also reduced by a factor k=2
[0059] If the space of transformations is the space of translations or the space of rigid transformations the method described above will already have close to optimal runtime. If the space of transformations is larger, e.g., the space of similarity or affine transformations, it may be possible to speed up the search by using methods such as the generalized Hough transform (Ballard (1981)) or the method described in Rucklidge (1997), which potentially rules out large parts of the transformation space quickly, to identify potential matches more quickly in the coarsest level of the discretization space. Because the potential of false matches is higher with these methods, it is essential to use the method disclosed here to verify and track the potential matches through the levels of the discretization space. Furthermore, because the accuracy of the potential matches will be poor, the pose returned by the preprocessing steps must be enlarged to a region in the transformation space by a sufficient amount to ensure that the true match based on the method disclosed here will be found.
[0060] With these explications, the preferred object recognition method can be seen in
[0061] Then, the image is transformed into a representation that is consistent with the recursive subdivision of the search space (step 2). In the preferred embodiment the user selects the coarsest subdivision of the search space by specifying the parameter l
[0062] After this step, feature extraction is performed in each of the l
[0063] Hereupon, an exhaustive search through the coarsest level l
[0064] Once the exhaustive match on the coarsest discretization level is complete, the found instances are tracked through the finer levels of the discretization space until they are found at the lowest level of the discretization space (step 5). The tracking is performed as follows: The first unprocessed model instance is removed from the list of model instances. This is the unprocessed instance with the best score, since the list of instances is sorted by the score. The pose parameters of this instance are then used to define a search space in the next lower level of the discretization. Ideally, the model would be located at the position given by the appropriate transformation of the pose parameters, i.e., the scaling parameters s
[0065] On the finest level of the discretization space, found instances are checked if they overlap too much with other instances at the time the instance is inserted into the list (step 6). As described above, the overlap between the instances is calculated as the ratio of the area of the intersection of the smallest enclosing rectangle of arbitrary orientation around each pair of instances and the smaller of the two rectangles. If the overlap is larger than a user-supplied fraction, only the instance with the better score is kept in the list. If the user has not specified a maximum number of instances to find, the recursive tracking of the model stops if all found instances are on the finest level of the discretization. If the user has specified a maximum number o of instances to find, the recursive tracking of the model stops if the number of instances found on the finest discretization level is less than o and if all found instances are in the finest level of the discretization, i.e., if there are fewer instances in the image than the number specified by the user. Alternatively, the search stops if o instances have been found in the finest level of the discretization. The tracking method then checks all unprocessed instances in coarser levels of discretization to see if their score is close enough to the score of the worst found instance on the finest level because these instances might lead to better scores in the finer levels of discretization than the best o instances found so far. If an unprocessed instance has a score better than a constant, e.g., 0.9, times the worst score found on the finest level, this instance is also tracked recursively through the search space in the above manner to ensure that the best o matches are found. This means that all extraneous instances, i.e., all instances found over the limit o, are removed in this step 6.
[0066] If the user has specified that the pose should be returned with a better resolution than the finest discretization level, the maximum of the match metric corresponding to each found instance is extrapolated with subpixel resolution (step 7). This can, for example, be done by calculating the first and second derivatives of the match metric with respect to the parameters of the chosen space of transformations. The first and second derivatives can be obtained, for example, from scores neighboring the maximum score by convolution with the appropriate first and second derivative masks, e.g., n-dimensional facet model masks (see Steger (1998) [Carsten Steger. An unbiased detector of curvilinear structures.
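For a single pose parameter, the refinement from first and second derivatives can be sketched as below. This is a one-dimensional illustration that uses plain central differences on the three scores around the maximum, not the facet model masks mentioned above; the function name is hypothetical.

```python
def subpixel_peak(left, center, right):
    """Offset of the fitted parabola's maximum relative to the center sample.

    left, center, right -- scores at parameter offsets -1, 0, +1 (in steps).
    """
    first = (right - left) / 2.0           # central first derivative
    second = left - 2.0 * center + right   # second derivative
    if second >= 0:
        return 0.0  # not a strict local maximum: keep the discrete position
    return -first / second

# Scores sampled from the parabola 0.9 - 0.1 * (t - 0.3)**2 at t = -1, 0, 1.
offset = subpixel_peak(0.731, 0.891, 0.851)
```

In this example the returned offset recovers the true maximum at 0.3 steps from the discrete maximum. For several pose parameters, the same idea would use the gradient and the Hessian of the score in place of the two scalar derivatives.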
[0067] Extracting the pose with a better resolution than the finest discretization level by extrapolating the maximum of the match metric already yields poses that are accurate enough for almost all applications (typically better than 1/20 pixel in position and 1/10° in rotation, for example). In rare cases, however, it might be desirable to extract the pose with even greater accuracy. This can be done by a least-squares fit of the points in the model to the points of the found instance of the model in the image. Note that this requires extracting points from the image, which was not necessary so far. This point will be discussed below. Traditionally, the least-squares fit would be performed by finding for each model point the corresponding point in the image and minimizing the average distance between the model points and the image points. Of course, not every point in the model needs to have a corresponding point in the image, e.g., because of occlusion. Therefore, the correspondence results in a subset q
[0068] With this approach, the model and image points must be extracted with subpixel precision. If they are only extracted with pixel precision, the model points on average cannot be moved closer to the image points than approximately 0.25 pixels because of the discrete nature of the model and image points, and hence no improvement in the accuracy of the pose would result. However, even if the model and image points are extracted with subpixel precision the model and the image cannot be registered perfectly because usually the image and model points will be offset laterally, which typically results in a nonzero average distance even if the model and the found instance would align perfectly. Furthermore, the traditional least-squares approach neglects the direction information inherent in the model and the image. These shortcomings can be overcome by minimizing the distance of the image points from the line through the corresponding model point that is perpendicular to the direction stored in the model. For edges and lines, this line is parallel to the model edge or line. The line through the model point in the direction perpendicular to the model direction vector is given by d
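The point-to-line distance that replaces the point-to-point distance can be sketched as follows. The function names are hypothetical, the correspondences are assumed to be given, and the sketch only evaluates the cost for fixed points; it does not perform the minimization over the pose.

```python
import math

def distance_to_model_line(p, d, q):
    """Distance of image point q from the line through model point p
    that is perpendicular to the model direction vector d."""
    length = math.hypot(d[0], d[1])
    return abs(d[0] * (q[0] - p[0]) + d[1] * (q[1] - p[1])) / length

def mean_squared_line_distance(model, image_points):
    """model: list of (p, d); image_points: matched points q, same order."""
    total = sum(distance_to_model_line(p, d, q) ** 2
                for (p, d), q in zip(model, image_points))
    return total / len(model)

# Model point on a horizontal edge (direction vector points upward) and an
# image point that is shifted 5 pixels along the edge and 0.5 pixels across.
p, d, q = (0.0, 0.0), (0.0, 2.0), (5.0, 0.5)
line_dist = distance_to_model_line(p, d, q)
point_dist = math.hypot(q[0] - p[0], q[1] - p[1])
```

In this example the lateral offset of 5 pixels along the edge does not contribute to `line_dist`, whereas it dominates the ordinary point-to-point distance `point_dist`.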
[0069] An approach of this type for determining a rigid transformation is described in Wallack and Manocha (1998) [Aaron Wallack and Dinesh Manocha. Robust Algorithms for Object Localization.
[0070] The desired pose parameters A and t are related to the thus computed pose parameters A′ and t′ by inverting the corresponding map, i.e., A=A′
[0071] Finally, the extracted poses of the found instances are returned to the user (step 8).
[0072] Model Generation
[0073] The model must be generated in accordance with the matching strategy discussed above. At the heart of the model generation is the feature extraction that computes the points of the model and the corresponding direction vector. This can be done by a number of different image processing algorithms. In one preferred embodiment of the invention, the direction vector is the gradient vector of the model image, which can be obtained from standard edge filters, e.g., the Sobel, Canny (see Canny (1986) [John Canny. A computational approach to edge detection.
[0074] The complete model generation method is displayed in
[0075] Then (step
[0076] After this, for each level of discretization, appropriate models are generated (step
[0077] For each level of discretization, the search space is sampled according to the discussion of the object recognition method above, using user-specified bounds on the linear transformation parameters:
[0078] The translation parameters are not sampled, i.e., fixed translation parameters t
[0079] In step (5), the transformed image of the current level of discretization, i.e., the image at the current level of the pyramid or the appropriately smoothed image, which was generated in step (
[0080] After the image has been transformed, the chosen feature extraction algorithm is applied to the transformed image (
[0081] Finally, the model points obtained in step (
[0082] The model generation strategy described above may generate a very large number of precomputed models if the search space of allowable transformations is large. As a result, the memory required to store the precomputed models will be very large, which means that the model either cannot be stored in memory or must be paged to disk on systems that support virtual memory. In the second case, the object recognition will be slowed down because the parts of the model that are needed in the recognition phase must be paged back into the main memory from disk. Therefore, if the memory required to store the precomputed models becomes too large, an alternative model generation strategy is to omit step (
[0083] The model generation strategy above transforms the image with each allowable transformation of the set of transformations in each discretization level because it tries to take into account possible anisotropic results of the feature extractor, i.e., the fact that the direction vectors the feature extractor returns may depend in a biased manner on the orientation of the feature of the image. If it is known that the feature extractor is isotropic, i.e., that the direction vectors that the feature extractor returns are correct, no matter in which orientation they occur in the image, the image transformation step can be omitted. Instead, the extracted feature points and direction vectors themselves can be transformed to obtain a precomputed set of models for all possible transformations. This model generation method is displayed in