This invention relates to a signal processing method for measuring the similarity between mutually different arbitrary segments constituting signals, and to an image-voice processing apparatus for measuring the similarity between mutually different arbitrary image and/or voice segments constituting video signals.
It is sometimes desirable to search for and reproduce parts of interest from an image application composed of a large amount of varied image data, for example a TV program recorded as video data.
When searching video data and other multimedia data, unlike the data handled by many computer applications, one cannot expect to find exactly identical data; similar data are searched for instead. Therefore almost all technologies relating to search on multimedia data are based on similarity, as described in “G. Ahanger and T. D. C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-4. 1996.”
In such similarity-based search technologies, the similarity of the contents is first measured numerically. The measurements are then used to rank the data in descending order of similarity to the subject item, according to the standard of measuring similarity. In the resulting list, the most similar data appear near the top.
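The ranking step described above can be sketched as follows. This is only an illustration: the function names and the toy similarity measure are hypothetical and are not taken from any of the cited technologies.

```python
def rank_by_similarity(query, items, similarity):
    """Return items sorted from most to least similar to the query."""
    return sorted(items, key=lambda item: similarity(query, item), reverse=True)


def toy_similarity(x, y):
    # Hypothetical measure for illustration: closer numbers are more similar.
    return -abs(x - y)


ranked = rank_by_similarity(10, [3, 12, 9, 30], toy_similarity)
```

Any numerical standard of measuring similarity can be passed in place of `toy_similarity`; the most similar items always come first in the returned list.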
In such a content-based search method for multimedia data, video processing technologies based on signal processing are first applied to the image data and voice data to extract low level features of the multimedia data. The extracted low level features are then used to find the standard of measuring similarity required for similarity-based search.
Studies on content-based search of multimedia data often focused at first on searches of still images. In such studies, the similarity among images is measured using a large number of low level image features such as color, texture, and shape.
More recently, studies on content-based search of video data have also been conducted. In the case of video data, identical parts within long video data are usually searched for. Therefore in most technologies related to CBR (content-based retrieval), the video data are first divided into streams of frames called segments. Those segments are the subject of similarity-based search. As an existing method for dividing video data into segments, a shot detection algorithm is usually used to divide the video data into so-called shots, as described in “G. Ahanger and T. D. C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-4. 1996.” In such a search, features that enable similarity-based comparison are then extracted from the shots obtained.
However, it is difficult to identify the salient features of shots and to detect features that make it possible to compare shots based on similarity. Therefore, the existing approach to content-based search of video data was usually, in place of the above-mentioned method, to extract representative frames from each shot and to search for those representative frames. Those representative frames are generally called “key frames.” In other words, content-based search of shots is reduced to content-based search of images by comparing the key frames of the shots. For example, when color histograms are extracted from the key frames of each shot, the histograms of these key frames can be used to measure the similarity of two shots. This approach is also effective for selecting the key frame.
A simple approach is to regularly select a fixed frame from each shot. Other methods for selecting a large number of frames use the frame difference described in “B. L. Yeo and B. Liu, Rapid scene analysis on compressed video, IEEE Transactions on Circuits and Systems for Video Technology, vol.5, no.6, pp.533, December 1995”, the motion analysis described in “W. Wolf, Key frame selection by motion analysis, Proceedings of IEEE Int'l Conference on Acoustics, Speech and Signal Processing, 1996”, and the clustering technology described in “Y. Zhuang, Y. Rui, T. Huang and S. Mehrotra, Adaptive key frame extraction using unsupervised clustering, Proceedings of IEEE Int'l Conference on Image Processing, Chicago, Ill. Oct. 4-7 1998.”
Incidentally, the above-mentioned search technology based on key frames is limited to searches based on the similarity of shots. However, since a typical 30-minute TV program, for example, contains hundreds of shots, the above-mentioned prior search technology must check a tremendous number of extracted shots, and searching such a huge amount of data is a considerable burden.
Therefore, it was necessary to mitigate this burden by comparing the similarities among image and voice segments longer than shots, for example scenes and programs in which segments are grouped together based on a certain correlation.
However, the prior search technologies have not met requirements such as searching for segments similar to specific commercials, or searching for scenes similar to a scene consisting of a related group of shots describing an identical performance in a TV program.
As mentioned above, almost no published studies devoted to comparisons based on the similarity of segments at higher levels than shots have been found. The only study of this kind is “J. Kender and B. L. Yeo, Video Scene Segmentation via Continuous Video Coherence, IBM Research Report, RC21061, Dec. 18, 1997”. This study provides a method for comparing the similarities between two scenes. The search technology in this study classifies all the shots of video data into categories and then counts the number of shots in every scene attributed to each category. The result obtained is a histogram that can be compared by the standard criteria of comparing similarity. It is reported that the study was successful to some extent in comparing similarity among similar scenes.
However, this method requires the classifications of all the shots of video data. Classifying all the shots is a difficult task and usually needs a technology requiring an enormous amount of computation.
Even if this method could classify all the shots exactly, it does not take into account the similarity between categories, and therefore the method can give confusing results. For example, suppose that the shots of video data are classified into three categories A, B, and C, that a scene X has no shots of the categories B and C but has two shots of the category A, and that another scene Y has no shots of the categories A and C but has two shots of the category B. In this case, according to the method, no similarity is found to exist between the scene X and the scene Y. However, if the shots in the category A and the category B are mutually similar, the similarity value should not be zero. In other words, the fact that this method does not take the similarity of the shots themselves into account sometimes leads to such a misjudgment.
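The misjudgment in this example can be checked numerically. The sketch below is illustrative only: it assumes histogram intersection as the comparison criterion (the criterion actually used in the cited study is not specified here), and the function name `histogram_intersection` is hypothetical.

```python
def histogram_intersection(h1, h2):
    """Similarity of two category histograms: sum of bin-wise minima."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))


# Shot counts per category (A, B, C) for the two scenes in the text.
scene_x = [2, 0, 0]  # two shots of category A
scene_y = [0, 2, 0]  # two shots of category B

similarity = histogram_intersection(scene_x, scene_y)
```

Under this assumed criterion the similarity of the scenes X and Y is zero however alike the shots of the categories A and B may be, because the comparison sees only the category labels.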
This invention was made in view of such a situation, and has an object of solving the problems mentioned above of the prior search technologies, and of providing a signal processing method and an image-voice processing apparatus for search based on the similarity of segments of various levels in various video data.
The signal processing method related to the present invention designed to attain the above object is a signal processing method that extracts signatures defined by the representative segments which are sub-segments that represent the contents of segments constituting signals supplied out of the sub-segments contained in the segments and a weighting function that allocates weight to these representative segments including a group selection step that selects object groups for the signatures out of the groups obtained by a classification based on an arbitrary attribute of the sub-segment, a representative segment selection step that selects a representative segment out of the groups selected in the group selection step, and a weight computing step that computes the weight for the representative segment obtained in the selection step.
The signal processing method related to the present invention extracts the signature related to the segment.
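The three steps named above (group selection, representative segment selection, weight computing) can be sketched as follows. This is a minimal sketch under stated assumptions, not the method as claimed: the grouping is assumed to be given, groups are kept when they reach a hypothetical size threshold `min_group_size`, the first member of a group is taken as its representative, and the weight is the share of sub-segments that the group covers.

```python
from collections import defaultdict


def extract_signature(sub_segments, group_of, min_group_size=2):
    """Return (r_segments, weights) for one segment.

    sub_segments: list of sub-segment identifiers, in temporal order.
    group_of: dict mapping each sub-segment to its group label (the
        classification by some attribute is assumed to be done already).
    min_group_size: hypothetical threshold used for group selection.
    """
    # Collect the members of every group, preserving temporal order.
    groups = defaultdict(list)
    for s in sub_segments:
        groups[group_of[s]].append(s)

    # Group selection step (assumption: keep groups that are large enough).
    selected = [m for m in groups.values() if len(m) >= min_group_size]

    # Representative segment selection step (assumption: first member).
    r_segments = [members[0] for members in selected]

    # Weight computing step (assumption: share of covered sub-segments).
    total = sum(len(members) for members in selected)
    weights = [len(members) / total for members in selected]
    return r_segments, weights


shots = ["s1", "s2", "s3", "s4", "s5"]
labels = {"s1": "both", "s2": "A", "s3": "B", "s4": "A", "s5": "B"}
r_segments, weights = extract_signature(shots, labels)
```

In this toy input the single shot labeled "both" forms a group that is too small and is dropped, so the signature consists of one shot from group "A" and one from group "B" with equal weights.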
The image-voice processing apparatus related to the present invention designed to attain the above object is an image-voice processing apparatus that extracts signatures defined by the representative segments which are image and/or voice sub-segments that represent the contents of the image and/or voice segments constituting video signals supplied out of the image and/or voice sub-segments contained in the image and/or voice segments and a weighting function that allocates weight to these representative segments including an execution means that selects object groups for the signatures out of the groups obtained by a classification based on an arbitrary attribute of the image and/or voice sub-segments, selects a representative segment from these selected groups and computes a weight for the representative segment obtained thereby.
The image-voice processing apparatus related to this invention thus configured extracts signatures relating to the image and/or voice segment.
A specific mode of carrying out this invention is hereinafter described in detail with reference to the drawings.
A mode for carrying out this invention is an image-voice processing apparatus that automatically extracts data representing arbitrary sets within video data in order to automatically search for and extract desired contents from the video data. Before this image-voice processing apparatus is described specifically, the video data forming the subject matter of this invention will be described.
The video data forming the subject matter of this invention are turned into a model as shown in FIG.
As for segments in video data, there are segments formed by a stream of successive frames, those that assemble streams of such frames as a scene, and then those that assemble such scenes by a certain association. And in a broad sense, a single frame can be considered as a type of segment.
In other words, a segment in video data is a generic name given independently of the height of the relevant layer, and is defined as a certain successive part of a stream of video data. Of course, a segment may be formed by a stream of successive frames as mentioned above, or may be an intermediary structure having a certain meaning, such as a structure intermediate to a scene. On the other hand, if, for example, a segment X is completely contained in a different segment Y, the segment X is defined as a sub-segment of the segment Y.
Such video data in general include both image and voice data. In other words, the frames in these video data include single still image frames and voice frames representing voice information that has been typified over a short period of time, such as several tens to several hundreds of milliseconds in length.
Segments also include image segments and voice segments. In other words, segments include so-called shots, each consisting of a stream of image frames successively shot by a single camera, and image segments of scenes grouped together into certain meaningful units using a feature representing this characteristic. Furthermore, segments include voice segments delimited by periods of silence within the video data detected by a generally known method; those formed by a stream of voice frames classified into a small number of categories such as voice, music, noise, and silence, as described in “D. Kimber and L. Wilcox, Acoustic Segmentation for Audio Browsers, Xerox PARC Technical Report”; those determined by means of voice cut detection, which detects important changes in certain features between two successive voice frames, as described in “S. Pfeiffer, S. Fischer and E. Wolfgang, Automatic Audio Content Analysis, Proceeding of ACM Multimedia 96, November 1996, pp21-30”; and those that group streams of voice frames into meaningful sets based on a certain feature.
The image-voice processing apparatus shown here as a mode for carrying out this invention automatically extracts signatures, which are general features characterizing the contents of segments in the above-mentioned video data, and also compares the similarity between two signatures. It can be applied to both image segments and voice segments. The standard of measuring similarity obtained thereby provides a general purpose tool for searching and classifying segments.
The following is an explanation of the signature. A signature generally identifies certain objects and consists of data that identify the objects with high precision by means of a smaller quantity of information than the objects themselves. For example, fingerprints may be mentioned as a type of signature for human beings. In other words, comparing the similarity of two sets of fingerprints found on a body makes it possible to determine precisely whether the same person left them.
Similarly, a signature related to image segments and voice segments is a datum that makes it possible to distinguish image segments and voice segments. This signature is given as a weighted set of the above-mentioned sub-segments obtained by dividing a segment. For example, a signature X related to a segment X is, as mentioned below, defined as a pair <R, W> consisting of a representative segment R composed of sub-segments representing the segment X and a weighting function W that allocates a weight to each element of this representative segment R.
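The pair <R, W> can be held in a small data structure. The sketch below is one possible representation, not one prescribed by the text: the class name `Signature`, its field names, and the non-negativity check on the weights are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """A signature <R, W>: r segments plus one weight per r segment."""
    r_segments: tuple   # R: the representative sub-segments
    weights: tuple      # W: weight allocated to each element of R

    def __post_init__(self):
        if len(self.r_segments) != len(self.weights):
            raise ValueError("each r segment needs exactly one weight")
        if any(w < 0 for w in self.weights):
            raise ValueError("weights must be non-negative")

    def weight_of(self, r_segment):
        """Evaluate the weighting function W on one element of R."""
        return self.weights[self.r_segments.index(r_segment)]


sig = Signature(r_segments=("shot_3", "shot_7"), weights=(0.25, 0.75))
```

Storing W as a sequence aligned with R keeps the signature compact, which matches the idea that a signature describes a segment with far less data than the segment itself.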
For the purpose of the explanation below, the term “r frame (representative frame)” is expanded so that a representative segment is referred to as an “r segment.” Accordingly, the set of all the r segments included in a signature is called “the r segment of the signature.” The type of an r segment is called its r type. When it is necessary to indicate the r type of a signature, the relevant type precedes the word “signature.” For example, an image frame signature is a signature whose r segment consists entirely of image frames, and a shot signature is a signature whose r segment consists of the above-mentioned shots. On the other hand, a segment described by a signature S is referred to as the object segment. For a signature, an image segment, a voice segment, or an r segment that includes a set of both of these segments may be used.
Such a signature has several properties that make it represent segments effectively.
In the first place, a signature not only describes shots and other short segments, which is its most important feature, but also makes it possible to describe much longer segments such as a whole scene or all of the video data.
The r segments required to characterize long object segments are normally limited in number. In other words, a signature makes it possible to characterize segments by a small amount of data.
In addition, the weight allocated to each r segment in a signature shows the importance or correlation of that r segment and thus makes it possible to identify object segments.
Moreover, in view of the fact that not only frames but also shots, scenes and any other segments can be used as r segments, a signature is nothing but a generalized concept resulting from the expansion of the so-called “key frame.”
When a segment can be broken down into a set of more simple sub-segments, these sub-segments can be used as r segments.
Such a signature can be formed at the discretion of users through a computer-assisted user interface, but in most applications it is desirable that it be extracted automatically.
The following is a description of some actual examples of signatures.
In the first place, the image frame signature of shots is, as shown in
The shot signature of a scene is, as shown in
In addition, the usages of signatures are not limited to that of visual information. As shown in
And the signature is not only useful for describing short segments but can also be used to describe the whole video program. For example, it will be possible to distinguish a specific TV program from the other TV programs by adequately choosing a plurality of shots. Such shots are repeatedly used in the above-mentioned TV program. For example, the beginning logo shot in a news program and a shot showing the news caster as shown in
An image voice processing apparatus
This image voice processing apparatus
To begin with, the image voice processing apparatus
The image voice processing apparatus
Thus, the image voice processing apparatus
Then in step S
Like in step S
Here is an explanation on feature. The term “feature” shows the feature of a segment and also a segment attribute that provides data for measuring the similarity among different segments. The image-voice processing apparatus does not depend on specific details of any feature. However, the features considered effective for use with the image-voice processing apparatus
There are a large number of known image features. They include, for example, color features (histograms) and image correlation.
Color in an image is known to be an important aspect for judging whether two images are similar. The use of color histograms to measure image similarity is well known, as described for example in “G. Ahanger and T. D. C. Little, A survey of technologies for parsing and indexing digital video. J. of Visual Communication and Image Representation 7:28-4, 1996.” A color histogram divides a 3-dimensional color space such as HSV or RGB into n regions and computes the relative proportion of the pixels of an image falling in each region. The resulting information yields an n-dimensional vector. For compressed video data, color histograms can be extracted directly from the compressed data as described for example in U.S. Pat. No. 5,708,767.
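A color histogram of this kind can be sketched as follows. The sketch assumes an RGB image given as a list of (r, g, b) tuples with 8-bit components and a uniform division of each channel; the function name and the `bins_per_channel` parameter are hypothetical, and the text does not prescribe this particular division of the color space.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Return the normalized color histogram of an image.

    pixels: iterable of (r, g, b) tuples with components in 0..255.
    The color space is split into bins_per_channel**3 regions.
    """
    n = bins_per_channel ** 3
    counts = [0] * n
    total = 0
    for r, g, b in pixels:
        # Quantize each channel to a bin index in 0..bins_per_channel-1.
        ri = r * bins_per_channel // 256
        gi = g * bins_per_channel // 256
        bi = b * bins_per_channel // 256
        counts[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1
        total += 1
    if total == 0:
        raise ValueError("image has no pixels")
    # Relative proportion of pixels in each region: an n-dimensional vector.
    return [c / total for c in counts]


image = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (0, 0, 0)]
hist = color_histogram(image)
```

With two bins per channel, n is 8; the two reddish pixels fall into the same region, so that region holds half of the pixels.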
In case where a histogram is to be extracted as a feature from sub-segments, image-voice processing apparatus
Such histograms capture the overall tone color of an image, but they lack time information. Therefore, the image-voice processing apparatus
Another feature, different from the image features mentioned above, is one related to voice. Hereafter this feature shall be referred to as a “voice feature.” The term “voice feature” means a feature that can show the contents of voice segments. Voice features include, for example, frequency analysis, pitch and level. These voice features are known from various documents.
One of such voice features is the distribution of frequency information in a single voice frame that can be obtained by means of a frequency analysis including the Fourier transform. To show the distribution of frequency information throughout a single voice sub-segment, the image-voice processing apparatus
In addition, the image voice processing apparatus
Another feature that can be mentioned here is the common image-voice feature. This is neither image feature nor voice feature, but this gives useful information to show the features of sub-segments in the image-voice processing apparatus
The image-voice processing apparatus
And the image-voice processing apparatus
This activity is computed indirectly by measuring the average value of the dissimilarity among frames of color histogram and other features. Now, if the standard of measuring dissimilarity for the feature F measured between the frame i and the frame j is defined as d
In this formula (1), b and f are the frame numbers of the first and last frames in a segment. In specific terms, the image-voice processing apparatus
The image voice processing apparatus
In the meanwhile, the standard of measuring dissimilarity which is a function that computes actual values of measuring the similarity between two sub-segments will be discussed later.
Then in step S
In specific terms, the image-voice processing apparatus
In the image-voice processing apparatus
On the other hand, the image-voice processing apparatus
In this way, the image-voice processing apparatus
Then in step S
In specific terms, the image-voice processing apparatus
In this manner, the image-voice processing apparatus
And in step S
The image-voice processing apparatus
In order to describe more specifically such a series of processes, an example of extracting shot signatures related to a scene shown in
This scene shows two persons talking to each other; it begins with a shot showing both speakers, followed by shots in which the two persons appear alternately as speakers.
In such a scene, in step S
Then in step S
And in step S
And in step S
And in step S
Thus, the image-voice processing apparatus
Then, the method of comparing the similarity of two segments by using signatures extracted will be described. In specific terms, the similarity of two segments is defined as a similarity of signatures based on r segments. Here in actual application, it is necessary to pay attention to the fact that the standard of measuring dissimilarity mentioned above or the standard of measuring similarity is defined.
Here, P=((r
Now, here is an explanation on the standards of measuring dissimilarity. The standard of measuring dissimilarity indicates that the two segments are similar when its value is small, and that they are dissimilar when its value is large. The standard of measuring dissimilarity d
Incidentally, some standards for measuring dissimilarity are applicable only to certain specific features. However, as described in “G. Ahanger and T. D. C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-4, 1996” and “L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John-Wiley and Sons, 1990,” many standards for measuring dissimilarity are generally applicable to features represented as points in n-dimensional space. Concrete examples are the Euclidean distance, the inner product, and the L1 distance. Because the L1 distance operates particularly effectively on various features including histograms, image correlation and the like, the image-voice processing apparatus adopts the L1 distance. When two n-dimensional vectors are represented by A and B, the L1 distance between A and B, d_{L1}(A, B), is given by the following formula (3):
$$d_{L1}(A, B) = \sum_{i=1}^{n} \left| A_i - B_i \right| \qquad (3)$$
Here, the subscript i denotes the i-th element of each of the n-dimensional vectors A and B.
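The L1 distance between two feature vectors can be computed as in the short sketch below. The function name `l1_distance` and the sample vectors are hypothetical and serve only to illustrate the definition.

```python
def l1_distance(a, b):
    """L1 distance between two n-dimensional feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    # Sum of the absolute differences of corresponding elements.
    return sum(abs(x - y) for x, y in zip(a, b))


hist_a = [0.5, 0.25, 0.25, 0.0]
hist_b = [0.25, 0.25, 0.0, 0.5]
d = l1_distance(hist_a, hist_b)
```

A value of zero means the two vectors are identical, and larger values indicate greater dissimilarity, in line with the convention for standards of measuring dissimilarity described above.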
As standards for measuring dissimilarity, several others are known in addition to the one mentioned above. But their detailed description is omitted here. The image-voice processing apparatus
In the first place, in the first method, the image-voice processing apparatus
And as the second method, the image-voice processing apparatus
Then as the third method, the image-voice processing apparatus
Furthermore, as the fourth method, the image-voice processing apparatus
Incidentally, for applying this formula (7), the restrictive conditions shown in the following formula (8) must be fulfilled.
The image-voice processing apparatus
The image-voice processing apparatus
In so doing, the image-voice processing apparatus
As described above, the image-voice processing apparatus
It should be noted that this invention is not limited to the mode for carrying out the invention described above. For example, the feature used in grouping together mutually similar sub-segments can obviously be other than those mentioned above. In other words, in this invention, it is sufficient that mutually related sub-segments can be grouped together based on certain information.
And needless to say this invention can be modified as the circumstances require to an extent consistent with the purpose of this invention.
As described in detail above, the signal processing method related to this invention is a signal processing method that extracts signatures defined by the representative segments, which are sub-segments representing the contents of segments, and the weighting function that allocates weights to these representative segments, and comprises a group selection step that selects object groups of signatures from among the groups obtained by classifying based on an arbitrary attribute of sub-segments, a representative segment selection step that selects a representative segment from among the groups selected at this group selection step, and a weight computing step that computes the weight of the representative segment obtained at this representative segment selection step.
Therefore, the signal processing method related to this invention can extract signatures related to segments, and can use such signatures to compare the similarity among mutually different segments independently of the hierarchy of segments in signals. Accordingly, the signal processing method related to this invention can search for segments having desired contents based on similarity from among segments of various layers in various signals.
And the image-voice processing apparatus related to this invention is an image-voice processing apparatus that extracts signatures defined by the representative segments which are image and/or voice sub-segments representing the contents of image and/or voice segments from among the image and/or voice sub-segments contained in the image and/or voice segments constituting the video signals supplied and a weighting function that allocates weights to these representative segments and comprises an execution means that selects object groups of signatures from among the groups obtained by a classification based on an arbitrary attribute of image and/or voice sub-segments, selects a representative segment from among these selected groups and computes the weight of the representative segment obtained.
Therefore, the image-voice processing apparatus related to this invention can extract signatures related to image and/or voice segments, and can use these signatures to compare the similarity among image and/or voice segments irrespective of the hierarchy of mutually different image and/or voice segments. Accordingly, the image-voice processing apparatus related to this invention can search image and/or voice segments having the desired contents based on similarity from among image and/or voice segments of various layers in various video signals.