| 5748789 | Transparent block skipping in object-based video coding systems | Lee et al. | 382/243 | |
| 6130913 | Video coding and video decoding apparatus for enlarging a decoded alpha-map signal in accordance with a reduction ratio setting information signal | Yamaguchi et al. | 375/240.25 | |
| 6167084 | Dynamic bit allocation for statistical multiplexing of compressed and uncompressed digital video signals | Wang et al. | 375/240.02 | |
| 6188728 | Block motion video coding and decoding | Hurst | 375/240.16 | |
| 6275536 | Implementation architectures of a multi-channel MPEG video transcoder using multiple programmable processors | Chen et al. | 375/240.25 | |
| 6404814 | Transcoding method and transcoder for transcoding a predictively-coded object-based picture signal to a predictively-coded block-based picture signal | Apostolopoulos | 375/240.12 |
The invention relates to coded signals that represent pictures using fewer bits than conventional picture signals and, in particular, to a transcoding method and transcoder that transcodes coded object-based picture signals to coded block-based picture signals to allow a conventional block-based picture signal decoder to decode the coded object-based picture signals.
Communication using picture signals that electronically represent still and moving pictures is becoming ubiquitous, together with the use of signal coding to increase the efficiency with which such signals can be transmitted and stored. Signal coding is crucial to overcome the many limitations that exist on transmission bandwidth and storage capacity. Most of the popular and successful conventional picture signal coding techniques, such as those known as JPEG, MPEG-1, MPEG-2, ITU H.261 and ITU H.263, code the original picture signal by subjecting it to block-based processing. In block-based processing, each picture is expressed as an array of picture elements (pixels), e.g., an array of 640×480 pixels, each of which has a pixel value. The pixel values collectively constitute the picture signal. The picture is divided into regularly-sized and located square or rectangular blocks of pixels. Processing, such as block discrete cosine transforms (block-DCT), block-based motion estimation and block-based motion compensation is then individually applied to the corresponding blocks of pixel values to code the picture signal. The picture is divided into blocks regardless of the sizes and shapes of the objects represented by the picture.
Recently, techniques have been developed for generating object-based picture signals that represent the picture as a number of objects arranged to form a scene. Techniques have also been proposed for coding such object-based picture signals, the foremost example of which is that embodied in the emerging MPEG-4 standard. In an object-based picture signal, a picture, which may be a single still picture, or one of a group of sequential still pictures constituting a moving picture, is decomposed into objects having arbitrary shapes, unlike the regularly-sized and located blocks of current block-based representations. Each object is represented by a portion of the picture signal. This technique provides a more natural decomposition of the picture signal that may enable a number of new functionalities, such as user interaction with the objects in the picture, greater content-creation flexibility, and potentially improved coding efficiency and fidelity. These advantages of representing pictures using object-based picture signals are likely to appeal especially to content creators.
Object-based picture signals require object-based coding techniques such as MPEG-4 to code, manipulate, and distribute them. However, an MPEG-4 decoder, which is required to decode a coded object-based picture signal, is inherently more complex than conventional block-based MPEG-1 or MPEG-2 decoders. Moreover, the spread of DVD, Digital TV and HDTV has put MPEG-2 decoders into widespread use. JPEG still picture decoders are also widely used. Therefore, for users who already have a JPEG or MPEG-2 decoder, and who do not want or cannot afford the additional functionalities offered by an object-based picture signal, the need arises to transcode the MPEG-4 object-based picture signal to an MPEG-2 block-based picture signal. A similar need exists with respect to still pictures. Moreover, while program content may be developed using object-based picture signals, it may be desirable to distribute the object-based content to people who only have conventional block-based decoders, such as the MPEG-2 decoders used in DVD, satellite and terrestrial digital television. Consequently, a need exists to be able to transcode coded object-based picture signals to coded block-based picture signals that are compatible with the standard decoders of such block-based coding techniques as JPEG, MPEG-1, MPEG-2, H.261 and H.263.
The input
The MPEG-2 encoder receives the conventional picture signal at its input
The conventional transcoder
Some approaches to transcoding conventional block-based picture signals in the coded domain are described by S. F. Chang and D. Messerschmitt in
However, none of the above-cited references describes a coded-domain transcoder capable of transcoding a coded object-based picture signal to a coded block-based picture signal. What is needed, therefore, is a coded-domain transcoder capable of transcoding in real-time a coded object-based picture signal representing a still or moving picture into a corresponding coded block-based picture signal. What is also needed is such a coded-domain transcoder having modest and affordable hardware requirements.
The invention provides a transcoder for transcoding a coded object-based picture signal that represents a picture to a coded block-based picture signal that also represents the picture. The coded object-based picture signal may be an MPEG-4 picture signal, for example, and the coded block-based picture signal may be an MPEG-2 picture signal, for example. The transcoder comprises a culling module, a picture composer and a partial encoder. The culling module receives the coded object-based picture signal and culls signal portions from the coded object-based picture signal to generate a culled object-based picture signal. The signal portions culled are those that represent objects not visible in the picture. The picture composer receives the culled object-based picture signal, partially decodes selected portions of the culled object-based picture signal and generates from them blocks of a partially-coded block-based picture signal in which the blocks have different coding states. The partial encoder receives the partially-coded block-based picture signal and encodes the blocks of the partially-coded block-based picture signal to generate the coded block-based picture signal in which the blocks have a uniform coding state. The coded block-based picture signal is capable of being decoded by a conventional block-based decoder.
The culling module may include an object culling module that culls, from the coded object-based picture signal, signal portions that represent objects that are not present in the picture and signal portions that represent objects that are present in the picture but are hidden.
The object-based picture signal may include a scene descriptor that describes the arrangement of the objects in the picture and may additionally include a coded shape descriptor for each of the objects. The object culling module may use the scene descriptor to identify the signal portions that represent the objects not present in the picture and may decode the coded shape descriptors of the objects identified as being present in the picture to identify the signal portions that represent the objects that are present in the picture, but are hidden.
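By way of illustration only, the two-stage object culling just described might be sketched as follows. The sketch assumes, hypothetically, that the scene descriptor supplies each object's position and bounding-rectangle size in picture coordinates, that the objects are listed front to back, and that each decoded shape descriptor is a binary mask of opaque pixels. None of these representations, nor the names `intersects_picture` and `cull_objects`, is taken from the specification.

```python
def intersects_picture(obj, pic_w, pic_h):
    """Stage 1: use only scene-descriptor data (position and size)."""
    return (obj["x"] < pic_w and obj["x"] + obj["w"] > 0 and
            obj["y"] < pic_h and obj["y"] + obj["h"] > 0)


def cull_objects(objects, pic_w, pic_h):
    """Return the names of the visible objects.

    `objects` is assumed to be ordered front to back. Each object is a dict
    with keys name, x, y, w, h and mask (rows of 0/1, 1 meaning opaque).
    """
    covered = set()   # picture pixels already hidden by objects in front
    visible = []
    for obj in objects:
        if not intersects_picture(obj, pic_w, pic_h):
            continue  # not present in the picture: its shape is never examined
        # Stage 2: the (decoded) shape mask gives the object's support.
        pixels = set()
        for r, row in enumerate(obj["mask"]):
            for c, opaque in enumerate(row):
                px, py = obj["x"] + c, obj["y"] + r
                if opaque and 0 <= px < pic_w and 0 <= py < pic_h:
                    pixels.add((px, py))
        if pixels - covered:          # at least one pixel is not hidden
            visible.append(obj["name"])
        covered |= pixels
    return visible


def solid(w, h):
    """Helper for the example: a fully opaque w x h mask."""
    return [[1] * w for _ in range(h)]


scene = [
    {"name": "front", "x": 0, "y": 0, "w": 4, "h": 4, "mask": solid(4, 4)},
    {"name": "hidden", "x": 1, "y": 1, "w": 2, "h": 2, "mask": solid(2, 2)},
    {"name": "offscreen", "x": 100, "y": 0, "w": 2, "h": 2, "mask": solid(2, 2)},
    {"name": "background", "x": 0, "y": 0, "w": 8, "h": 8, "mask": solid(8, 8)},
]
kept = cull_objects(scene, 8, 8)
```

In this example the object named "offscreen" is rejected using scene-descriptor data alone, while the object named "hidden" is rejected only after its mask is compared with the pixels covered by the object in front of it.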
The object-based picture signal may additionally include, for each of the objects, an object descriptor comprising a coded amplitude descriptor including interior tiles and boundary tiles. The culling module may additionally include a tile culling module that culls, from the object-based picture signal, signal portions that represent interior tiles and boundary tiles that are hidden in the picture.
The culled object-based picture signal may include a culled amplitude descriptor for each object visible in the picture, and the picture composer may include a tile-oriented picture composition module, a shift, mask and merge module, processing modules and a processing selection module. The culled amplitude descriptor comprises tiles representing portions of the object visible in the picture. The tile-oriented picture composition module receives the culled object-based picture signal and identifies, for each tile of the culled amplitude descriptors, at least one block of the partially-coded block-based picture signal to which the tile contributes. The shift, mask and merge module calculates shift, mask and merge parameters for each tile. The processing modules are each capable of receiving the tile or tiles contributing to each block of the partially-coded block-based picture signal and of decoding the tile or tiles to the extent that allows the block-generating processing defined by the shift, mask and merge parameters to be applied to them. The processing modules are also capable of applying the block-generating processing defined by the respective shift, mask and merge parameters to the tile or tiles to generate the block. The processing modules are each capable of decoding the tile or tiles contributing to the block to a coding state that differs among the processing modules. The processing selection module selects one of the processing modules to generate the block of the partially-coded block-based picture signal and, hence, selects the coding state in which the block is generated.
Alternatively, the culled object-based picture signal may include an amplitude descriptor for each object visible in the picture and the picture composer may include a block-oriented picture composition module, a shift, mask and merge module, processing modules and a processing selection module. Each amplitude descriptor comprises tiles. The block-oriented picture composition module receives the culled object-based picture signal and identifies, for each block of the partially-coded picture signal, the tile or tiles of the culled object-based picture signal that contribute to the block. The shift, mask and merge module calculates shift, mask and merge parameters for each tile that contributes to the block. The processing modules are each capable of receiving the tile or tiles that contribute to each block of the partially-coded block-based picture signal and of partially decoding the tile or tiles and applying thereto the respective shift, mask and merge parameters to generate the block. The processing modules are each capable of decoding the tile or tiles contributing to the block to a coding state that differs among the processing modules. The processing selection module selects one of the processing modules to generate the block of the partially-coded block-based picture signal, and, hence, the coding state in which the block is generated.
The invention also provides a method for transcoding a coded object-based picture signal representing a picture to a coded block-based picture signal representing the picture. In the method, signal portions that represent objects not visible in the picture are culled from the coded object-based picture signal to generate a culled object-based picture signal. Portions of the culled object-based picture signal are partially decoded and from them are generated blocks of a partially-coded block-based picture signal in which the blocks have different coding states. Finally, the blocks of the partially-coded block-based picture signal are re-encoded to generate the coded block-based picture signal in which the blocks have a uniform coding state.
Finally, the invention provides a computer-readable medium in which is fixed a computer program that instructs a computer to perform a transcoding operation in which a coded object-based picture signal representing a picture is transcoded to a coded block-based picture signal representing the picture. In the transcoding operation, signal portions that represent objects not visible in the picture are culled from the coded object-based picture signal to generate a culled object-based picture signal. Portions of the culled object-based picture signal are partially decoded and from them are generated blocks of a partially-coded block-based picture signal in which the blocks have different coding states. Finally, the blocks of the partially-coded block-based picture signal are re-encoded to generate the coded block-based picture signal in which the blocks have a uniform coding state.
Culling the signal portions that represent objects not visible in the picture may include culling signal portions that represent objects that are not present in the picture and culling signal portions that represent objects that are present in the picture, but are hidden.
The object-based picture signal may include a scene descriptor that describes an arrangement of the objects in the picture and may additionally include a coded shape descriptor for each object. Culling the signal portions that represent objects that are not present in the picture may include identifying, using the scene descriptor, the signal portions that represent the objects not present in the picture, and decoding the coded shape descriptors of the objects that the identifying operation identifies as present in the picture to generate respective shape descriptors. In culling the signal portions that represent objects that are present in the picture, but are hidden, the shape descriptors are used to identify the signal portions that represent the objects that are present in the picture, but are hidden.
The object-based picture signal may include an object descriptor for each object. The object descriptor comprises a coded amplitude descriptor including interior tiles and boundary tiles. Culling the signal portions that represent objects that are present in the picture, but are hidden, may include culling, from the object-based picture signal, signal portions that represent interior tiles and boundary tiles that are hidden in the picture.
The culled object-based picture signal may include a culled amplitude descriptor for each object visible in the picture. The culled amplitude descriptor for each object comprises tiles that represent the portions of the object that are visible in the picture. In this case, in partially decoding portions of the culled object-based picture signal and generating from them the blocks of the partially-coded block-based picture signal, the at least one block of the partially-coded block-based picture signal to which each tile of the culled amplitude descriptors contributes is identified. Shift, mask and merge parameters are calculated for each tile. One of a predetermined number of coding states in which to generate each block of the partially-coded block-based picture signal is selected as a selected coding state. Finally, the tile or tiles that contribute to each block of the partially-coded block-based picture signal are decoded to the selected coding state and the block-generating processing defined by the respective shift, mask and merge parameters is applied to the tile or tiles in the selected coding state to generate the block in the selected coding state.
Alternatively, the culled object-based picture signal may include an amplitude descriptor for each object visible in the picture. The amplitude descriptor comprises tiles. In partially decoding selected portions of the culled object-based picture signal and generating from them the blocks of the partially-coded block-based picture signal, for each block of the partially-coded picture signal, the tile or tiles of the culled object-based picture signal that contribute to the block are identified. Shift, mask and merge parameters are calculated for each of the tile or tiles that contribute to the block. One of a predetermined number of coding states in which to generate each block of the partially-coded block-based picture signal is selected as a selected coding state. Finally, the tile or tiles contributing to the block of the partially-coded block-based picture signal are decoded to the selected coding state and the block-generating processing defined by the respective shift, mask and merge parameters is applied to the tile or tiles in the selected coding state to generate the block in the selected coding state.
The transcoder and transcoding method according to the invention and the transcoding program fixed in the computer-readable medium according to the invention cull portions of the coded object-based picture signal that represent objects that are not visible in the picture before generating the coded block-based picture signal. Compared with conventional transcoders, transcoding methods and transcoding programs, this reduces the processing resources required to process the coded object-based picture signal to generate the coded block-based picture signal, or enables other constraints, such as processing time, to be met more easily since the culled portions of the object-based picture signal are not processed further. Moreover, the transcoder, transcoding method and transcoding program according to the invention process the culled object-based picture signal to generate at least a fraction of the blocks of the coded block-based picture signal in a partially-coded state. Compared with conventional transcoders, transcoding methods and transcoding programs, this further reduces the processing resources required to generate the coded block-based picture signal, or enables other constraints, such as processing time, to be met even more easily. The transcoder, transcoding method and transcoding program according to the invention perform less decoding of the coded object-based picture signal, and perform less encoding to generate the coded block-based picture signal. Moreover, the reduced decoding and encoding applied to the coded object-based picture signal preserve more of the original encoding of the coded object-based picture signal in the block-based picture signal. This reduces the generational quality loss compared with conventional transcoders, transcoding methods and transcoding programs.
Before describing the invention in detail, the ways in which coded block-based and coded object-based picture signals represent still and moving pictures will be briefly described. The basics of a coder for encoding a picture signal representing a gray-scale still picture are described first. Then the additional processing needed to code a picture signal representing a color still picture will be described, followed by the processing required to code moving pictures. The coders to be described operate on a digital picture signal that represents a still picture, or a sequence of still pictures constituting a moving picture. Each picture is divided into a rectangular array of picture elements (pixels). For example, each frame of a conventional NTSC television signal is divided into an array of 640×480 pixels. The digital picture signal includes a gray-scale value, which is typically an eight-bit number, for each pixel. The gray-scale values are conventionally arranged in raster-scan order starting at the top left-hand corner of the picture.
A conventional block-based coder for a gray-scale picture signal receives the digital picture signal and derives from the digital picture signal a coded picture signal that represents the picture using fewer bits. A conventional block-based coder for a gray-scale picture signal transforms the digital picture signal to another domain in which most of the signal energy is concentrated in a small fraction of the coefficients. Most commonly, the digital picture signal is partitioned into two-dimensional blocks of 8×8 pixels and the two-dimensional discrete cosine transform (DCT) of each block is calculated. This transform is often referred to as an 8×8 Block DCT. Other popular spatial transforms, such as lapped transforms and wavelet transforms, can alternatively be used.
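As a minimal sketch of the 8×8 Block DCT mentioned above, the following computes the two-dimensional DCT of one block directly from its definition. The orthonormal scaling and the function name `dct_2d` are assumptions made for illustration; practical coders use fast factorizations rather than this direct form.

```python
import math

N = 8  # block size of the 8x8 Block DCT


def dct_2d(block):
    """Orthonormal 2-D DCT (type II) of an N x N block given as rows."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


# A flat block: all of the signal energy lands in the single DC coefficient.
flat_block = [[128] * N for _ in range(N)]
flat_coeffs = dct_2d(flat_block)
```

For the flat block, only the coefficient at position (0, 0) is non-zero, which illustrates how the transform concentrates the energy of smooth picture content in a small fraction of the coefficients.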
The conventional block-based coder divides the digital picture signal into 8×8 blocks regardless of the content of the picture.
The block transform processing applied to each 8×8 block of the digital picture signal generates an 8×8 block of transform coefficients. The blocks of transform coefficients are then subject to quantizing and to entropy coding that includes run-length coding and Huffman coding. The transform coefficients are quantized by scaling each coefficient by an appropriate factor to account for the psycho-visual characteristics of the human vision system (HVS) and the number of bits available to represent the picture in its coded state. After scaling, each transform coefficient is quantized, which reduces the value of many of the coefficients to zero.
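The scaling and quantizing just described can be sketched as follows. The step-size matrix used here, which grows with spatial frequency, is purely hypothetical and is not taken from any standard; it merely stands in for factors chosen according to the characteristics of the HVS and the available number of bits.

```python
def quantize(coeffs, steps):
    """Scale each coefficient by its step size and round to an integer level."""
    return [[int(round(c / q)) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, steps)]


def dequantize(levels, steps):
    """Decoder-side reconstruction: multiply each level by its step size."""
    return [[lv * q for lv, q in zip(l_row, q_row)]
            for l_row, q_row in zip(levels, steps)]


# Hypothetical step sizes: coarser quantizing at higher frequencies.
steps = [[8 + 4 * (u + v) for v in range(8)] for u in range(8)]

# Example block of transform coefficients: a large DC term, one moderate
# low-frequency term and one small high-frequency term.
coeffs = [[0.0] * 8 for _ in range(8)]
coeffs[0][0] = 1024.0
coeffs[0][1] = -31.0
coeffs[7][7] = 3.0

levels = quantize(coeffs, steps)
recon = dequantize(levels, steps)
```

The small high-frequency coefficient is quantized to zero, and the moderate coefficient is reconstructed as -36 rather than -31, which shows where the loss in this form of coding arises.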
Runlength coding exploits the fact that the majority of the transform coefficients are quantized to zero, and reduces the number of bits required to represent the block of quantized transform coefficients by coding only the locations and amplitudes of the non-zero quantized coefficients. For example, the array of quantized transform coefficients is typically scanned in zig-zag order, and the number (i.e., the runlength) of consecutive zero-level coefficients before a non-zero level coefficient is coded. The runlength is followed by a code that represents the level of the non-zero coefficient.
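The zig-zag scan and runlength coding described above might be sketched as follows. The function names are illustrative, and the sketch simplifies practical coders in two respects that are assumptions here: the first (DC) coefficient is treated like any other, and the trailing run of zeros is simply dropped rather than signalled by a dedicated code.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag order."""
    order = []
    for s in range(2 * n - 1):            # s = row + col selects an anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run from bottom-left to top-right, odd ones the reverse.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def run_level_pairs(levels):
    """Code a block of quantized coefficients as (runlength, level) pairs."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(levels)):
        value = levels[r][c]
        if value == 0:
            run += 1                      # extend the current run of zeros
        else:
            pairs.append((run, value))    # zeros skipped, then the level
            run = 0
    return pairs                          # trailing zeros are implied


block = [[0] * 8 for _ in range(8)]
block[0][0] = 128
block[0][1] = -3
block[2][0] = 5
pairs = run_level_pairs(block)   # [(0, 128), (0, -3), (1, 5)]
```

Three pairs thus stand in for all 64 coefficients of the block; the pairs would then be passed to the Huffman coder.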
Finally, the Huffman coding applied to the run-level pairs resulting from the runlength coding exploits the statistical properties of these quantities to reduce further the number of bits required to represent the blocks of quantized transform coefficients.
The processing just described can usually represent the blocks of the picture signal using substantially fewer bits than those required to represent the original blocks. However, the processing is less effective when applied to blocks, such as the block
Block-based coding is typically applied to a digital picture signal representing a color picture first by transforming the digital picture signal from the component color space, i.e., red, green, and blue (RGB) color space, to a luminance/chrominance color space. Examples of a luminance/chrominance color space include YIQ and YUV color spaces. This transformation reduces the correlation among the three color components and also enables subsequent processing to exploit the different response of the HVS to luminance and chrominance components. For example, the spatial resolution of each of the color difference components may be reduced to be one-half, vertically and horizontally, of that of the luminance component. Each of the luminance and chrominance components is then coded using an appropriately-tuned grayscale picture signal coder as described above.
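A minimal sketch of the color-space transformation and chrominance resolution reduction follows. The luminance weights (0.299, 0.587, 0.114) and the color-difference scale factors (0.492, 0.877) are the commonly quoted values for YUV and are assumed here rather than taken from this description; averaging each 2×2 group of samples is likewise only one possible way of halving the resolution.

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to luminance Y and color differences U, V."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v


def subsample_2x2(plane):
    """Halve a plane's resolution vertically and horizontally by averaging.

    The plane is assumed to have even width and height.
    """
    height, width = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1] +
              plane[y + 1][x] + plane[y + 1][x + 1]) / 4.0
             for x in range(0, width, 2)]
            for y in range(0, height, 2)]


gray_yuv = rgb_to_yuv(100, 100, 100)   # a neutral gray pixel
chroma = [[1, 3, 5, 7],
          [1, 3, 5, 7],
          [2, 2, 6, 6],
          [2, 2, 6, 6]]
chroma_half = subsample_2x2(chroma)    # a 2 x 2 plane
```

A neutral gray pixel yields color-difference values of essentially zero, which reflects the reduced correlation among the components after the transformation.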
A picture signal representing a moving picture represents a sequence of still pictures that are acquired and displayed in rapid succession to give the impression of continuous motion. The high picture rate necessary to achieve the illusion of smooth motion usually results in considerable temporal redundancy among consecutive pictures. Specifically, consecutive pictures may typically contain the same information physically displaced between adjacent pictures. To reduce the temporal redundancy, predictive processing is typically applied. In this, a previous picture is used as the basis for encoding the current picture so that the current picture can be coded by coding only the differences between the current picture and the previous picture.
The accuracy of the prediction, and hence the differences that need coding, is greatly improved by accounting for the motion between consecutive pictures. The motion between the previous picture and the current picture is estimated, and then the reference picture that forms the basis for coding the current picture is formed by compensating the previous picture for the motion between the two pictures. The process of estimating the motion between the pictures is called motion estimation, or ME, and the process of forming a prediction of the current picture by applying motion compensation to the previous picture is called motion-compensated prediction, or MC-P.
The errors resulting from applying motion-compensated prediction to the current picture, called the MC-residual, are coded using a block-based picture signal coder of the type described above. However, different techniques are used to quantize the MC-residual to account for the different spectral distributions in a block of transform coefficients derived from a block of a digital picture signal and one derived from a block of the MC-residual.
ME/MC-P is typically applied to a picture signal representing a moving picture by partitioning each picture into square blocks of pixel values and applying ME/MC-P to each block. Each block, called a macroblock, typically has twice the linear dimensions of the blocks to which the DCT is applied. ME is a computationally intensive process since it involves performing matching operations between the current macroblock in the current picture, and all macroblocks located within ±n pixels of the corresponding position in the previous picture. Some techniques extend the matching operation to synthesized macroblocks displaced by one-half of a pixel from the actual macroblocks. The matching operations determine the location of the macroblock in the previous picture that is most similar to the current macroblock. Even when n is as small as one, this involves nine matching operations, and the number of matching operations increases rapidly as n increases above one.
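The matching operations can be sketched as an exhaustive search over all integer displacements within ±n pixels, which gives (2n+1)² candidates, or nine when n is one. The sum of absolute differences used as the matching criterion, the small block size and the function names are assumptions made for illustration only.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between the current block at (bx, by)
    and the block displaced by (dx, dy) in the previous picture."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, size, n):
    """Test every displacement within +/-n pixels; return the best match
    as (cost, dx, dy) together with the number of matching operations."""
    height, width = len(ref), len(ref[0])
    best, tested = None, 0
    for dy in range(-n, n + 1):
        for dx in range(-n, n + 1):
            if not (0 <= bx + dx and bx + dx + size <= width and
                    0 <= by + dy and by + dy + size <= height):
                continue                  # candidate falls outside the picture
            tested += 1
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best, tested


# Previous picture with a non-repeating pattern; in the current picture the
# content has moved one pixel to the right.
ref = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[y][x - 1] if x > 0 else 0 for x in range(8)] for y in range(8)]

best, tested = full_search(cur, ref, bx=2, by=2, size=4, n=1)
cost, dx, dy = best
# MC-residual of the block after compensating for the estimated motion.
residual = [[cur[2 + y][2 + x] - ref[2 + dy + y][2 + dx + x]
             for x in range(4)] for y in range(4)]
```

Here the search finds that the block came from one pixel to the left in the previous picture, and the resulting MC-residual is zero everywhere, so almost nothing remains to be coded for this block.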
In a picture signal representing a moving picture composed of a group of sequentially-acquired pictures each represented by a portion of the picture signal, the picture signal portion representing the first picture in the group is normally coded as a still picture. In other words, the coding applied to this picture signal portion is independent of the coding applied to the picture signal portions representing any of the other pictures in the group. The resulting coded picture signal is then decoded, and the resulting decoded picture signal represents a reference picture that is used to apply predictive MC processing to the picture signal portions representing the remaining pictures in the group. Bi-directional MC-processing can be used in addition to the forward MC-processing just described. In bi-directional MC-processing, coded picture signal portions representing pictures earlier and later in the sequence than the current picture are used as reference pictures to apply bi-directional MC-processing to the current picture. Regardless of whether forward or bidirectional prediction is used, if coding the MC-residual requires more bits than independently coding the original picture signal portion, then the MC-processing is turned off and the picture signal portion is coded as a still picture with no prediction. Alternatively, MC-processing may be turned off only for those macroblocks that require more bits to code than the number of bits required to code them independently.
More detailed descriptions and analysis can be found in a number of sources, e.g., J. L. Mitchell, W. Pennebaker, C. Fogg and D. LeGall, MPEG VIDEO COMPRESSION STANDARD, Chapman & Hall (1997). The techniques described above form the basis for a number of international standards for coding picture signals representing still and moving pictures. These standards include the JPEG still picture coding standard and the MPEG-1, MPEG-2, CCITT H.261, and ITU H.263 moving picture coding standards.
The coding techniques described above for picture signals representing still and moving pictures involve block-based or overlapped block-based processing. Each picture is partitioned into blocks, which may overlap, and each block is processed independently. All conventional block-DCT, lapped transform, and wavelet based coding techniques for picture signals representing still and moving pictures can be regarded as being block-based or overlapped block-based. Block-based processing is advantageous in that it provides acceptable performance and is architecturally simple to implement. However, block-based coding schemes do not exploit, and in fact totally neglect, the actual content of the picture. In effect, block-based coding schemes implicitly assume that the original picture is composed of still or moving square blocks, which is unlikely in practice. Consequently, block-based coding schemes impose an artificial structure on the picture and then try to code this structure, as opposed to recognizing the structure inherent in the picture and attempting to exploit this structure to increase the efficiency with which the picture signal representing the picture is coded.
The efficiency with which a picture signal is coded may be improved by identifying and efficiently representing the inherent structure of the picture. For example, since pictures usually contain objects, the picture may be represented in terms of two- or three-dimensional objects, which may be still or moving. This approach usually represents the actual structure of the picture more accurately than the block-based coding schemes described above.
A moving picture can be decomposed into 3-D objects or regions in a number of different ways. One way identifies the 3-D objects in the scene and tracks each 3-D object with time, but this approach is typically complex to implement. An alternative way identifies the 2-D objects or regions in a single picture at the start of the group of pictures, and tracks the evolution of the 2-D objects with time. A more practical way tracks the evolution with time of a signal representing a 2-D object, and specifically tracks how both the amplitude in the region and the shape of the region change with time.
Representing the picture
An object-based representation of a picture also provides the ability to code the object-based picture signal with an improved coding efficiency. For example, object-based representation provides a highly-accurate definition of each object's shape and motion in a group of sequentially-acquired pictures. This can provide a significant gain in MC-prediction performance. If the interior of an object has homogeneous characteristics, the homogeneity may be exploited to increase the efficiency with which the interior of the object is coded. Also, the artifacts that result from decoding a coded object-based picture signal may be less visible than the highly-structured and artificial blocking and mosquito noise artifacts that occur when a coded block-based picture signal is decoded.
Coding schemes for coding object-based picture signals are currently the topic of considerable research both within the general research community and within the MPEG-4 standardization process, see, for example, 7 IEEE Trans. on Circuits and Systems for Video Technology: Special Issue on MPEG-4 (1997 February), and MPEG-4 Overview at http://drogo.stst.it/mpeg/standards-/mpeg-4/mpeg-4.htm (1999 March).
While several different coding schemes for object-based picture signals may co-exist in the future, current standardization efforts are focused on the emerging MPEG-4 standard. MPEG-4 defines a standard for coding object-based picture signals representing still and moving pictures. Although the standard is not yet fully defined, a basic framework has been decided. The attributes of the basic framework of the proposed MPEG-4 standard can be summarized as follows. In a manner analogous to MPEG-1 and MPEG-2, the proposed MPEG-4 standard only specifies the bitstream of the coded picture signal and the characteristics of a standard decoder, and does not specify the encoder. For example, the way segmentation is applied to a picture signal representing a still or moving picture to divide the picture into objects is not defined in the standard.
In the proposed MPEG-4 standard, each picture is expressed as a number of 2-D objects having arbitrary, i.e., non-rectangular, shapes. The objects are arranged in a scene. The picture may depict all of the scene, but, when the picture is a moving picture, or when interactivity is provided, the picture will more likely depict only part of the scene. The picture signal representing the picture is composed of two main portions, a scene descriptor and, for each object in the scene, an object descriptor. The scene descriptor lists the objects in the scene and describes how the objects are arranged in the scene. Each object in the scene is described by a respective object descriptor. In a still picture, the object descriptor has two main components, a shape descriptor that defines the shape of the object and an amplitude descriptor that describes the amplitude of the object, i.e., the appearance of the object. For example, in the object-based picture
To simplify the processing, each arbitrarily-shaped object is placed in a bounding rectangle that has sides that are integral multiples of 16 pixels. The shape descriptor and the amplitude descriptor of the object are expressed in terms of the coordinate system of the bounding rectangle. The coordinate system of the bounding rectangle may be the same as, but is more often different from, the coordinate system of the picture. The position of the bounding rectangle in the picture is defined by a translation expressed in the coordinate system of the picture.
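The placement of an object in its bounding rectangle can be illustrated with a minimal sketch. It assumes the object's support is given as a 2-D list of 0/1 values in picture coordinates; the function names `bounding_rectangle` and `to_picture_coords` are hypothetical and are not taken from the MPEG-4 standard. The rectangle returned may extend beyond the mask array, since its sides are rounded up to multiples of 16 pixels.

```python
def round_up(n, multiple):
    """Round n up to the next integral multiple of `multiple`."""
    return -(-n // multiple) * multiple


def bounding_rectangle(mask, multiple=16):
    """Return (top, left, height, width) of a rectangle enclosing every
    nonzero mask pixel, with sides that are multiples of `multiple`.
    Returns None when the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, v in enumerate(row) if v]
    if not rows:
        return None
    top, left = min(rows), min(cols)
    height = max(rows) - top + 1
    width = max(cols) - left + 1
    return top, left, round_up(height, multiple), round_up(width, multiple)


def to_picture_coords(x, y, translation):
    """Map a point from bounding-rectangle coordinates to picture
    coordinates. Assumption: the two systems differ only by a translation."""
    tx, ty = translation
    return x + tx, y + ty
```

For an object 20 pixels high and 33 pixels wide, this sketch yields a bounding rectangle of 32×48 pixels.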
In the object descriptor, the shape descriptor represents the shape of an object in terms of a mask applied to the bounding rectangle
The shape descriptor may be coded in a number of different ways. One way partitions the bounding rectangle into separate 16×16-pixel macroblocks. First, each macroblock is classified as an exterior macroblock, an interior macroblock or a boundary macroblock respectively located entirely outside, entirely inside, and part-inside and part-outside the support of the object. The interior blocks are all opaque if the object is opaque and the exterior blocks are all transparent, and can be efficiently coded as such. The boundary blocks are coded using a context-based arithmetic coder. In the example shown, all of the 16-pixel macroblocks located in the boundary rectangle
In a moving picture, the evolution of the object's shape descriptor from one picture to the next can be coded by applying motion compensation to the shape descriptor using a field of shape motion vectors followed by a context-based arithmetic coder with an appropriately chosen context.
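The macroblock classification used when coding the shape descriptor can be sketched as follows. This is an illustrative sketch only: it assumes a binary mask (1 inside the support of the object, 0 outside) whose dimensions are multiples of the macroblock size, and the function name `classify_macroblocks` is hypothetical. The context-based arithmetic coding of the boundary macroblocks is not shown.

```python
def classify_macroblocks(mask, size=16):
    """Label each size x size macroblock of the bounding rectangle as
    'exterior', 'interior' or 'boundary', depending on whether it lies
    entirely outside, entirely inside, or partly inside the object."""
    labels = []
    for top in range(0, len(mask), size):
        row_labels = []
        for left in range(0, len(mask[0]), size):
            # Count the pixels of this macroblock that lie inside the object.
            inside = sum(
                mask[y][x]
                for y in range(top, top + size)
                for x in range(left, left + size)
            )
            if inside == 0:
                row_labels.append("exterior")
            elif inside == size * size:
                row_labels.append("interior")
            else:
                row_labels.append("boundary")
        labels.append(row_labels)
    return labels
```

Only the macroblocks labelled `boundary` would need to be passed on to the arithmetic coder; the other two classes can be signalled by their label alone.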
The amplitude descriptor of the object inside its support can be coded in a number of different ways. For example, the bounding rectangle
The evolution of the object's amplitude in a moving picture can be coded using motion-compensated prediction (MC-P) as in block-based moving picture signal coding. MC-P can be performed using block-based motion estimation or using parametric motion estimation, which is more sophisticated. In an exemplary type of block-based motion estimation, the bounding rectangle is divided into 16×16-pixel macroblocks and a motion estimation operation is performed on each macroblock to identify the appropriate prediction macroblock in the object in the previous picture. To improve the MC-P performance, the object in the previous picture can be extrapolated to fill the bounding rectangle. Additional features may include the ability to switch from motion estimation using 16×16 macroblocks to motion estimation using 8×8 blocks, and may additionally include the ability to use overlapped-block MC-P.
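A minimal sketch of block-based motion estimation for one macroblock is given below. It assumes an exhaustive search over a small window with the sum of absolute differences (SAD) as the matching criterion; neither the search strategy nor the criterion is prescribed by the text above, and the names `sad` and `estimate_motion` are hypothetical.

```python
def sad(cur, ref, cy, cx, ry, rx, size):
    """Sum of absolute differences between the block of `cur` at (cy, cx)
    and the block of `ref` at (ry, rx)."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def estimate_motion(cur, ref, top, left, size=16, search=4):
    """Exhaustively search `ref` within +/- `search` pixels for the block
    that best predicts the block of `cur` at (top, left).
    Returns (dy, dx, cost) of the best candidate."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = top + dy, left + dx
            # Skip candidates that fall outside the reference picture.
            if ry < 0 or rx < 0 or ry + size > h or rx + size > w:
                continue
            cost = sad(cur, ref, top, left, ry, rx, size)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2], best[0]
```

Calling `estimate_motion` with `size=8` instead of `size=16` corresponds to the switch from macroblock-based to 8×8-block motion estimation mentioned above.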
Block-based motion estimation models the moving picture as an array of moving square blocks in which the translation of each block is uniform throughout the block. Parametric motion estimation recognizes that the way the appearance of the object changes from one picture to the next can involve changes more complex than uniform translation. Such changes may include rotation, scaling, and perspective, in addition to translation. The parametric motion estimation capability of MPEG-4 enables the motion of the object to be described using more sophisticated motion models, such as affine or perspective. In this case, the object in the previous picture is transformed using the appropriate motion model, appropriate interpolation to the sampling grid is performed, and then prediction is performed using the transformed object as the reference. Parametric motion models may be applied to the entire object, or the object may be partitioned into separate regions and an appropriate motion model is applied to each region.
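The prediction step for a parametric motion model can be sketched for the affine case. The sketch assumes a six-parameter affine model that maps each pixel position in the current picture to a position in the previous picture, and uses bilinear interpolation to resample onto the sampling grid; the parameter ordering, the border clamping and the names `bilinear` and `affine_predict` are illustrative assumptions, not part of MPEG-4.

```python
def bilinear(img, x, y):
    """Sample `img` at real-valued position (x, y) by bilinear
    interpolation, clamping positions to the picture borders."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1)
    y = min(max(y, 0.0), h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def affine_predict(ref, params):
    """Build a prediction of the current picture from `ref` using the
    affine model x' = a*x + b*y + c, y' = d*x + e*y + f."""
    a, b, c, d, e, f = params
    h, w = len(ref), len(ref[0])
    return [
        [bilinear(ref, a * x + b * y + c, d * x + e * y + f) for x in range(w)]
        for y in range(h)
    ]
```

With `params = (1, 0, 0, 0, 1, 0)` the model reduces to the identity, and with only `c` and `f` nonzero it reduces to the uniform translation assumed by block-based motion estimation.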
Parametric motion models may arise naturally in a number of cases. For example, when a moving picture is synthesized, as can be done using computer graphics, the change in an object from one picture to the next may be explicitly defined by a warping function. The warping function and its parameters may be communicated directly to the coder or the coder may estimate the warping function. In another example, in some sequences of moving pictures, the same background may be common to many pictures. However, the appearance of the background may change as the result of camera motion and occlusions by objects located between the camera and the background. In these cases, it can be beneficial to code the background, which is often called a sprite, to transmit the picture signal portion representing the background only once at the beginning of the sequence, and to code the picture-to-picture changes in the background using a motion model that takes account of the camera motion. Using the appropriate motion model can greatly increase the effectiveness of the motion estimation, and, hence, the coding efficiency.
The proposed MPEG-4 standard is designed to allow some operations to be performed in the coded domain. For example, individual objects may be added or deleted by modifying the scene descriptor, and by adding or dropping the portion of the MPEG-4 bitstream that represents the object descriptor of the object. However, the proposed MPEG-4 standard does not support many other operations that it would be desirable to perform in the coded domain. Such operations include transcoding between MPEG-4 and MPEG-2 or other block-based standards.
This disclosure will describe embodiments of a transcoder, a transcoding method and a transcoding program that transcode an MPEG-4 picture signal representing a single picture to an MPEG-2 picture signal that represents the single picture. However, the embodiments described herein can easily be adapted to transcode an MPEG-4 picture signal to other types of coded block-based picture signal, to transcode other types of coded object-based picture signal to an MPEG-2 picture signal, and to transcode other types of coded object-based picture signals to other types of block-based picture signals.
The method according to the invention starts at step
In step
In step
The transcoding method ends at step
The transcoder
1. Only the portions of the MPEG-4 picture signal that represent objects visible in the picture are processed to generate the MPEG-2 picture signal;
2. The portions of the MPEG-4 picture signal that are processed are decoded only to the extent that allows the block-generating processing that generates the respective blocks of the MPEG-2 picture signal to be applied to them; and
3. The different types of tile constituting part of the MPEG-4 picture signal are adaptively decoded using the type of decoding most appropriate for the type of tile.
An MPEG-4 object-based picture signal often includes a substantial amount of additional information beyond that required to generate a given picture. For example, the MPEG-4 picture signal may include signal portions that represent objects that are not visible in the picture. Such objects include objects that are present in the scene but are not present in the picture, and objects or parts of objects that are present in the picture but are hidden by other objects. User interaction can cause such objects to become visible by bringing the objects into the picture, by changing the location of the picture in the scene or by moving the objects that hide the hidden objects, for example. Block-based picture signals do not offer interactivity, and only represent the objects and parts of objects that are actually visible in the picture. To save having to expend processing resources on transcoding the portions of the MPEG-4 picture signal that represent objects and parts of objects that are not visible, the transcoder
The culled MPEG-4 picture signal
A block-based picture signal, such as an MPEG-2 picture signal, represents a picture in terms of a fixed two-dimensional array of pixel blocks, as described above. The objects defined by an object-based picture signal, such as an MPEG-4 picture signal, are also expressed in terms of blocks of pixels. Such blocks, and blocks of parameters such as DCT coefficients derived from such blocks, will be called tiles in this disclosure to distinguish them from the blocks of the block-based picture signal. However, each object represented by a portion of the MPEG-4 picture signal can have its own coordinate system and pixel size that are rarely congruent with the coordinate system and pixel size of the picture as represented by the block-based picture signal.
Composing the block-based picture signal involves identifying the block, blocks or part of a block to which each tile of the culled MPEG-4 picture signal contributes and performing operations such as translating, scaling, and rotating to transfer the tiles from the coordinate systems of the objects to which they belong to the coordinate system of the picture. Moreover, when a block includes contributions from more than one tile, which especially occurs when the tile is a boundary tile or a partly-hidden interior tile, additional operations such as masking and merging are applied to the tiles to generate the block with the respective contributions from each of the tiles.
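The step of identifying the blocks to which a tile contributes can be sketched for the simplest case. The sketch assumes that the object's coordinate system differs from that of the picture only by an integer translation, that pixel sizes match, and that the tile lies at non-negative picture coordinates; scaling and rotation are not handled. The function name `blocks_touched` and its return format are hypothetical.

```python
def blocks_touched(tile_x, tile_y, tile_size, offset, block_size=8):
    """List the picture blocks overlapped by a square tile.

    (tile_x, tile_y) is the tile's top-left corner in object coordinates
    and `offset` is the (x, y) translation of the object in the picture.
    Returns (block_row, block_col, (left, top, right, bottom)) tuples,
    where the rectangle is the overlap region in picture coordinates
    with exclusive right and bottom edges."""
    left = tile_x + offset[0]
    top = tile_y + offset[1]
    right, bottom = left + tile_size, top + tile_size
    touched = []
    for br in range(top // block_size, (bottom - 1) // block_size + 1):
        for bc in range(left // block_size, (right - 1) // block_size + 1):
            bx, by = bc * block_size, br * block_size
            overlap = (
                max(left, bx),
                max(top, by),
                min(right, bx + block_size),
                min(bottom, by + block_size),
            )
            touched.append((br, bc, overlap))
    return touched
```

A tile that is aligned with the block grid contributes to a single block, whereas a tile displaced by half a block in each direction contributes to four blocks, each of which then requires merging with contributions from neighbouring tiles.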
In this disclosure, references to tiles of the object-based picture signal contributing to a block of the block-based picture signal will be understood to encompass the case in which a single tile contributes the entire block, and references to identifying the blocks of the block-based picture signal to which a tile of the object-based picture signal contributes will be understood to encompass the case in which the tile contributes to a single block. Moreover, references to a block of the block-based picture signal being generated from tiles of the object-based picture signal will be understood to encompass the case in which the block is generated from a single tile.
To perform the processing required to transcode the MPEG-4 picture signal to the MPEG-2 picture signal with the least demand for processing resources or to comply with other constraints such as processing time, the block-based picture composer
The block-based picture composer
The partially-coded block-based picture signal
Finally, the fully-coded block-based picture signal
The embodiments of the transcoders and the modules thereof described in this disclosure may be constructed from discrete components, small-scale or large-scale integrated circuits, suitably-configured ASICs and other suitable hardware. Alternatively, the embodiments of the transcoder and the modules thereof may be constructed using a digital signal processor, microprocessor, microcomputer or computer with internal or external memory operating in response to a program such as the transcoding program fixed in the computer-readable medium according to the invention. In computer- and DSP-based embodiments, the various modules shown herein may be ephemeral, and may only exist temporarily as the program executes. In such embodiments, the transcoding program could be conveyed to the hardware on which it is to run by embodying the program in a suitable computer-readable medium, such as a set of floppy disks, a CD-ROM or a DVD-ROM, or could be transmitted to such hardware by a suitable data link. Moreover, the modules shown in this disclosure may operate autonomously, in which case the various controllers shown may be unnecessary. The modules may also include memory, or may operate with a common memory. In these cases, the various cache memories shown may be unnecessary.
The modules of the transcoder
The object culling module
The visibility table
The visibility table
a list of identifiers identifying all the objects present at the spatial location,
depth information for each object, or other information useable to determine which of the objects is closest to the camera, and
information indicating whether each object exists over the entire spatial location or if the object exists only over part of the spatial location.
In some applications, it may be advantageous for the visibility table additionally to include an auxiliary visibility table for each object. The auxiliary visibility table includes, for each of the spatial locations in the picture, an entry composed of a single bit whose state indicates whether or not the object is visible at the corresponding spatial location. The spatial locations correspond to those represented by the main visibility table. Such information is implicitly included in the main visibility table, but it may be useful to create a separate auxiliary visibility table for each object where this information is explicitly available. The auxiliary visibility table can be generated as each object is being processed. The auxiliary visibility table itself is relatively small, since each entry can be composed of a single bit.
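The relationship between the main visibility table and an auxiliary visibility table can be sketched as a small data structure. The sketch rests on several assumptions that are not stated in the text: each entry is a list of `(object_id, depth, covers_fully)` tuples, a smaller depth means closer to the camera, and every object is opaque, so that an object is hidden at a spatial location only when a closer object covers that entire location. The names `is_visible` and `auxiliary_table` are hypothetical.

```python
def is_visible(entry, object_id):
    """Return True if `object_id` is visible at the spatial location
    described by `entry`, a list of (object_id, depth, covers_fully)."""
    depth = next((d for oid, d, _ in entry if oid == object_id), None)
    if depth is None:
        return False  # the object is not present at this location
    # Hidden only if a closer object covers the whole location.
    return not any(
        full and d < depth for oid, d, full in entry if oid != object_id
    )


def auxiliary_table(table, object_id):
    """Derive the single-bit-per-location auxiliary visibility table of
    one object from the main visibility table."""
    return [
        [1 if is_visible(entry, object_id) else 0 for entry in row]
        for row in table
    ]
```

Because each auxiliary entry is a single bit, a practical implementation might pack the rows into integers or a bit array; the list-of-lists form is used here only for readability.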
The visible object detector
The processing performed by the embodiment of the object culling module
Processing starts at step
In step
In step
In step
In step
In step
In step
In step
In step
In step
In step
In step
In step
In step
In step
In step
The tile culling module
The tile cache
The tile-type extraction module
The visibility information generator
The hidden tile detector
The reference tile test module
The partial decoder
The decoded parameter cache
In step
In step
In step
In step
In step
In step
In step
In step
In step
In step
In step
In step
In step
In step