[0001] The present application claims the benefit of U.S. Provisional Patent Application Serial No. 60/341,674, entitled “Techniques and Tools for Video Encoding and Decoding,” filed Dec. 17, 2001, the disclosure of which is incorporated by reference. The following concurrently filed U.S. patent applications relate to the present application: 1) U.S. patent application Ser. No. aa/bbb,ccc, entitled, “Motion Compensation Loop With Filtering,” filed concurrently herewith; 2) U.S. patent application Ser. No. aa/bbb,ccc, entitled, “Spatial Extrapolation of Pixel Values in Intraframe Video Coding and Decoding,” filed concurrently herewith; and 3) U.S. patent application Ser. No. aa/bbb,ccc, entitled, “Sub-Block Transform Coding of Prediction Residuals,” filed concurrently herewith.
[0002] Techniques and tools for motion estimation and compensation are described. For example, a video encoder adaptively switches between different motion resolutions, which allows the encoder to select a suitable resolution for a particular video source or coding circumstances.
[0003] Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel with 24 bits. Thus, the number of bits per second, or bitrate, of a typical raw digital video sequence can be 5 million bits/second or more.
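As a quick arithmetic check of the bitrate figure above, the raw bitrate follows directly from the frame dimensions, bit depth, and frame rate. The specific frame size and frame rate below are illustrative choices, not values stated in this application, which says only "tens or hundreds of thousands of pixels," 24 bits per pixel, and 15 or 30 frames per second.

```python
# Raw bitrate of an uncompressed video sequence.
# Frame size (176x144, QCIF) and frame rate (15 fps) are illustrative
# assumptions; the bit depth of 24 bits per pixel matches the text.

def raw_bitrate(width, height, bits_per_pixel=24, frames_per_second=15):
    """Bits per second for raw (uncompressed) video."""
    return width * height * bits_per_pixel * frames_per_second

# Even a modest 176x144 frame at 15 fps yields roughly 9.1 million
# bits/second, well above the 5 million bits/second cited in the text.
rate = raw_bitrate(176, 144)
```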
[0004] Most computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bitrate of digital video. Compression can be lossless, in which case the quality of the video does not suffer, but decreases in bitrate are limited by the complexity of the video. Or, compression can be lossy, in which case the quality of the video suffers, but decreases in bitrate are more dramatic. Decompression reverses compression.
[0005] In general, video compression techniques include intraframe compression and interframe compression. Intraframe compression techniques compress individual frames, typically called I-frames or key frames. Interframe compression techniques compress frames with reference to preceding and/or following frames; frames compressed this way are typically called predicted frames, P-frames, or B-frames.
[0006] Microsoft Corporation's Windows Media Video, Version 7 [“WMV7”] includes a video encoder and a video decoder. The WMV7 encoder uses intraframe and interframe compression, and the WMV7 decoder uses intraframe and interframe decompression.
[0007] A. Intraframe Compression in WMV7
[0008]
[0009] The encoder then quantizes (
[0010] The encoder then prepares the 8×8 block of quantized DCT coefficients (
[0011] The encoder encodes the DC coefficient (
[0012] The entropy encoder can encode the left column or top row of AC coefficients as a differential from a corresponding column or row of the neighboring 8×8 block.
[0013] The encoder scans (
[0014] A key frame contributes much more to bitrate than a predicted frame. In low- or mid-bitrate applications, key frames are often bottlenecks for performance, so efficient compression of key frames is critical.
[0015]
[0016] 1) Since the prediction is based on averages, the far edge of the neighboring block has the same influence on the predictor as the adjacent edge of the neighboring block, whereas intuitively the far edge should have a smaller influence.
[0017] 2) Only the average pixel value across the row (or column) is extrapolated.
[0018] 3) Diagonally oriented edges or lines that propagate from either predicting block (top or left) to the current block are not predicted adequately.
[0019] 4) When the predicting block is to the left, there is no enforcement of continuity between the last row of the top block and the first row of the extrapolated block.
[0020] B. Interframe Compression in WMV7
[0021] Interframe compression in the WMV7 encoder uses block-based motion compensated prediction coding followed by transform coding of the residual error.
[0022]
[0023] The WMV7 encoder splits a predicted frame into 8×8 blocks of pixels. Groups of four 8×8 blocks form macroblocks. For each macroblock, a motion estimation process is performed. The motion estimation approximates the motion of the macroblock of pixels relative to a reference frame, for example, a previously coded, preceding frame. In
[0024] Motion estimation and compensation are effective compression techniques, but various previous motion estimation/compensation techniques (as in WMV7 and elsewhere) have several disadvantages, including:
[0025] 1) The resolution of the motion estimation (i.e., pixel, ½ pixel, ¼ pixel increments) does not adapt to the video source. For example, for different qualities of video source (clean vs. noisy), the video encoder uses the same resolution of motion estimation, which can hurt compression efficiency.
[0026] 2) For ¼ pixel motion estimation, the search strategy fails to adequately exploit previously completed computations to speed up searching.
[0027] 3) For ¼ pixel motion estimation, the search range is too large and inefficient. In particular, the horizontal resolution is the same as the vertical resolution in the search range, which does not match the motion characteristics of many video signals.
[0028] 4) For ¼ pixel motion estimation, the representation of motion vectors is inefficient to the extent bit allocation for horizontal movement is the same as bit allocation for vertical movement.
[0029]
[0030] The encoder then quantizes (
[0031] The encoder then prepares the 8×8 block (
[0032] The encoder entropy encodes the scanned coefficients using a variation of run length coding (
[0033]
[0034] In summary of
[0035] The amount of change between the original and reconstructed frame is termed the distortion, and the number of bits required to code the frame is termed the rate. The amount of distortion is roughly inversely proportional to the rate. In other words, coding a frame with fewer bits (greater compression) results in greater distortion, and vice versa. One goal of a video compression scheme is to improve the rate-distortion tradeoff: to achieve the same distortion using fewer bits, or lower distortion using the same bits.
[0036] Compression of prediction residuals as in WMV7 can dramatically reduce bitrate while slightly or moderately affecting quality, but the compression technique is less than optimal in some circumstances. The size of the frequency transform is the size of the prediction residual block (e.g., an 8×8 DCT for an 8×8 prediction residual). In some circumstances, this fails to exploit localization of error within the prediction residual block.
[0037] C. Post-processing with a Deblocking Filter in WMV7
[0038] For block-based video compression and decompression, quantization and other lossy processing stages introduce distortion that commonly shows up as blocky artifacts—perceptible discontinuities between blocks.
[0039] To reduce the perceptibility of blocky artifacts, the WMV7 decoder can process reconstructed frames with a deblocking filter. The deblocking filter smoothes the boundaries between blocks.
[0040] While the deblocking filter in WMV7 improves perceived video quality, it has several disadvantages. For example, the smoothing occurs only on reconstructed output in the decoder. Therefore, prediction processes such as motion estimation cannot take advantage of the smoothing. Moreover, the smoothing by the post-processing filter can be too extreme.
[0041] D. Standards for Video Compression and Decompression
[0042] Aside from WMV7, several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262, and H.263 standards from the International Telecommunication Union [“ITU”]. Like WMV7, these standards use a combination of intraframe and interframe compression, although the standards typically differ from WMV7 in the details of the compression techniques used. For additional detail about the standards, see the standards' specifications themselves.
[0043] Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.
[0044] In summary, the detailed description is directed to various techniques and tools for motion estimation and compensation. These techniques and tools address several of the disadvantages of motion estimation and compensation according to the prior art. The various techniques and tools can be used in combination or independently.
[0045] According to a first set of techniques and tools, a video encoder adaptively switches between multiple different motion resolutions, which allows the encoder to select a suitable resolution for a particular video source or coding circumstances. For example, the encoder adaptively switches between pixel, half-pixel, and quarter-pixel resolutions. The encoder can switch based upon a closed-loop decision involving actual coding with the different options, or based upon an open-loop estimation. The encoder switches resolutions on a frame-by-frame basis or other basis.
[0046] According to a second set of techniques and tools, a video encoder uses previously computed results from a first resolution motion estimation to speed up another resolution motion estimation. For example, in some circumstances, the encoder searches for a quarter-pixel motion vector around an integer-pixel motion vector that was also used in half-pixel motion estimation. Or, the encoder uses previously computed half-pixel location values in computation of quarter-pixel location values.
[0047] According to a third set of techniques and tools, a video encoder uses a search range with different directional resolutions. This allows the encoder and decoder to place greater emphasis on directions likely to have more motion, and to eliminate the calculation of numerous sub-pixel values in the search range. For example, the encoder uses a search range with quarter-pixel increments and resolution horizontally, and half-pixel increments and resolution vertically. The search range is effectively one quarter the size of a full quarter-by-quarter-pixel search range, and the encoder eliminates calculation of many of the quarter-pixel location points.
[0048] According to a fourth set of techniques and tools, a video encoder uses a motion vector representation with different bit allocation for horizontal and vertical motion. This allows the encoder to reduce bitrate by eliminating resolution that is less essential to quality. For example, the encoder represents a quarter-pixel motion vector by adding 1 bit to a half-pixel motion vector code to indicate a corresponding quarter-pixel location.
[0049] Additional features and advantages will be made apparent from the following detailed description of different embodiments that proceeds with reference to the accompanying drawings.
[0050]
[0051]
[0052]
[0053]
[0054]
[0055]
[0056]
[0057]
[0058]
[0059]
[0060]
[0061]
[0062]
[0063] The present application relates to techniques and tools for video encoding and decoding. In various described embodiments, a video encoder incorporates techniques that improve the efficiency of interframe coding, a video decoder incorporates techniques that improve the efficiency of interframe decoding, and a bitstream format includes flags and other codes to incorporate the techniques.
[0064] The various techniques and tools can be used in combination or independently. Different embodiments implement one or more of the described techniques and tools.
[0065]
[0066] With reference to
[0067] A computing environment may have additional features. For example, the computing environment (
[0068] The storage (
[0069] The input device(s) (
[0070] The communication connection(s) (
[0071] The techniques and tools can be described in the general context of computer-readable media. Computer-readable media are any available media that can be accessed within a computing environment. By way of example, and not limitation, with the computing environment (
[0072] The techniques and tools can be described in the general context of computer-executable instructions, such as those included in program modules, being executed in a computing environment on a target real or virtual processor. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed within a local or distributed computing environment.
[0073] For the sake of presentation, the detailed description uses terms like “determine,” “select,” “adjust,” and “apply” to describe computer operations in a computing environment. These terms are high-level abstractions for operations performed by a computer, and should not be confused with acts performed by a human being. The actual computer operations corresponding to these terms vary depending on implementation.
[0074]
[0075] The relationships shown between modules within the encoder and decoder indicate the main flow of information in the encoder and decoder; other relationships are not shown for the sake of simplicity. In particular,
[0076] The encoder (
[0077] Depending on implementation and the type of compression desired, modules of the encoder or decoder can be added, omitted, split into multiple modules, combined with other modules, and/or replaced with like modules. In alternative embodiments, encoders or decoders with different modules and/or other configurations of modules perform one or more of the described techniques.
[0078] A. Video Encoder
[0079]
[0080] The encoder system (
[0081] A predicted frame [also called p-frame, b-frame for bi-directional prediction, or inter-coded frame] is represented in terms of prediction (or difference) from one or more other frames. A prediction residual is the difference between what was predicted and the original frame. In contrast, a key frame [also called i-frame, intra-coded frame] is compressed without reference to other frames.
[0082] If the current frame (
[0083] A frequency transformer (
[0084] A quantizer (
[0085] When a reconstructed current frame is needed for subsequent motion estimation/compensation, an inverse quantizer (
[0086] The entropy coder (
[0087] The entropy coder (
[0088] The compressed video information (
[0089] Before or after the buffer (
[0090] B. Video Decoder
[0091]
[0092] The decoder system (
[0093] A buffer (
[0094] The entropy decoder (
[0095] If the frame (
[0096] When the decoder needs a reconstructed frame for subsequent motion compensation, the frame store (
[0097] An inverse quantizer (
[0098] An inverse frequency transformer (
[0099] In one or more embodiments, a video encoder exploits redundancies in typical still images in order to code the I-frame information using a smaller number of bits. For additional detail about intraframe encoding and decoding in some embodiments, see U.S. patent application Ser. No. aa/bbb,ccc, entitled “Spatial Extrapolation of Pixel Values in Intraframe Video Coding and Decoding,” filed concurrently herewith.
[0100] Inter-frame coding exploits temporal redundancy between frames to achieve compression. Temporal redundancy reduction uses previously coded frames as predictors when coding the current frame.
[0101] A. Motion Estimation
[0102] In one or more embodiments, a video encoder exploits temporal redundancies in typical video sequences in order to code the information using a smaller number of bits. The video encoder uses motion estimation/compensation of a macroblock or other set of pixels of a current frame with respect to a reference frame. A video decoder uses corresponding motion compensation. Various features of the motion estimation/compensation can be used in combination or independently. These features include, but are not limited to:
[0103] 1a) Adaptive switching of the resolution of motion estimation/compensation. For example, the resolution switches between quarter-pixel and half-pixel resolutions.
[0104] 1b) Adaptive switching of the resolution of motion estimation/compensation depending on a video source with a closed loop or open loop decision.
[0105] 1c) Adaptive switching of the resolution of motion estimation/compensation on a frame-by-frame basis or other basis.
[0106] 2a) Using previously computed results of a first motion resolution evaluation to speed up a second motion resolution evaluation.
[0107] 2b) Selectively using integer-pixel motion information from a first motion resolution evaluation to speed up a second motion resolution evaluation.
[0108] 2c) Using previously computed sub-pixel values from a first motion resolution evaluation to speed up a second motion resolution evaluation.
[0109] 3) Using a search range with different directional resolution for motion estimation. For example, the horizontal resolution of the search range is quarter pixel and the vertical resolution is half pixel. This speeds up motion estimation by skipping certain quarter-pixel locations.
[0110] 4) Using a motion information representation with different bit allocation for horizontal and vertical motion. For example, a video encoder uses an additional bit for motion information in the horizontal direction, compared to the vertical direction.
[0111] 5a) Using a resolution bit with a motion information representation for additional resolution of motion estimation/compensation. For example, a video encoder adds a bit to half-pixel motion information to differentiate between a half-pixel increment and a quarter-pixel increment. A video decoder receives the resolution bit.
[0112] 5b) Selectively using a resolution bit with a motion information representation for additional resolution of motion estimation/compensation. For example, a video encoder adds a bit to half-pixel motion information to differentiate between a half-pixel increment and a quarter-pixel increment only for half-pixel motion information, not integer-pixel motion information. A video decoder selectively receives the resolution bit.
[0113] For motion estimation, the video encoder establishes a search range within the reference frame. The video encoder can center the search range around a predicted location that is set based upon the motion information for neighboring sets of pixels. In some embodiments, the encoder uses a reduced coverage range for the higher resolution motion estimation (e.g., quarter-pixel motion estimation) to balance the bits used to signal the higher resolution motion information against the distortion reduction due to the higher resolution motion estimation. Motion observed in TV and movie content tends to be dominated by finer horizontal motion than vertical motion. This is probably because most camera movements tend to be more horizontal, since rapid vertical motion seems to make viewers dizzy. Taking advantage of this characteristic, the encoder uses higher resolution motion estimation/compensation that covers more horizontal locations than vertical locations. This strikes a balance between rate and distortion, and lowers the computational complexity of the motion information search process as well. In alternative embodiments, the search range has the same resolution horizontally and vertically.
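The asymmetric coverage described above can be sketched as follows. This is a minimal illustration under assumed range bounds (which this application does not specify here): horizontal offsets step in quarter-pixel units, while vertical offsets step only in half-pixel units, so the candidate set is smaller than a full quarter-by-quarter-pixel grid over the same region.

```python
# Candidate sub-pixel offsets for motion search, measured in
# quarter-pixel units around a center location. Range bounds are
# illustrative assumptions; only the horizontal/vertical asymmetry
# comes from the text.

def candidate_offsets(h_range_qpel=3, v_range_qpel=2):
    """Return (dx, dy) offsets in quarter-pixel units.

    dx takes every quarter-pixel step; dy is restricted to half-pixel
    steps (multiples of 2 quarter-pel units), reflecting the finer
    horizontal resolution described in the text.
    """
    offsets = []
    for dy in range(-v_range_qpel, v_range_qpel + 1, 2):  # half-pel steps
        for dx in range(-h_range_qpel, h_range_qpel + 1):  # quarter-pel steps
            offsets.append((dx, dy))
    return offsets

# 7 horizontal positions x 3 vertical positions = 21 candidates,
# versus 7 x 7 = 49 for a full quarter-by-quarter-pixel grid.
grid = candidate_offsets()
```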
[0114] Within the search range, the encoder finds a motion vector that parameterizes the motion of a macroblock or other set of pixels in the predicted frame. In some embodiments, the encoder computes and switches between higher and lower sub-pixel accuracy with an efficient, low-complexity method. In alternative embodiments, the encoder does not switch between resolutions for motion estimation/compensation. Instead of motion vectors (translations), the encoder can compute other types of motion information to parameterize motion of a set of pixels between frames.
[0115] In one implementation, the encoder switches between quarter-pixel accuracy using a combination of four-tap and two-tap filters, and half-pixel accuracy using a two-tap filter. The encoder switches resolution of motion estimation/compensation on a per frame basis, per sequence basis, or other basis. The rationale is that quarter-pixel motion compensation works well for very clean video sources (i.e., no noise), while half-pixel motion compensation handles noisy video sources (e.g., video from a cable feed) much better. This is because the two-tap filter of the half-pixel motion compensation acts as a lowpass filter and tends to attenuate the noise. In contrast, the four-tap filter of the quarter-pixel motion compensation has some highpass effects, so it can preserve edges, but it also tends to accentuate the noise. Other implementations use different filters.
[0116] After the encoder finds a motion vector or other motion information, the encoder outputs the information. For example, the encoder outputs entropy-coded data for the motion vector, motion vector differentials, or other motion information. In some embodiments, the encoder uses a motion vector with different bit allocation for horizontal and vertical motion. An extra bit adds quarter-pixel resolution horizontally to a half-pixel motion vector. The encoder saves bits by coding the vertical motion vector component at half-pixel accuracy. The encoder can add the bit only for half-pixel motion vectors, not for integer-pixel motion vectors, which further reduces the overall bitrate. In alternative embodiments, the encoder uses the same bit allocation for horizontal and vertical motion.
[0117] 1. Resolution Switching
[0118] In some embodiments, a video encoder switches resolution of motion estimation/compensation.
[0119] The encoder gets (
[0120] In one implementation, the encoder computes and evaluates motion vectors as shown in FIG
[0121] In a separate path, the encoder computes (
[0122] In another implementation, the encoder eliminates a computation of a motion vector at integer-pixel accuracy in many cases by computing motion vectors as shown in
[0123] Most of the time the integer-pixel portion of the MV
[0124] If the integer-pixel MV
[0125] Returning to
[0126] Otherwise, the encoder selects (
[0127] The cost functions are defined as follows:
[0128] where J
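The closed-loop selection between resolutions can be illustrated with a sketch. The cost function below, a sum of absolute differences plus a weighted bit cost, is an assumption for illustration only; this application defines its own cost expressions J, which are not reproduced here.

```python
# Closed-loop selection between half-pixel and quarter-pixel motion
# estimation: actually form both predictions, cost each one, and keep
# the cheaper. The cost (SAD + lam * bits) is an illustrative
# assumption, not the cost functions defined in this application.

def sad(block_a, block_b):
    """Sum of absolute differences between two equal-size blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def select_resolution(orig, pred_half, pred_quarter,
                      bits_half, bits_quarter, lam=1.0):
    """Pick the resolution with the lower distortion-plus-rate cost."""
    j_half = sad(orig, pred_half) + lam * bits_half
    j_quarter = sad(orig, pred_quarter) + lam * bits_quarter
    return "half" if j_half <= j_quarter else "quarter"

# Example: the quarter-pel prediction matches better, by enough to pay
# for its extra signaling bit.
orig = [10, 20, 30, 40]
choice = select_resolution(orig, [12, 22, 28, 41], [10, 20, 31, 40],
                           bits_half=6, bits_quarter=7)
```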
[0129] 2. Different Horizontal and Vertical Resolutions
[0130] In some embodiments, a video encoder uses a search range with different horizontal and vertical resolutions. For example, the horizontal resolution of the search range is quarter pixel and the vertical resolution of the search range is half pixel.
[0131] The encoder finds an integer-pixel accurate motion vector in a search range, for example, by searching at integer increments within the search range. In a region around the integer-pixel accurate motion vector, the encoder computes a sub-pixel accurate motion vector by evaluating motion vectors at sub-pixel locations in the region.
[0132]
[0133] In an implementation in which quarter-pixel resolution is indicated by adding an extra bit to half-pixel motion vectors, the quarter-pixel location to the right of the integer-pixel location is not searched as a valid location for a quarter-pixel motion vector, although a sub-pixel value is computed there for matching purposes. In other implementations, that quarter-pixel location is also searched and a different scheme is used to represent quarter-pixel motion vectors. In alternative embodiments, the encoder uses a different search pattern for quarter-pixel motion vectors.
[0134] The encoder generates values for sub-pixel locations by interpolation. In one implementation, for each searched location, the interpolation filter differs depending on the resolution chosen. For half-pixel resolution, the encoder uses a two-tap bilinear filter to generate the match, while for quarter-pixel resolution, the encoder uses a combination of four-tap and two-tap filters to generate the match.
[0135] For half-pixel resolution, the interpolation used in the three distinct half-pixel locations H
[0136] where iRndCtrl indicates rounding control and varies between 0 and 1 from frame to frame.
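A minimal sketch of the two-tap bilinear interpolation with rounding control follows. The exact integer arithmetic shown is an illustrative assumption; the text states only that a two-tap bilinear filter is used and that iRndCtrl varies between 0 and 1 from frame to frame.

```python
# Two-tap bilinear interpolation for the three distinct half-pixel
# locations around four integer pixels: a (top-left), b (top-right),
# c (bottom-left), d (bottom-right). The rounding offsets are an
# illustrative assumption consistent with an iRndCtrl of 0 or 1.

def half_pel(a, b, c, d, iRndCtrl=0):
    """Return (horizontal, vertical, center) half-pixel values."""
    h_horiz = (a + b + 1 - iRndCtrl) >> 1            # between a and b
    h_vert = (a + c + 1 - iRndCtrl) >> 1             # between a and c
    h_center = (a + b + c + d + 2 - iRndCtrl) >> 2   # center of the four
    return h_horiz, h_vert, h_center
```

Alternating iRndCtrl from frame to frame keeps the rounding bias from accumulating across successive motion-compensated frames.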
[0137] For quarter-pixel resolution, the interpolation used for the three distinct half-pixel locations H
[0138] where t
[0139] For the quarter-pixel resolution, the encoder also searches some of the quarter-pixel locations, as indicated by Q
[0140] Alternatively, the encoder uses filters with different numbers or magnitudes of taps. In general, bilinear interpolation smoothes the values, attenuating high frequency information, whereas bicubic interpolation preserves more high frequency information but can accentuate noise. Using two bilinear steps (one for half-pixel locations, the second for quarter-pixel locations) is simple, but can smooth the pixels too much for efficient motion estimation.
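The contrast drawn above, a sharpening four-tap filter for half-pixel values combined with a bilinear step for quarter-pixel values, can be sketched as follows. The taps (-1, 9, 9, -1)/16 are an illustrative assumption of a common approximate-bicubic filter; this application does not give the tap values here.

```python
# Four-tap approximate-bicubic interpolation for a half-pixel position
# between p1 and p2, given four consecutive integer pixels p0..p3,
# followed by a two-tap bilinear step for a quarter-pixel position.
# Tap values (-1, 9, 9, -1)/16 are an assumption for illustration.

def bicubic_half(p0, p1, p2, p3):
    """Half-pixel value between p1 and p2 with a sharpening filter."""
    val = (-p0 + 9 * p1 + 9 * p2 - p3 + 8) >> 4
    return max(0, min(255, val))  # clamp to the 8-bit pixel range

def bilinear_quarter(left, half):
    """Quarter-pixel value by averaging an integer and a half-pel value."""
    return (left + half + 1) >> 1

# On a step edge 0 -> 100, the four-tap filter overshoots slightly
# (preserving the edge) where a bilinear filter would simply average.
edge_val = bicubic_half(0, 100, 100, 100)
```

The overshoot is the "highpass effect" noted earlier: it keeps edges crisp on clean sources but also amplifies noise, which is why the bilinear filter is preferred for noisy sources.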
[0141] 3. Encoding and Decoding Motion Vector Information
[0142] In some embodiments, a video encoder uses different bit allocation for horizontal and vertical motion vectors. For example, the video encoder uses one or more extra bits to represent motion in one direction with finer resolution than motion in another direction. This allows the encoder to reduce bitrate for vertical resolution information that is less useful for compression, compared to systems that code motion information at quarter-pixel resolution both horizontally and vertically.
[0143] In one implementation, a video encoder uses an extra bit for quarter-pixel resolution of horizontal component motion vectors for macroblocks. For vertical component motion vectors, the video encoder uses half-pixel vertical component motion vectors. The video encoder can also use integer-pixel motion vectors. For example, the encoder outputs one or more entropy codes or another representation for a horizontal component motion vector and a vertical component motion vector. The encoder also outputs an additional bit that indicates a quarter-pixel horizontal increment. A value of 0 indicates no quarter-pixel increment and a value of 1 indicates a quarter-pixel increment, or vice versa. In this implementation, the use of the extra bit avoids the use of separate entropy code tables for quarter-pixel MVs/DMVs and half-pixel MVs/DMVs, and also adds little to bitrate.
[0144] In another implementation, a video encoder selectively uses the extra bit for quarter-pixel resolution of horizontal component motion vectors for macroblocks. The encoder adds the extra bit only if 1) quarter-pixel resolution is used for the frame and 2) at least one of the horizontal or vertical component motion vectors for a macroblock has half-pixel resolution. Thus, the extra bit is not used when quarter-pixel resolution is not used for a frame or when the motion vector for the macroblock is integer-pixel resolution, which reduces overall bitrate. Alternatively, the encoder adds the extra bit based upon other criteria.
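The selective extra bit described in the two implementations above can be sketched on the encoder side as follows. The function signature and the validity rule for quarter-pel locations adjacent to integer-pel positions are assumptions consistent with the text (which notes such locations are not valid quarter-pixel motion vector locations in this representation).

```python
# Splitting a quarter-pel horizontal motion component into a half-pel
# code plus an optional one-bit quarter-pel increment. The bit is
# emitted only when quarter-pel resolution is on for the frame AND at
# least one component has half-pel (non-integer) resolution. Layout
# details are illustrative assumptions.

def encode_mv(mvx_qpel, mvy_hpel, quarter_pel_frame):
    """Return (half-pel x code, half-pel y code, list of extra bits).

    mvx_qpel: horizontal motion in quarter-pel units.
    mvy_hpel: vertical motion in half-pel units.
    """
    mvx_hpel, qbit = divmod(mvx_qpel, 2)
    half_pel_present = (mvx_hpel % 2 != 0) or (mvy_hpel % 2 != 0)
    send_bit = quarter_pel_frame and half_pel_present
    if qbit and not send_bit:
        # Quarter-pel offsets next to integer-pel positions are not
        # valid motion vector locations in this scheme.
        raise ValueError("invalid quarter-pel location for this scheme")
    return (mvx_hpel, mvy_hpel, [qbit] if send_bit else [])
```

For example, a 0.75-pixel horizontal motion (3 quarter-pel units) with a half-pel vertical component is coded as a half-pel pair plus a single set bit, while a purely integer-pel vector carries no extra bit at all.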
[0145]
[0146] A decoder gets (
[0147] The decoder determines (
[0148] If the decoder expects additional motion vector resolution information, the decoder gets (
[0149] The decoder then reconstructs (
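The decoder-side reconstruction can be sketched as follows, mirroring the selective extra bit described above: the bit is read only when quarter-pel resolution is on for the frame and the decoded motion vector is not integer-pel. The function name and bit-reader interface are illustrative assumptions.

```python
# Reconstructing the horizontal motion vector component in quarter-pel
# units from its half-pel code plus an optional extra bit read from
# the bitstream. Interface details are assumptions for illustration.

def decode_mvx(mvx_hpel, mvy_hpel, quarter_pel_frame, bit_reader):
    """Return the horizontal component in quarter-pel units.

    bit_reader: iterator yielding the remaining bitstream bits.
    """
    mvx_qpel = mvx_hpel * 2
    half_pel_present = (mvx_hpel % 2 != 0) or (mvy_hpel % 2 != 0)
    if quarter_pel_frame and half_pel_present:
        mvx_qpel += next(bit_reader)  # optional quarter-pel increment
    return mvx_qpel

# Half-pel code 1 (0.5 pixel) plus a set extra bit gives 3 quarter-pel
# units (0.75 pixel); an integer-pel vector consumes no extra bit.
mvx = decode_mvx(1, 2, True, iter([1]))
```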
[0150] B. Coding of Prediction Residuals
[0151] Motion estimation is rarely perfect, and the video encoder uses prediction residuals to represent the differences between the original video information and the video information predicted using motion estimation. In one or more embodiments, a video encoder exploits redundancies in prediction residuals in order to code the information using a smaller number of bits. For additional detail about coding of prediction residuals in some embodiments, see U.S. patent application Ser. No. aa/bbb,ccc, entitled “Sub-Block Transform Coding of Prediction Residuals,” filed concurrently herewith.
[0152] C. Loop Filtering
[0153] Quantization and other lossy processing of prediction residuals can cause blocky artifacts in reference frames that are used for motion estimation/compensation for subsequent predicted frames. In one or more embodiments, a video encoder processes a reconstructed frame to reduce blocky artifacts prior to motion estimation using the reference frame. A video decoder processes the reconstructed frame to reduce blocky artifacts prior to motion compensation using the reference frame. With deblocking, a reference frame becomes a better reference candidate to encode the following frame. Thus, using the deblocking filter improves the quality of motion estimation/compensation, resulting in better prediction and lower bitrate for prediction residuals. For additional detail about using a deblocking filter in motion estimation/compensation in some embodiments, see U.S. patent application Ser. No. aa/bbb,ccc, entitled “Motion Compensation Loop With Filtering,” filed concurrently herewith.
[0154] Having described and illustrated the principles of our invention with reference to various embodiments, it will be recognized that the various embodiments can be modified in arrangement and detail without departing from such principles. It should be understood that the programs, processes, or methods described herein are not related or limited to any particular type of computing environment, unless indicated otherwise. Various types of general purpose or specialized computing environments may be used with or perform operations in accordance with the teachings described herein. Elements of embodiments shown in software may be implemented in hardware and vice versa.
[0155] In view of the many possible embodiments to which the principles of our invention may be applied, we claim as our invention all such embodiments as may come within the scope and spirit of the following claims and equivalents thereto.