VISUAL ESTIMATION AND COMPRESSION OF FACIAL MOTION PARAMETERS: ELEMENTS OF A 3D MODEL-BASED VIDEO CODING SYSTEM

Hai Tao
Department of Computer Engineering, University of California, Santa Cruz, CA 95063

Thomas S. Huang
Image Processing and Formation Laboratory, Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801

Abstract

The MPEG-4 standard supports the transmission and composition of facial animation with natural video by including a facial animation parameter (FAP) set that is defined based on the study of minimal facial actions and is closely related to muscle actions. The FAP set enables model-based representation of natural or synthetic talking-head sequences and allows intelligible visual reproduction of facial expressions, emotions, and speech pronunciations at the receiver. This paper describes two key components we have developed for building a model-based video coding system: (1) a method for estimating FAPs based on our previously proposed piecewise Bézier volume deformation (PBVD) model, and (2) various methods for encoding FAPs. PBVD is a linear deformation model suitable for both the synthesis and the analysis of facial images, and each FAP corresponds to a basis function in this model. Experimental results on PBVD-based animation, model-based tracking, and spatial-temporal compression of FAPs are demonstrated in this paper.

1 INTRODUCTION

Many applications in human-computer interfaces, 3D games, model-based video coding, talking agents, and distance learning demand the communication of talking-head videos. There are at least two possible solutions to this problem: transmitting the image pixels directly, or transmitting the geometric model, the deformation method, and a set of deformation parameters and synthesizing the images at the other end of the transmission. The latter approach is called model-based video coding [1,2,8,9]. As shown in Figure 1, a model-based facial image communication system consists of three main components: (a) analysis of nonrigid facial motions, (b) compression and transmission of the face geometry model and the motion parameters, and (c) synthesis of facial images.

Figure 1. A model-based video coding system (face motion analysis; face motion parameter compression and transmission; face image synthesis).

The core of a model-based coding system is the model of the object under consideration. A face model comprises two key elements: the geometric face model and the face deformation model. They are described in MPEG-4 SNHC by the facial definition parameters (FDP) and the facial animation parameters (FAP), respectively.

The geometric model defines the shape and the texture of a face, usually in the form of a 3D polygonal mesh. We have developed a system to obtain 3D mesh models of faces from 3D CyberWare scanner data (Figure 2). The deformation model, on the other hand, describes how the face changes its shape and is used to generate the various dynamic effects needed for intelligible reproduction of facial expressions. Four categories of facial deformation models have been proposed: parameterized models [3], physical muscle models [4], free-form deformation models [10], and performance-driven animation models [11].
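As a concrete illustration of these two elements, the following sketch pairs a geometric mesh (in the spirit of an FDP-defined polygonal mesh with texture coordinates) with a linear deformation model driven by a vector of animation parameters. The class, field names, and array shapes are hypothetical choices made for this sketch, not MPEG-4 syntax.

```python
# An illustrative container (hypothetical names, not MPEG-4 FDP/FAP syntax)
# pairing the two elements of a face model: the geometric mesh and a
# linear deformation model driven by a vector of animation parameters.
from dataclasses import dataclass
import numpy as np

@dataclass
class FaceModel:
    vertices: np.ndarray      # (N, 3) neutral mesh
    triangles: np.ndarray     # (T, 3) vertex indices of the polygonal mesh
    texture_uv: np.ndarray    # (N, 2) texture coordinates for the face image
    deform_basis: np.ndarray  # (N, 3, K) one displacement field per parameter

    def deformed(self, params: np.ndarray) -> np.ndarray:
        """Neutral mesh plus a linear combination of the K deformation fields."""
        return self.vertices + self.deform_basis @ params
```

Any of the four deformation-model categories listed above could, in principle, populate such a linear basis; the PBVD model discussed next is one linear instance.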
In the motion analysis phase, these deformation models are applied as constraints that regulate the facial ...

    V = R(V_0 + LP) + T,    (2)

where V_0 is the neutral facial mesh, R is the rotation determined by the three rotation angles \Omega, and T is the 3D translation.

Figure 3. (a) The PBVD volumes; (b) the expression smile.

Figure 4. Action units around the mouth region. Top: the control nodes. Bottom: the action units.

Figure 5. Expressions and visemes created using the PBVD model: the expressions and visemes (bottom row) and their corresponding control meshes (top row). The facial movements are, from left to right, neutral, anger, smile, vowel_or, and vowel_met.

Figure 6. An animation sequence with a smile and the speech "I am George McConkie."

2.2 PBVD model-based tracking algorithm

2.2.1 Video analysis of the facial movements

Several algorithms for extracting face motion information from video sequences have been proposed [1,2,8,9]. Most of these methods are designed to detect action-unit-level animation parameters, under the assumption that the basic deformation model is already given and will not change. In this section, we propose a tracking algorithm in the same flavor, but using the PBVD model. The algorithm adopts a coarse-to-fine framework to integrate the low-level motion field information with the high-level deformation constraints. Since the PBVD model is linear, an efficient optimization process using a least-squares estimator (LSE) is formulated to incrementally track the head pose and the facial movements. The derived motion parameters can be used for facial animation, expression recognition, and bimodal speech recognition.

Figure 7. Block diagram of the model-based PBVD tracking system (template matching, LSE model fitting, and model deformation, iterated from frame to frame).

2.2.2 Model-based tracking using the PBVD model

The changes of the motion parameters between two consecutive video frames are computed based on the motion field. The algorithm is shown in Figure 7. We assume that the camera is stationary. At the initialization stage, the face needs to be in an approximately frontal view so that the generic 3D model can be fitted. The inputs to the fitting algorithm are the positions of facial feature points, which are manually picked. All motion parameters are set to zero (i.e., (\hat{T}_0, \hat{\Omega}_0, \hat{P}_0) = 0), which means a neutral face is ...

... adding these changes to (\hat{T}_n, \hat{\Omega}_n, \hat{P}_n), the estimated new motion parameters are derived as (\hat{T}_{n+1}^{(0)}, \hat{\Omega}_{n+1}^{(0)}, \hat{P}_{n+1}^{(0)}). Similarly, changes of the motion parameters are computed in the half-resolution images as (d\hat{T}^{(1)}, d\hat{\Omega}^{(1)}, d\hat{P}^{(1)}), based on the previous motion parameter estimates (\hat{T}_{n+1}^{(0)}, \hat{\Omega}_{n+1}^{(0)}, \hat{P}_{n+1}^{(0)}). This process continues until the original resolution is reached.

Figure 8. The coarse-to-fine PBVD tracking algorithm: template matching, LSE model fitting, and model deformation are repeated on the 1/4-resolution, 1/2-resolution, and original frames.
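To make the per-frame update concrete, here is a minimal numpy sketch of one LSE step in the spirit of Figure 7: the projected mesh motion is linearized around the current estimates (\hat{T}_n, \hat{\Omega}_n, \hat{P}_n) and the parameter increments are solved for in a least-squares sense. This is not the authors' implementation; the orthographic projection, the numerical Jacobian, and all function names are assumptions made for this sketch.

```python
# A minimal numpy sketch (not the authors' code) of one LSE update of a
# PBVD-style tracker: linearize the projected mesh motion around the
# current parameter estimates and solve for the increments of T, Omega, P.
import numpy as np

def rotation(omega):
    """Rotation matrix from three Euler angles (rx, ry, rz)."""
    rx, ry, rz = omega
    Rx = np.array([[1, 0, 0], [0, np.cos(rx), -np.sin(rx)], [0, np.sin(rx), np.cos(rx)]])
    Ry = np.array([[np.cos(ry), 0, np.sin(ry)], [0, 1, 0], [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0], [np.sin(rz), np.cos(rz), 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def deform(V0, L, T, omega, P):
    """Eq. (2): V = R(V0 + L P).  V0: (N, 3), L: (N, 3, K), P: (K,)."""
    return (V0 + L @ P) @ rotation(omega).T + T

def project(V):
    """Orthographic projection onto the image plane (an assumption here)."""
    return V[:, :2]

def lse_update(V0, L, T, omega, P, flow, eps=1e-4):
    """One least-squares update of (T, omega, P) from the 2D motion field
    `flow` (N, 2) measured at the projected mesh nodes."""
    theta = np.concatenate([T, omega, P])
    n_par = theta.size

    def predict(th):
        return project(deform(V0, L, th[:3], th[3:6], th[6:])).ravel()

    base = predict(theta)
    # Numerical Jacobian of the projected node positions w.r.t. the parameters.
    J = np.empty((base.size, n_par))
    for k in range(n_par):
        step = np.zeros(n_par); step[k] = eps
        J[:, k] = (predict(theta + step) - base) / eps
    d_theta, *_ = np.linalg.lstsq(J, flow.ravel(), rcond=None)
    theta = theta + d_theta
    return theta[:3], theta[3:6], theta[6:]
```

In the coarse-to-fine scheme of Figure 8, the same update would simply be run on subsampled frames first, with the result used to initialize the next finer level.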
In this coarse-to-fine algorithm, motion vector computation can be achieved with smaller search regions and smaller templates. In our implementation, the number of multiplications for each motion vector is [(5 × 5) × (7 × 7) × 3] × 4 = 14,700, which is about seven times fewer than in the model-based scheme. A more important property of this method is that, to a certain extent, the coarse-to-fine framework integrates motion vector computation with the high-level constraints: the computation of the motion parameter changes is based on the approximate motion parameters obtained from the lower-resolution images. As a result, more robust tracking results are obtained.

2.2.4 Implementation and experimental results

The PBVD model has been implemented on an SGI ONYX machine with a VTX graphics engine. Real-time tracking at 10 frames/s has been achieved using the coarse-to-fine framework. The tracker has also been used for bimodal speech recognition and bimodal emotion recognition, and an explanation-based method has been implemented to improve the facial image synthesis.

In the PBVD tracking algorithm, the choice of deformation units D_i depends on the application. In a bimodal speech recognition application, 6 action units are used to describe the motions around the mouth; the tracking result for each frame is twelve parameters, comprising the rotation, the translation, and the intensities of these action units. For the bimodal emotion recognition and the real-time tracking system, 12 action units are used. Users can design any set of deformation units for the tracking algorithm; these deformations can be defined either at the expression level or at the action-unit level.

Lip tracking results are shown in Figure 10, and Figure 9 shows the results of the real-time tracker. Facial animation sequences are generated from the detected motion parameters. Figure 11 shows the original video frames and the synthesized results. The synthesized face model uses the initial video frame as its texture; the texture-mapped model is then deformed according to the motion parameters.

Figure 9. Snapshots of the real-time demonstration system. The ratio between the heights of the two color bars indicates the quality of the tracking result.

Figure 10. Lip tracking for bimodal speech recognition.

Figure 12. FAP interpolation results for the expression Surprise. Only the expression FAP is sent for each frame, and the 27 low-level FAPs are interpolated.

Each directed link in the FIG (FAP interpolation graph) represents a set of interpolation functions. Suppose F_1, F_2, ..., F_n are the FAPs in a parent node and f_1, f_2, ..., f_m are the FAPs in a child node. Then there are m interpolation functions, denoted as

    f_1 = I_1(F_1, F_2, \ldots, F_n),
    f_2 = I_2(F_1, F_2, \ldots, F_n),
    \ldots
    f_m = I_m(F_1, F_2, \ldots, F_n).    (6)

Each interpolation function I(\cdot) is in a rational polynomial form,

    I(F_1, F_2, \ldots, F_n) = \frac{\sum_{i=0}^{K-1} c_i \prod_{j=1}^{n} F_j^{l_{ij}}}{\sum_{i=0}^{P-1} b_i \prod_{j=1}^{n} F_j^{m_{ij}}},    (7)

where K and P are the numbers of polynomial products, c_i and b_i are the coefficients of the i-th products, and l_{ij} and m_{ij} are the powers of F_j in the i-th products. For each child FAP, the encoder sends the decoder an interpolation function table that contains all K, P, s_i, c_i, b_i, l_{ij}, and m_{ij}. Because rational polynomials form a complete functional space, any possible finite interpolation function can be represented in this form to any given precision.

Figure 12 illustrates the results of applying the FAP interpolation scheme to an MPEG-4 test sequence. Only the magnitude of the expression parameter is transmitted, and the rest of the FAPs are generated using the pre-transmitted interpolation method.
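As a small, purely illustrative sketch of Eq. (7), the helper below evaluates one child FAP from its parent FAPs as a ratio of two sums of products. The function name, argument layout, and the example coefficients are assumptions made for this sketch, not part of the MPEG-4 FIT syntax.

```python
# A small sketch (hypothetical helper, not the MPEG-4 reference code) of
# evaluating one FAP interpolation function in the rational polynomial
# form of Eq. (7): a ratio of two sums of products of parent-FAP powers.
import numpy as np

def eval_interpolation(F, c, l_exp, b, m_exp):
    """F: parent FAP values, shape (n,).
    c, b: coefficients of the K numerator / P denominator products.
    l_exp, m_exp: integer exponent tables, shapes (K, n) and (P, n)."""
    F = np.asarray(F, dtype=float)
    numerator = sum(c_i * np.prod(F ** np.asarray(l_i)) for c_i, l_i in zip(c, l_exp))
    denominator = sum(b_i * np.prod(F ** np.asarray(m_i)) for b_i, m_i in zip(b, m_exp))
    return numerator / denominator

# Example: a child FAP reproduced as a plain linear combination of two
# parent FAPs, f = 0.5*F1 + 0.25*F2 (denominator fixed to the constant 1).
f = eval_interpolation(
    F=[10.0, -4.0],
    c=[0.5, 0.25], l_exp=[[1, 0], [0, 1]],
    b=[1.0],       m_exp=[[0, 0]],
)
print(f)  # 4.0
```

Fixing the denominator to the constant 1, as in the example, reduces the rational form to an ordinary polynomial interpolation, which is its simplest useful case.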
3.2 Compression of FAPs Using PCA

The aforementioned FIT approach encodes fixed relationships among FAPs that are valid in all frames, so the original FAP data is represented by a smaller subset. A more general tool for exploiting both the deterministic and the statistical correlation among FAPs is principal component analysis (PCA) [13], which converts the original FAP data into a new, compact form. This method is motivated by the observation that the different parts of a human face are articulated harmoniously: even though fixed relations may be absent or difficult to derive, statistically strong correlation does exist.

To apply the PCA technique, the major axes are computed as the eigenvectors of the covariance matrix of the FAP vectors, where each FAP vector is formed by the FAPs of a particular frame. The eigenvalues of the covariance matrix indicate the energy distribution. The major axes corresponding to the significant eigenvalues form a new low-dimensional subspace, and a compact representation is obtained by projecting the original FAP vector into this subspace. Re-projecting the compact representation back into the original FAP space produces a good approximation of the original FAP vector. This process is also called the Karhunen-Loeve transform (KLT) [14].

Suppose the original FAP vectors are v_1, v_2, ..., v_n, where each v_i is a column vector containing the m FAPs of a particular frame. Then the m × m covariance matrix is computed as

    C = \frac{1}{n-1} \sum_{i=1}^{n} (v_i - \bar{v})(v_i - \bar{v})^t,    (8)

where \bar{v} is the mean of the v_i. Since for most FAPs the average position is the neutral expression, or 0, the covariance matrix becomes

    C = \frac{1}{n-1} \sum_{i=1}^{n} v_i v_i^t.    (9)

Since C is a nonnegative definite matrix, all of its eigenvalues are nonnegative real values. We denote the eigenvalues in descending order as \lambda_1, \lambda_2, ..., \lambda_m and their corresponding eigenvectors as u_1, u_2, ..., u_m. Suppose that the first k eigenvalues are significantly large, or that the percentage of energy \alpha = \sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{m} \lambda_i exceeds a certain threshold; these k eigenvectors then form a subspace that preserves most of the information in v_i. Each v_i is projected into this new subspace by performing the linear transformation

    q_i = \begin{bmatrix} u_1^t \\ u_2^t \\ \vdots \\ u_k^t \end{bmatrix} v_i.    (10)

The derived k-dimensional vector q_i is encoded and transmitted through the channel. To approximate v_i from q_i, the following linear transformation is performed at the decoder side:

    \hat{v}_i = [\, u_1 \; u_2 \; \cdots \; u_k \,] \, q_i.    (11)

PCA reduces the dimension of the FAP data dramatically. Although some new components of q_i may have larger data ranges and need more bits for coding, significant bit savings are still achieved. It should be noted that the eigenvectors u_j, j = 1, ..., k, for each FAP sequence also need to be sent in the setup stage to ensure that the decoder recovers \hat{v}_i correctly. For a low-bandwidth system with limited resources for downloading, a set of universal major axes u_j can be defined so that both the encoder and the decoder include this KLT and no explicit setup is necessary for each sequence. This universal transform can be obtained by applying PCA to a large amount of training data containing various facial motions.
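The following compact numpy sketch illustrates Eqs. (8)-(11) under the zero-mean assumption of Eq. (9): it forms the covariance of the FAP vectors, keeps the leading eigenvectors that capture a chosen fraction of the energy, projects each frame, and reconstructs it. It is an illustration only; the function name, the energy threshold, and the synthetic data are assumptions, not the encoder described in the text.

```python
# A compact numpy sketch (an illustration, not the MPEG-4 normative tool)
# of Eqs. (8)-(11): build the covariance of the FAP vectors, keep the k
# leading eigenvectors, project each frame's FAPs, and reconstruct them.
import numpy as np

def pca_fap_codec(V, energy=0.99):
    """V: FAP data, shape (n_frames, m).  Returns (Q, U_k, V_hat)."""
    n, m = V.shape
    C = V.T @ V / (n - 1)                     # Eq. (9): zero-mean assumption
    lam, U = np.linalg.eigh(C)                # eigh returns ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]            # reorder to descending
    k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), energy)) + 1
    U_k = U[:, :k]                            # the k major axes u_1 .. u_k
    Q = V @ U_k                               # Eq. (10): k-dim codes per frame
    V_hat = Q @ U_k.T                         # Eq. (11): decoder reconstruction
    return Q, U_k, V_hat

# Toy usage on synthetic, correlated "FAP" trajectories.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200)
V = np.column_stack([np.sin(t), 0.8 * np.sin(t), np.cos(t)]) + 0.01 * rng.standard_normal((200, 3))
Q, U_k, V_hat = pca_fap_codec(V)
print(Q.shape[1], np.abs(V - V_hat).max())
```

The printed reconstruction error stays small because a few principal axes already capture the correlated structure of the toy trajectories, which is the situation the text describes for real FAP data.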
... perceptual sensitivity to error. Because the dc coefficient is the mean value of the segment and is prone to error, its quantization step is three times smaller than that of the ac coefficients. For the quantized dc coefficients, a predictive coding method is applied between segments to take advantage of the correlation between them. In an intrasegment, the quantized dc coefficient value is stored directly. In an intersegment, the quantized dc coefficient of the previous segment is used as the prediction, and the prediction error is encoded with a Huffman coding method.

For the nonzero ac coefficients in each segment, both their positions and their values need to be encoded. To encode the positions, a run-length encoder records, for each nonzero ac coefficient, the number of leading zeros, and a special symbol indicates the last nonzero ac coefficient in a segment. Since the segment length is 16, the possible run-length values range from 0 to 14; taking the "end_of_segment" symbol into account, the Huffman table of the run-length encoder therefore contains 16 symbols. The values of the nonzero ac coefficients are then encoded using a Huffman encoder.

As in the predictive FAP coding scheme, quantization steps need to be carefully assigned to each parameter. Since the properties of the DCT coefficients differ from those of the original data, different values need to be deduced, and again empirical results are crucial for justifying them. To further exploit human perceptual properties, different quantization steps should be designed for ac coefficients of different frequencies, and subjective experiments need to be conducted on the resulting animations. From careful examination, we proposed a set of DCT quantization steps, which is included in the MPEG-4 visual committee draft [16].

3.4 Reduction of FAP Spatial and Temporal Redundancy

Compression methods in the spatial domain (among FAPs) are orthogonal to methods in the temporal domain. The former benefit from the correlation among the FAPs within a single frame, whereas the latter take advantage of the temporal correlation of each FAP. Combining the two approaches yields a hybrid scheme that achieves much better compression performance.

Based on predefined rules, the FIT method allows inputs with different numbers of FAPs in each frame. For example, in one frame the FAP for raising the left eyebrow may be present while the FAP for the right eyebrow is not; in another frame the FAP for raising the right eyebrow appears but the FAP for the left eyebrow does not. A FIT for left-right duplication easily handles both situations and interprets both frames correctly. The PCA method, on the other hand, requires no a priori knowledge about the data but accepts only one type of input: the same set of FAPs must appear in all frames. Neither FIT nor PCA introduces temporal latency, and different applications may choose the appropriate one for FAP dimension reduction.

Because predictive coding can be lossless, it is the first candidate for temporal compression when fidelity is the major concern. However, because predictive coding only de-correlates two consecutive frames, the predictive method is much less efficient than the DCT method when the FAP sampling rate is relatively high (≥ 10 Hz) and strong correlation therefore exists within each temporal segment. Figure 13 shows the compression performance of the various coding methods. We conclude that a combination of the PCA method and the DCT method gives the best performance at very low bit rates, while the DCT method is superior to the other methods at higher bit rates.

Figure 13. Compression results for the Marco30 sequence using PC, PCA+PC, DCT, and PCA+DCT, plotted against the total number of bits.
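To make the temporal DCT scheme above concrete, here is a schematic numpy sketch that splits one FAP trajectory into 16-frame segments, quantizes the dc coefficient with a step three times finer than the ac step, predicts each dc value from the previous segment, and run-length codes the positions of the nonzero ac coefficients. The quantization step, the symbol layout, and the omission of the final Huffman stage are simplifications made for this sketch, not the normative MPEG-4 coder.

```python
# A schematic sketch (illustrative choices, not the normative MPEG-4 coder)
# of segment-based temporal DCT coding of one FAP trajectory.
import numpy as np

SEG = 16

def dct_matrix(n=SEG):
    """Orthonormal DCT-II basis; row k is the k-th basis function."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * t + 1) / (2 * n))
    M[0] /= np.sqrt(2.0)
    return M

def encode_fap_track(fap, ac_step=2.0):
    """Encode one FAP trajectory whose length is a multiple of SEG frames."""
    D = dct_matrix()
    symbols, prev_dc = [], 0
    for seg in np.asarray(fap, dtype=float).reshape(-1, SEG):
        coef = D @ seg
        dc = int(round(coef[0] / (ac_step / 3.0)))   # dc step = ac step / 3
        ac = np.round(coef[1:] / ac_step).astype(int)
        # dc: predicted from the previous segment (the first segment is
        # effectively stored directly because prev_dc starts at 0).
        symbols.append(("dc", dc - prev_dc))
        prev_dc = dc
        run = 0                                      # leading-zero run length
        for a in ac:
            if a == 0:
                run += 1
            else:
                symbols.append(("ac", run, int(a)))
                run = 0
        symbols.append(("end_of_segment",))          # marks the last nonzero ac
    return symbols

# Toy usage: a slowly varying FAP sampled over two segments.
track = 30 * np.sin(np.linspace(0, np.pi, 2 * SEG))
print(encode_fap_track(track)[:6])
```

A companion decoder would invert these steps in reverse order: rebuild each segment's quantized coefficients from the symbols, dequantize, and apply the inverse DCT (the transpose of the orthonormal matrix).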
4 CONCLUSIONS

In this paper, we described a model-based video coding system built around the PBVD facial animation model. We presented a PBVD model-based tracking algorithm and various compression techniques for encoding facial motion parameters. Our future research will focus on enabling more realistic facial animation and on improving the accuracy and robustness of the nonrigid motion estimation.

ACKNOWLEDGMENTS

This work was supported in part by the Army Research Laboratory under Cooperative Agreement No. DAAL01-96-0003.