IEEE International Conference on Multimedia & Expo 2000
New York, New York
Electronic Proceedings
© 2000 IEEE


Partial Update of Active Textures for Efficient Expression Synthesis in Model-Based Coding

Lijun Yin and Anup Basu
Department of Computing Science
University of Alberta, Edmonton, AB
T6G 2H1, Canada

Email: {lijun, anup}@cs.ualberta.ca
http://www.cs.ualberta.ca/~lijun
http://www.cs.ualberta.ca/~anup

Abstract

An image-based facial expression synthesis method is presented in this paper for realistic face animation in model-based coding. In contrast to the conventional whole-texture update method, we propose a partial texture update scheme restricted to Active Texture areas, which reduces the bit rate efficiently. Facial expressions are synthesized using temporal blending and spatial filtering so that smooth texture fusion is achieved. Active texture extraction, compression and composition are the principal steps in realizing the proposed scheme. Experiments on video sequences demonstrate the advantages of the proposed algorithm: low-cost transmission with high-fidelity reproduction. Life-like expressive human faces with wrinkles are generated at a bit rate of less than 40 kb/s.




1. Introduction

Image-based texture synthesis has recently attracted increasing attention as a means of creating realistic objects. Several works have been reported in the context of the SNHC group formed within MPEG-4 [7][1], multi-texture mapping and eigen-texture analysis [10][11]. One important application is the generation of realistic facial expressions [3][6]. Vivid facial textures, such as furrows, are very difficult to capture and model. The most widely used facial action coding system (FACS) has also been shown to have difficulty producing a large variety of complicated facial expressions [8]. Given these limitations of current computer graphics technology, facial texture extraction and update are the best choice for model-based coding. However, updating the entire texture with traditional waveform encoding methods (e.g., JPEG, MPEG) is expensive and requires high bit rates. Strom et al. [9] coded the entire face texture by employing the classical eigenface approach, while Guenter et al. [3] produced a face by capturing six camera views at high resolution. Both approaches generate impressive image quality; however, they require a high bit rate (>200 kb/s), high computational complexity and large storage space for textures. Choi et al. [4] encoded the texture in two parts, orthonormalization for the global texture and simple quantization for the local texture, and achieved a very low transmission cost. However, the local texture is updated at a constant rather than a dynamic rate, and neither feature detection nor active texture detection is addressed.

In this paper, we propose an efficient method to detect and update the textures of interest in active areas, and then synthesize facial expressions from these captured textures and the first-frame texture with a texture blending technique. Psychological research shows that a significant contribution to a realistic facial expression comes not only from the facial organs (i.e., eyes, mouth), but also from the facial wrinkles generated by the expressions [2]. The most significant wrinkles on a human face fall into four types: forehead, glabella, crow's-feet and nasolabial, as shown in Figure 3(a). Textures of interest (TOI) are defined over these four types of wrinkle areas and the mouth and eye areas, as shown in Figure 3(c). Among the TOI, only the active textures (AT), which arise with the different expressions, are used for image synthesis. After model adaptation, active texture extraction and compression, a facial expression can be generated by composing the AT with the first-frame texture using a temporal blending and spatial filtering technique. The system is outlined in Figure 1.

[systemcomposition]
Figure 1. Flow chart of the system composition.

In Section 2, facial model adaptation is described. Section 3 explains AT detection and compression. Face texture synthesis using a blending technique is described in Section 4, followed by the experiments in Section 5. Concluding remarks are given in Section 6.


2. Face model adaptation

In order to obtain the textures of interest, a 3D wireframe model is first matched to an individual face to track the motion of facial expressions. A new dynamic mesh matching algorithm is employed, consisting of two stages:

2.1 Feature detection

The feature vertices on the model correspond to a number of feature points on the face, located on the contours of the eyes, mouth, eyebrows, nostrils, nose sides and head silhouette. The active tracking algorithm for the head silhouette and the deformable template matching using color (saturation, hue) information for the organ contours are developed in [13]; the eyebrow contours are obtained by an integral projection method, and the nostrils, nose sides and nose shape are estimated by statistical correlation template matching [12]. Figure 2(b) shows the feature vertices of a model which correspond to the feature points detected on the face.
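
A minimal sketch of the integral projection step used for the eyebrow contour is given below (the search-region bounds, function names and the use of a row-wise minimum are illustrative assumptions, not values taken from the tracking system):

    # Illustrative sketch of integral projection for locating the eyebrow.
    # The search-region bounds are assumed to come from the detected eye position.
    import numpy as np

    def horizontal_integral_projection(gray, top, bottom, left, right):
        """Sum the gray levels of each row inside a rectangular search region."""
        region = gray[top:bottom, left:right].astype(np.float64)
        return region.sum(axis=1)                 # one value per row

    def locate_eyebrow_row(gray, top, bottom, left, right):
        profile = horizontal_integral_projection(gray, top, bottom, left, right)
        return top + int(np.argmin(profile))      # the eyebrow appears as a dark minimum

A vertical projection over the columns of the same region can bound the eyebrow horizontally in the same way.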

2.2 Coarse-to-fine dynamic mesh adaptation

(a) Coarse adaptation: The dynamic mesh (DM) is a well-known approach for adaptive sampling of images and physically based modeling of non-rigid objects [5]; it is assembled from nodal points connected by adjustable springs based on image features. The fundamental equation is

    m_i \ddot{x}_i + \gamma_i \dot{x}_i + g_i = f_i    (1)

where x_i is the position of node i, m_i and \gamma_i are the mass and damping coefficient of node i, f_i is the external force acting on node i, and g_i is the internal force on node i due to the springs connecting it to its neighboring nodes j. We take the extracted feature points as the mesh boundaries and apply this non-linear motion equation so that large movements converge quickly to the region of interest.
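
A minimal sketch of how such a node motion equation can be advanced in time for a 2D mesh is given below (the explicit time-stepping scheme, the step size and all constants are illustrative assumptions, not the exact solver used here):

    # Illustrative explicit integration of the node motion equation
    # m*x'' + gamma*x' + g = f; the constants and step size are assumed.
    import numpy as np

    def step_dynamic_mesh(x, v, f_ext, springs, m=1.0, gamma=0.5, c=1.0, dt=0.05):
        """One time step for a 2D mass-spring mesh.

        x, v    : (N, 2) arrays of node positions and velocities
        f_ext   : (N, 2) array of external (image-feature) forces
        springs : list of (i, j, rest_length) tuples
        """
        spring_force = np.zeros_like(x)            # equals -g_i in the notation above
        for i, j, rest in springs:
            d = x[j] - x[i]
            length = np.linalg.norm(d) + 1e-12
            f = c * (length - rest) * d / length   # Hooke force on node i, toward j if stretched
            spring_force[i] += f
            spring_force[j] -= f
        a = (f_ext + spring_force - gamma * v) / m # node accelerations
        v_new = v + dt * a
        x_new = x + dt * v_new                     # semi-implicit Euler position update
        return x_new, v_new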

(b) Fine adaptation: To overcome the convergence problems of the numerical solution of Equation 1, an energy-oriented mesh (EOM) method is developed to finely adjust the mesh. EOM moves the mesh in the direction of decreasing mesh energy instead of tracking the node velocities and accelerations. The node energy is the sum of the energies of the springs connected to the node. The strain energy E_ij stored in a spring ij is

    E_{ij} = \frac{1}{2} C (r - l)^2    (2)
where r is the length of the deformed spring, l is its natural length and C is the stiffness. For each node, only movements that decrease the node energy are allowed. The mesh gradually fits the image features, which makes the adaptation very accurate [14]. Figure 2 shows some examples of model adaptation.

[a] [b] [c] [d]
Figure 2. Extended dynamic mesh adaptation: (from left) (a) face image; (b) feature vertices; (c) fitting on the face; (d) adapted model.
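
The fine adaptation can be sketched as a greedy search that accepts only energy-decreasing node moves (the candidate step set and the stiffness value below are illustrative assumptions):

    # Greedy, energy-decreasing node adjustment using E_ij = 0.5*C*(r - l)^2.
    # The candidate offsets and the stiffness C are assumed for illustration.
    import numpy as np

    def node_energy(x, node, springs, C=1.0):
        """Sum of the strain energies of all springs attached to `node`."""
        e = 0.0
        for i, j, rest in springs:
            if node in (i, j):
                r = np.linalg.norm(x[j] - x[i])
                e += 0.5 * C * (r - rest) ** 2
        return e

    def refine_node(x, node, springs, step=0.5):
        """Try the 8-neighbour offsets and keep a move only if it lowers the energy."""
        orig = x[node].copy()
        best_e, best_pos = node_energy(x, node, springs), orig
        for dx in (-step, 0.0, step):
            for dy in (-step, 0.0, step):
                x[node] = orig + np.array([dx, dy])
                e = node_energy(x, node, springs)
                if e < best_e:
                    best_e, best_pos = e, x[node].copy()
        x[node] = best_pos                         # the node energy never increases
        return x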


3. Active texture detection

After the facial model is fitted to a face in a video sequence, the facial expressions are represented by a series of deformed facial models. As discussed in Section 1, a deformed facial model represents only the geometric structure of a facial expression, which mainly reflects the expression of the eye and mouth areas. This is not enough to represent a realistically vivid expression of a face. In order to reconstruct a lifelike expression, it is necessary to provide the texture information of the face corresponding to each individual expression. The fitted model gives the accurate locations of the texture areas. Twenty-seven fiducial points are defined on the face to form the textures of interest, as shown in Figure 3(a)(b).

[a] [b] [c] [d]
Figure 3. (from left) (a) Feature points defined on a human face; (b) An example of feature points extracted on a person's face; (c) textures of interest; (d) Normalization of textures of interest.

Since the size and shape of the textures of interest vary with different expressions, the adapted facial model must be warped to a standard shape and size (i.e., that of the first frame). This is a non-linear texture warping. After normalization, the subsequent faces with different expressions have the same shape as the first frame, but different texture. Since the fiducial points are extracted in the previous stage, the TOI areas can be determined from the geometric relationships of these fiducial points. Figure 3(c) shows the textures of interest formed from the fiducial point locations. Figure 3(d) shows the TOI in the wrinkle, eye and mouth areas normalized to the standard shape. To represent the facial texture efficiently, only the active textures (AT), i.e., the typical textures representing significant facial expressions, need to be transmitted. To extract the active textures, a temporal correlation thresholding algorithm is developed, in which an active texture shows a low correlation between the current frame and the first frame. The threshold value is obtained from five sets of training sequences, and the low temporal correlation distinguishes the active textures among the textures of interest. For example, a forehead wrinkle is detected as shown in Figure 4, in which the texture whose correlation value falls below the threshold (0.8) appears as an active wrinkle at the moment of ``surprise.'' The extracted active textures are represented by a composition of eigen textures for low bit rate transmission (detailed in [12] due to space limitations).

[correlation]
Figure 4. Temporal correlation of wrinkle textures between the first frame and the subsequent frames. The input video shows different expressions: initial natural expression, eyes closing, smiling, laughing, worry/sad, surprise.
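
A minimal sketch of the correlation test that flags a normalized TOI patch as active is given below (the zero-mean normalized correlation measure and the function names are assumptions made for illustration; the 0.8 threshold follows Figure 4):

    # Flag a normalized TOI patch as "active" when its correlation with the
    # first-frame patch falls below a threshold (0.8, as in Figure 4).
    import numpy as np

    def temporal_correlation(patch_k, patch_1):
        """Zero-mean normalized correlation between the current and first-frame patch."""
        a = patch_k.astype(np.float64).ravel()
        b = patch_1.astype(np.float64).ravel()
        a -= a.mean()
        b -= b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
        return float(np.dot(a, b) / denom)

    def is_active_texture(patch_k, patch_1, threshold=0.8):
        """A low correlation means the texture changed, i.e. it is active."""
        return temporal_correlation(patch_k, patch_1) < threshold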


4. Facial texture synthesis

Since the facial texture for each frame is synthesized from the first-frame image (showing, for example, a neutral expression) and the corresponding decoded active textures, simple texture mapping is not good enough for the final appearance: along the boundary of a wrinkle texture, a sharp brightness transition is clearly visible. To overcome this drawback, a temporal blending method is applied to smooth the texture boundary, in which the brightness of the active texture is linearly blended with the first-frame image along the boundary. The blending weight ranges from 0 to 1; its value is a function of the pixel position in the active texture. In the large central area of the texture, the weight is set to 1. In the transition area, the value decreases from 1 to 0 as the pixel gets closer to the boundary. The blended texture value is computed as follows:

    b_k(x, y) = w(x, y) f_k(x, y) + (1 - w(x, y)) f_1(x, y)    (3)

    w(x, y) = \frac{1}{r} \exp\left( -\frac{(x - x_e)^2 + (y - y_e)^2}{2\sigma^2} \right) + c    (4)

where b_k is the blended value of the k-th frame active texture at location (x, y), f_1 and f_k are the active texture values in frame 1 and frame k respectively, and w is the blending weight as depicted in Figure 5. (x_e, y_e) is a variable position on the edge (the borderline between the central area and the transition area), and r is a normalizing factor, set here to √(2π)σ.

[a:definedfeature]
Figure 5. The weight value selection for texture blending.

The response of the human visual system tends to ``overshoot'' around the boundary between regions of different intensity. This brightness perception makes areas of constant intensity appear as if they had varying brightness, the well-known Mach-band effect. To alleviate the Mach-band effect, the weight value in the transition area is determined by a Gaussian distribution centered at the variable position (x_e, y_e), which makes the texture fusion smoother and the intensity difference around the boundary region smaller. The variance σ and the constant c are adjustable; here they are set to σ = 1.0 and c = 0. Moreover, a 3×3 spatial low-pass filter is applied to smooth the border. This temporal blending plus spatial filtering approach makes the synthesized textures visually smoother; some examples are shown in Figure 6.

[a] [b] [c]
Figure 6. Close view of texture synthesis (nasolabial): (from left) original frame 25; without blending and filtering; with blending and filtering.
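
A minimal sketch of the blending and filtering steps is given below (the distance-to-border weight map, the transition width and the uniform 3×3 filter are illustrative choices made here; the Gaussian fall-off with σ = 1.0 and the linear blend follow the description above):

    # Blend a decoded active texture into the first-frame texture with a
    # Gaussian fall-off near the patch border, then smooth with a 3x3 filter.
    # The transition-area width (margin) is an assumed value.
    import numpy as np

    def gaussian_weight_map(h, w, margin=4, sigma=1.0):
        """Weight 1 in the central area, decaying toward 0 at the patch border."""
        weights = np.ones((h, w))
        for y in range(h):
            for x in range(w):
                d = min(x, y, w - 1 - x, h - 1 - y)      # distance to the border
                if d < margin:                           # transition area
                    weights[y, x] = np.exp(-((margin - d) ** 2) / (2.0 * sigma ** 2))
        return weights

    def blend_and_filter(f1, fk, margin=4, sigma=1.0):
        """b_k = w * f_k + (1 - w) * f_1, followed by a 3x3 mean filter."""
        w = gaussian_weight_map(*fk.shape, margin=margin, sigma=sigma)
        b = w * fk + (1.0 - w) * f1
        out = b.copy()
        for y in range(1, b.shape[0] - 1):               # simple 3x3 spatial low-pass filter
            for x in range(1, b.shape[1] - 1):
                out[y, x] = b[y - 1:y + 2, x - 1:x + 2].mean()
        return out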

5. Experimental results

A video sequence (Mario) showing different expressions is the input for our experiment, with a resolution of 509×461 pixels and 2954 vertices on the 3D face model. The results of model adaptation, texture normalization and active texture detection are shown in Figure 7(*). To compress the detected active textures, a principal component transformation is applied to map each active texture into the eigen space. Since the nine active areas on the face have small dimensions (e.g., 50×90 for the forehead, 30×30 for the glabella, 80×40 for the nasolabial area, etc.), a small number of eigen textures is sufficient to describe the principal characteristics of each active texture: 12 eigen textures are selected for each AT in the crow's-feet and glabella areas, and 25 eigen textures for each of the remaining areas. Thus, if all nine TOI are active, 186 coefficients (= 25×6 + 12×3) need to be transmitted; each coefficient can be quantized to 5 bits, so the data for each frame amount to about 116 bytes, which yields an SNR of 34.6 dB. At a 30 Hz frame rate, the bit cost for texture update is about 27 kb/s. In practice, the texture update does not occur for every TOI, nor on every frame, so the bit cost is much lower than this maximum estimate. If the animation parameters are included, the total bit rate is less than 40 kb/s. Compared with the full face texture update approach [9], where each frame (600×600 pixels) produces 1632 bytes of data with an SNR of 35.5 dB (a bit rate over 200 kb/s), the computational load is greatly reduced and the coding efficiency is clearly improved by our active texture update method.
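
The worst-case bit budget follows directly from the eigen-texture counts quoted above (the figures assume all nine areas are active in every frame):

    # Worst-case bit budget for the active-texture update (all nine TOI active).
    coeffs = 25 * 6 + 12 * 3                     # 186 eigen-texture coefficients per frame
    bits_per_coeff = 5
    bits_per_frame = coeffs * bits_per_coeff     # 930 bits, about 116 bytes
    bitrate_kbps = bits_per_frame * 30 / 1000.0  # about 27.9 kb/s at a 30 Hz frame rate
    print(coeffs, bits_per_frame // 8, round(bitrate_kbps, 1))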

[1a] [1b] [1c] [1d]
[2a] [2b] [2c] [2d]
[3a] [3b] [3c] [3d]
[4a] [4b] [4c] [4d]
[5a] [5b] [5c] [5d]
Figure 7. Facial texture synthesis. Row 1: fitted model; Row 2: original texture; Row 3: normalized texture; Row 4: updated textures of interest; Row 5: synthesized face. (From left to right: frames 20, 32, 42, 62 of the video Mario.)

To compose the texture smoothly, the decoded textures are repaired in the transition border area using the temporal blending and spatial filtering technique; Figure 7 (Row 5) shows the synthesized faces. The synthesis with texture updating and blending is much better than the one without (see Figure 8 and Table 1 for a comparison). Notice in Figure 8 that the third face from the left, the one without texture update, does not capture the smiling expression as accurately as the rightmost one, the one with texture update and blending. Also, patches around the cheeks are evident in the fourth face from the left, the one without blending.

[1] [2] [3] [4] [5]
Figure 8. (from left) (1) original frame 1; (2) original frame 15; (3) synthesized frame 15 without wrinkle texture update (only the frame 1 texture is used); (4) synthesized frame 15 with wrinkle texture update but without blending of the borders; (5) synthesized frame 15 with texture update and blending.

Scheme                  Subjective rating   SNR (dB)
No update               2/5 (poor)          25.7 (average)
Update, no blending     3/5 (fair)          33.1 (average)
Update + blending       4/5 (good)          34.6 (average)

Table 1: Comparison of texture update and blending.


6. Conclusion

In this paper we proposed a partial facial texture update and synthesis strategy for model-based coding. The experimental results demonstrate the advantages of our scheme over the entire-texture update method; the texture synthesis using temporal blending and spatial filtering is effective in producing perceptually smooth results. In the future, model adaptation under large movements (e.g., rotation) and the corresponding texture synthesis strategy will be investigated. The computational load also needs to be further reduced so that a real-time system can be realized.



End notes

(*) Please see http://www.cs.ualberta.ca/~anup/MPEG4/Demo.html for color demo clips (video) on various parts of our implementation.

Bibliography

[1] K. Aizawa and T. S. Huang, "Model-based image coding: Advanced video coding techniques for very low bit-rate applications", Proceedings of the IEEE, 83(2):259-271, 1995.
[2] P. Ellis, "Recognizing faces", British Journal of Psychology, 66(4):409-426, 1975.
[3] B. Guenter, C. Grimm, D. Wood, H. Malvar and F. Pighin, "Making faces", SIGGRAPH 98, pp. 55-66, Orlando, FL, July 1998.
[4] C. Choi, K. Aizawa, H. Harashima and T. Takebe, "Analysis and synthesis of facial image sequences in model-based image coding", IEEE Transactions on Circuits and Systems for Video Technology, 4(6):257-275, June 1994.
[5] D. Terzopoulos and M. Vasilescu, "Sampling and reconstruction with adaptive meshes", Proceedings of the 1991 IEEE International Conference on Computer Vision and Pattern Recognition (CVPR'91), pp. 70-75, 1991.
[6] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski and D. Salesin, "Synthesizing realistic facial expressions from photographs", SIGGRAPH 98, pp. 75-84, Orlando, FL, July 1998.
[7] H. H. Chen, T. Ebrahimi, G. Rajan, C. Horne, P. K. Doenges and L. Chiariglione, "Special issue on synthetic/natural hybrid video coding", IEEE Transactions on Circuits and Systems for Video Technology, 9(2), March 1999.
[8] I. A. Essa and A. Pentland, "Coding, analysis, interpretation, and recognition of facial expressions", IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):757-763, 1997.
[9] J. Strom, F. Davoine, J. Ahlberg, H. Li and R. Forchheimer, "Very low bit rate facial texture coding", International Workshop on Synthetic-Natural Hybrid Coding and Three Dimensional Imaging (IWSNHC3DI'97), pp. 237-240, Rhodes, Greece, September 1997.
[10] K. Nishino, Y. Sato and K. Ikeuchi, "Eigen-texture method: Appearance compression based on 3D model", Proceedings of the 1999 IEEE International Conference on Computer Vision and Pattern Recognition (CVPR'99), Fort Collins, CO, 1999.
[11] R. Chellappa, C. Wilson and A. Sirohey, "Human and machine recognition of faces: A survey", Proceedings of the IEEE, 83(5):705-740, 1995.
[12] L. Yin and A. Basu, "Generating realistic facial expressions with wrinkles for model-based coding", Technical Report, ftp://ftp.cs.ualberta.ca/pub/lijun/TechReport99_faceexpr.ps.gz, Department of Computing Science, University of Alberta, 1999.
[13] L. Yin and A. Basu, "Integrating active face tracking with model based coding", Pattern Recognition Letters, 20(6):651-657, 1999.
[14] L. Yin and A. Basu, "Realistic animation using extended adaptive mesh for model based coding", 2nd International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR'99), pp. 189-201, Springer-Verlag Lecture Notes in Computer Science, York, UK, July 1999.