IEEE International Conference on Multimedia & Expo 2000
New York, New York
Electronic Proceedings
© 2000 IEEE

Partial Update of Active Textures for Efficient Expression Synthesis in Model-Based Coding

Lijun Yin and Anup Basu
Department of Computing Science
University of Alberta, Edmonton, AB, T6G 2H1, Canada
Email: {lijun, anup}@cs.ualberta.ca
http://www.cs.ualberta.ca/~lijun
http://www.cs.ualberta.ca/~anup
Abstract
An image-based facial expression synthesis method is presented in this
paper for realistic face animation in model-based coding. Differing
from the conventional whole-texture update method, we propose a partial
texture update scheme restricted to Active Texture areas, which reduces
the bitrate efficiently. Facial expressions are synthesized using temporal
blending and spatial filtering so that a smooth texture fusion can be achieved.
Active texture extraction, compression and composition are the principal
steps in realizing the proposed scheme. Experiments on video
sequences demonstrate the advantage of the proposed algorithm: low-cost
transmission with high-fidelity reproduction. Life-like expressive
human faces with wrinkles are generated at a bitrate of less than 40 kb/s.
1. Introduction
Image-based texture synthesis has recently attracted increasing attention as
a means of creating realistic objects. Several works have been reported in the
context of the SNHC group formed within MPEG-4 [7][1], and in
multi-texture mapping and eigen-texture analysis [10][11].
One of its important applications is the generation of realistic facial
expressions [3][6]. Vivid facial
textures, such as furrows, are very difficult to capture and model.
The widely used facial action coding system (FACS) has also been shown
to have difficulty producing a large variety of complicated facial
expressions [8]. Given these limitations of current
computer graphics technology, facial texture extraction and update are
the most practical choice for model-based coding.
However, updating the entire texture with traditional waveform encoding
methods (e.g., JPEG, MPEG) is expensive, requiring high
bit rates. Strom et al. [9] coded the
entire face texture using the classical eigenface approach, while
Guenter et al. [3] produced a face by capturing
six camera views at high resolution. Both approaches generate impressive
image quality; however, they require a high bitrate (>200 kb/s), high
computational complexity and large storage space for textures. Choi et al. [4]
encoded the texture in two parts, orthonormalization for the global texture
and simple quantization for the local texture, achieving a very low
transmission cost. However, the local texture is updated at a constant rather
than a dynamic rate, and neither feature detection nor active texture
detection is addressed.
In this paper, we propose an efficient method to detect and update the
textures of interest in active areas, and to synthesize the facial
expressions from these captured textures and the first-frame texture using
a texture blending technique. Psychology research shows that a significant
contribution to a realistic facial expression comes not only from the facial
organs (i.e., eyes, mouth), but also from the facial wrinkles generated
by the expressions [2]. The most significant
wrinkles on a human face are categorized into four types: forehead, glabella,
crow's-feet and nasolabial, as shown in Figure
3(a). Textures of interest (TOI) are defined over these four types of
wrinkle areas and the mouth-eye areas, as shown in Figure
3(c). Among the TOI, only the active textures (AT), which are generated
along with the different expressions, are used for image synthesis. After
model adaptation, active texture extraction and compression, the facial
expression can be generated by composing the AT with the first-frame texture
using a temporal blending and spatial filtering technique. The system is
outlined in Figure 1.
Figure 1. Flow chart of the system composition.
In Section 2, facial model adaptation is
described. Section 3 explains AT detection
and compression. Face texture synthesis using a blending technique is described
in Section 4, followed by experimental results
in Section 5. Concluding remarks are given
in Section 6.
2. Face model adaptation
In order to obtain the textures of interest, a 3D wireframe model is first
matched onto an individual face to track the motion of facial expressions.
A new dynamic mesh matching algorithm is employed, consisting of two stages:
2.1 Feature detection
The feature vertices of the model correspond to a number of feature points
on a face, located on the contours of the eyes, mouth, eyebrows, nostrils,
nose sides and head silhouette. An active tracking algorithm for the
head silhouette and deformable template matching using color (saturation,
hue) information for the organs' contours were developed in [13];
the eyebrow contours can be obtained by the integral projection method, and
the nostrils, nose sides and shape are estimated by statistical correlation
template matching [12]. Figure
2(b) shows the feature vertices of a model which correspond to the
feature points detected on the face.
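As an illustration of the integral projection step, the following Python sketch locates the eyebrow row in a cropped eye region by summing intensities along image rows. The function names and the simple darkest-band heuristic are our own assumptions, not the exact procedure of [13][12]:

```python
import numpy as np

def integral_projections(gray):
    """Row and column intensity sums of a grayscale region."""
    h_proj = gray.sum(axis=1)   # one value per image row
    v_proj = gray.sum(axis=0)   # one value per image column
    return h_proj, v_proj

def locate_eyebrow_row(eye_region):
    """Hypothetical helper: the eyebrow shows up as the darkest
    horizontal band in the upper half of the region, i.e. the
    minimum of the row projection there."""
    h_proj, _ = integral_projections(eye_region.astype(float))
    upper = h_proj[: len(h_proj) // 2]
    return int(np.argmin(upper))
```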
2.2 Coarse-to-fine dynamic mesh adaptation
(a) Coarse adaptation: The dynamic mesh (DM) is a well-known approach
for adaptive sampling of images and physically based modeling of non-rigid
objects [5]. It can be assembled from nodal
points connected by adjustable springs based on the features of images.
The fundamental equation is

$m_i \ddot{x}_i + \gamma_i \dot{x}_i + g_i = f_i$    (1)

where $x_i$ is the position of node $i$, $m_i$ and $\gamma_i$ are the nodal
mass and damping coefficient, $f_i$ is the external force acting on node $i$,
and $g_i$ is the internal force on node $i$ due to the springs connecting it
to its neighboring nodes $j$. We take the extracted feature points as the
mesh boundaries, and apply the above non-linear motion equation to make
the large movement converge quickly to the region of interest.
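A minimal sketch of how Equation 1 can be integrated numerically is given below, assuming scalar mass and damping, Hookean springs with natural lengths `rest_len[i][j]`, and an image-derived force callback. The explicit-Euler integrator, time step and force model are our illustrative choices, not necessarily those of the original implementation:

```python
import numpy as np

def coarse_adapt(x, v, neighbors, rest_len, external_force,
                 mass=1.0, damping=0.5, stiffness=1.0,
                 dt=0.05, steps=200):
    """Explicit-Euler integration of Eq. (1):
    m x_i'' + gamma x_i' + g_i = f_i, where g_i sums the
    Hookean spring forces from the neighbors of node i."""
    for _ in range(steps):
        g = np.zeros_like(x)
        for i, nbrs in enumerate(neighbors):
            for j in nbrs:
                d = x[i] - x[j]
                length = np.linalg.norm(d) + 1e-12
                # spring force pulling the edge back to its rest length
                g[i] += stiffness * (length - rest_len[i][j]) * d / length
        f = external_force(x)                 # image-derived force field
        a = (f - damping * v - g) / mass      # solve Eq. (1) for x''
        v = v + dt * a
        x = x + dt * v
    return x
```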
(b) Fine adaptation: To overcome the convergence problem of the numerical
solution of Equation 1, an energy-oriented mesh
method (EOM) is developed to finely adjust the mesh. EOM moves the mesh
in the direction of decreasing mesh energy instead of relying on decreasing
node velocities and accelerations. The node energy is the sum of the energies
of the springs connected to the node. The strain energy $E_{ij}$
stored in a spring $ij$ is

$E_{ij} = \frac{1}{2} C (r - l)^2$    (2)

where $r$ is the length of the deformed spring, $l$ is its natural
length, and $C$ is the stiffness. For each node, only movements that
decrease the node energy are allowed. The mesh gradually fits the image
features, making the adaptation very accurate [14].
Figure 2 shows some examples of model adaptation.
Figure 2. Extended dynamic mesh adaptation. (from left)
(a) face image; (b) feature vertices; (c) fitting on face; (d) adapted model.
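The following sketch illustrates the energy-decreasing rule of the fine-adaptation stage under our own simplifying assumptions: pure spring energy per Eq. (2) and a fixed four-direction candidate move set, whereas the actual EOM also accounts for image features:

```python
import numpy as np

def node_energy(i, x, neighbors, rest_len, C=1.0):
    """Sum of the strain energies, Eq. (2), of the springs at node i."""
    e = 0.0
    for j in neighbors[i]:
        r = np.linalg.norm(x[i] - x[j])
        e += 0.5 * C * (r - rest_len[i][j]) ** 2
    return e

def eom_refine(x, neighbors, rest_len, step=0.5, sweeps=20):
    """Greedy refinement: try a small move of each node in turn and
    keep it only if the node energy decreases."""
    moves = step * np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], float)
    for _ in range(sweeps):
        for i in range(len(x)):
            e0 = node_energy(i, x, neighbors, rest_len)
            for m in moves:
                x[i] += m
                if node_energy(i, x, neighbors, rest_len) < e0:
                    break          # keep the energy-decreasing move
                x[i] -= m          # otherwise undo the move
    return x
```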
3. Active texture detection
After the facial model is fitted onto a face in a video sequence, the facial
expressions are represented by a series of deformed facial models. As
discussed in Section 1, a deformed facial model
represents only the geometric structure of a facial expression, which mainly
reflects the expression of the eye and mouth areas. This is not enough
to represent a realistically vivid expression. In order to reconstruct
a lifelike expression of a face, it is necessary to provide the texture
information of the face corresponding to the individual expression. The
fitted model accurately delineates the texture areas. Twenty-seven
fiducial points are defined on the face for forming the textures of interest,
as shown in Figure 3(a)(b).
Figure 3. (from left) (a) Feature points defined on a human
face; (b) An example of feature points extracted on a person's face; (c)
textures of interest; (d) Normalization of textures of interest.
Since the size and shape of the textures of interest vary with different
expressions, the adapted facial model must be warped to a standard shape and
size (i.e., that of the first frame). This is a non-linear texture warping.
After the normalization, the subsequent faces with different expressions have
the same shape as the first frame, but different texture. Since the fiducial
points are extracted in the previous stage, the TOI areas can be determined
by the geometric relationship of these fiducial points. Figure
3(c) shows the textures of interest formed from the fiducial point
locations. Figure 3(d) shows the TOI in the
wrinkle, eye and mouth areas, normalized to the standard shape.
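As a simplified illustration of the normalization step, the sketch below warps one TOI patch to the first-frame shape with a per-patch affine map fitted to the fiducial points. The paper's warp is non-linear over the whole face; the affine-per-patch fit and nearest-neighbor sampling are our simplifications:

```python
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Least-squares affine map sending source fiducials to
    destination fiducials."""
    n = len(src_pts)
    M = np.hstack([np.asarray(src_pts, float), np.ones((n, 1))])
    # Solve M @ P = dst for the 3x2 parameter matrix P
    P, *_ = np.linalg.lstsq(M, np.asarray(dst_pts, float), rcond=None)
    return P

def normalize_patch(img, P_inv, out_h, out_w):
    """Backward mapping: for each pixel of the normalized patch,
    sample the source image (nearest neighbor for brevity).
    P_inv maps normalized to source coordinates, i.e.
    fit_affine(standard_pts, current_pts)."""
    out = np.zeros((out_h, out_w), dtype=img.dtype)
    for v in range(out_h):
        for u in range(out_w):
            x, y = np.array([u, v, 1.0]) @ P_inv
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < img.shape[0] and 0 <= xi < img.shape[1]:
                out[v, u] = img[yi, xi]
    return out
```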
To represent the facial texture efficiently, only the active textures
(AT), i.e., the textures representing significant facial expression,
need to be transmitted. To extract the active textures, a temporal correlation
thresholding algorithm is developed, in which an active texture shows
a low correlation value between the current frame and the first frame.
The threshold value is obtained from five sets of training sequences.
The low temporal correlation value distinguishes the active textures among
the textures of interest. For example, the forehead wrinkle is detected as
shown in Figure 4, in which
the texture whose correlation value is less than the threshold (0.8) appears
as the active wrinkle at the time of "surprise". The extracted active
texture can be represented by a composition of eigen textures for low
bitrate transmission (detailed in [12]; omitted here due to
space limitations).
Figure 4. Temporal correlation of wrinkle textures between
the first frame and the subsequent frames. The input video shows different
expressions: initial natural expression, eyes closing, smiling, laughing,
worry/sad, surprise.
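A minimal sketch of the temporal correlation thresholding is given below, assuming normalized cross-correlation as the similarity measure; the exact correlation definition and per-area thresholds come from the training sequences, and the 0.8 default is the paper's forehead example:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equal-size patches."""
    a = a.astype(float).ravel()
    b = b.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return float(a @ b / denom)

def active_textures(toi_first, toi_current, threshold=0.8):
    """Mark a texture of interest as active when its correlation
    with the corresponding first-frame patch drops below threshold."""
    return {name: patch
            for name, patch in toi_current.items()
            if ncc(toi_first[name], patch) < threshold}
```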
4. Facial texture synthesis
Since the facial texture for each frame is synthesized from the first-frame
image (showing, for example, a natural expression) and the corresponding
decoded active textures, simple texture mapping is not good enough for the
final appearance; in particular, a sharp brightness transition is clearly
visible at the boundary of a wrinkle texture. To overcome this drawback,
a temporal blending method is applied to smooth the texture boundary, in
which the brightness of the active texture is linearly blended with the
image of the first frame near the boundary. The blending weight ranges from
0 to 1, its value being a function of the pixel position within the active
texture. In the large central area of the texture, the weight is set to
1. In the transition area the value decreases from 1 to 0 as the pixel
gets closer to the boundary. The blended texture value is computed as

$b_k(x, y) = w(x, y)\, f_k(x, y) + [1 - w(x, y)]\, f_1(x, y)$    (3)

where $b_k$ is the blended value of the $k$th frame
active texture at location $(x, y)$; $f_1$ and $f_k$
are the active texture values in frame 1 and frame $k$, respectively; $w$ is
the blending weight as depicted in Figure 5; $(x_e,
y_e)$ is a variable position on the edge (the borderline between the
central area and the transition area); and $r$ is a factor for normalizing
the values, here set to $r = \sqrt{2\pi}\,\sigma$.
Figure 5. The weight value selection for texture blending.
The response of the human visual system tends to "overshoot" around the
boundary between regions of different intensity. The result of this brightness
perception is to make areas of constant intensity appear as if they had
varying brightness. This is the well-known Mach-band effect. To
alleviate the Mach-band effect, in the transition area the weight value
is determined by a Gaussian distribution centered at variable positions
$(x_e, y_e)$, which makes the texture fusion smoother and the
intensity difference around the boundary region smaller. The variance $\sigma$
and a constant $c$ are adjustable; they are set here to $\sigma = 1.0$
and $c = 0$. Moreover, a spatial low-pass filter with a 3×3 template
is applied to smooth the border. This temporal blending plus spatial filtering
approach makes the synthesized textures visually smoother; some examples
are shown in Figure 6.
Figure 6. Close view of texture synthesis (nasolabial):
(from left) original frame 25; without blending and filtering; with blending and filtering.
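The following sketch combines Eq. (3), the Gaussian transition weight and the 3×3 low-pass filter. Parameterizing the weight by distance to the nearest edge point is our reading of Figure 5, so treat it as an assumption:

```python
import numpy as np

def gaussian_weight(dist_to_edge, sigma=1.0, c=0.0):
    """Transition-area blending weight: a Gaussian of the distance
    to the nearest edge point (x_e, y_e), equal to 1 on the edge and
    falling toward 0 at the boundary; sigma=1.0 and c=0 as in the
    paper (which also uses r = sqrt(2*pi)*sigma as a normalizer)."""
    return np.exp(-dist_to_edge ** 2 / (2.0 * sigma ** 2)) + c

def blend(f1, fk, w):
    """Eq. (3): temporal blend of first-frame and current textures."""
    return w * fk + (1.0 - w) * f1

def box3x3(img):
    """3x3 spatial low-pass (mean) filter to smooth the border."""
    f = img.astype(float)
    out = f.copy()
    out[1:-1, 1:-1] = sum(
        f[1 + dy:f.shape[0] - 1 + dy, 1 + dx:f.shape[1] - 1 + dx]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
    return out
```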
5. Experimental results
A video sequence (Mario) showing different expressions is the input for our
experiment, with a resolution of 509×461 pixels and 2954 vertices on the
3D face model. The results of model adaptation, texture normalization and
active texture detection are shown in Figure
7(*). To compress the detected active
textures, the principal component transformation is applied to map each
active texture into the eigen space. Since the nine active areas on a face
have small dimensions (e.g., 50×90 for the forehead, 30×30 for the glabella,
80×40 for the nasolabial area, etc.), a small number of eigen textures is
sufficient to describe the principal characteristics of each active texture:
12 eigen textures are selected for each AT in the crow's-feet and glabella
areas, and 25 eigen textures for each of the remaining areas. Thus, if all
nine TOI are active, 186 coefficients (25×6 + 12×3) need to be transmitted;
each coefficient can be quantized to 5 bits, so the data for each frame
amount to about 116 bytes, which yields an SNR of 34.6 dB. At a 30 Hz frame
rate, the bit cost for texture update is about 27 kb/s. In practice, the
texture update does not occur for every TOI, nor in every frame; therefore,
the bit cost is much less than this maximum estimate. If the animation
parameters are counted, the total bit rate is less than 40 kb/s. Compared
with the full face texture update approach [9],
where each frame (600×600 pixels) produces 1632 bytes of data with an SNR of
35.5 dB (bitrate over 200 kb/s), the computational load is greatly reduced
and the coding efficiency is clearly improved by our
active texture update method.
Figure 7. Facial texture synthesis: Row 1: fitted model; Row 2: original texture;
Row 3: normalized texture; Row 4: updated textures of interest; Row 5: synthesized face;
(from left to right: frames 20, 32, 42, 62 of the video Mario)
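A minimal sketch of the eigen-texture compression described above, with the paper's coefficient counts used to reproduce the bit-cost arithmetic; the training-set layout and the simple quantization are our assumptions:

```python
import numpy as np

def fit_eigentextures(training, k):
    """PCA basis for one active area: rows of `training` are
    flattened texture patches from the training sequences."""
    mean = training.mean(axis=0)
    # SVD of the centered data; rows of Vt are the eigen textures
    _, _, Vt = np.linalg.svd(training - mean, full_matrices=False)
    return mean, Vt[:k]

def encode(patch, mean, basis):
    return basis @ (patch - mean)     # k coefficients per frame

def decode(coeffs, mean, basis):
    return mean + coeffs @ basis

# Worst-case bit cost at the paper's settings: 25 coefficients for
# each of six areas, 12 for the three crow's-feet/glabella areas,
# 5 bits per coefficient:
bits = (25 * 6 + 12 * 3) * 5          # = 930 bits, about 116 bytes
print(bits, bits * 30 / 1000)         # ~27.9 kb/s at 30 Hz
```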
To compose the texture smoothly, the decoded textures are repaired in the
transition border area using the temporal blending and spatial filtering
technique; Figure 7 (Row 5) shows the synthesized
faces. The synthesis with texture updating and blending is much better
than the one without
(see Figure 8 and
Table 1 for a comparison).
Notice in Figure 8 that the third face from the left,
the one without texture update, does not capture the smiling expression
as accurately as the rightmost one, which uses texture update and blending.
Also, patches around the cheeks are evident in the fourth face from the
left, the one without blending.
Figure 8. (from left) (1) original frame 1; (2) original frame 15;
(3) synthesized frame 15 without wrinkle texture update (only the frame 1 texture used);
(4) synthesized frame 15 with wrinkle texture update but without blending of the borders;
(5) synthesized frame 15 with texture update and blending.
| Scheme              | Subjective rating | SNR (dB)       |
| No update           | 2/5 (poor)        | 25.7 (average) |
| Update, no blending | 3/5 (fair)        | 33.1 (average) |
| Update + blending   | 4/5 (good)        | 34.6 (average) |
Table 1: Comparison of texture update and blending
6. Conclusion
In this paper we proposed a partial facial texture update and synthesis
strategy for model-based coding. The experimental results demonstrate the
advantages of our scheme over the entire-texture update method; the texture
synthesis using temporal blending and spatial filtering is effective in
producing perceptually smooth results. In the future, model adaptation
under large movements (e.g., rotation) and the corresponding texture
synthesis strategy will be investigated. The computational load also needs
to be further reduced so that a real-time system can be realized.
End notes
(*) Please see
http://www.cs.ualberta.ca/~anup/MPEG4/Demo.html
for color demo clips (video) on various parts of our implementation.
Bibliography

[1] K. Aizawa and T. S. Huang, "Model-based image coding: Advanced video coding techniques for very low bit-rate applications", Proceedings of the IEEE, 83(2):259-271, 1995.

[2] H. Ellis, "Recognizing faces", British Journal of Psychology, 66(4):409-426, 1975.

[3] B. Guenter, C. Grimm, D. Wood, H. Malvar and F. Pighin, "Making faces", SIGGRAPH 98, pp. 55-66, Orlando, FL, July 1998.

[4] C. Choi, K. Aizawa, H. Harashima and T. Takebe, "Analysis and synthesis of facial image sequences in model-based image coding", IEEE Transactions on Circuits and Systems for Video Technology, 4(3):257-275, June 1994.

[5] D. Terzopoulos and M. Vasilescu, "Sampling and reconstruction with adaptive meshes", Proceedings of the 1991 IEEE International Conference on Computer Vision and Pattern Recognition (CVPR'91), pp. 70-75, 1991.

[6] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski and D. Salesin, "Synthesizing realistic facial expressions from photographs", SIGGRAPH 98, pp. 75-84, Orlando, FL, July 1998.

[7] H. H. Chen, T. Ebrahimi, G. Rajan, C. Horne, P. K. Doenges and L. Chiariglione, "Special issue on synthetic/natural hybrid video coding", IEEE Transactions on Circuits and Systems for Video Technology, 9(2), March 1999.

[8] I. A. Essa and A. Pentland, "Coding, analysis, interpretation, and recognition of facial expressions", IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):757-763, 1997.

[9] J. Strom, F. Davoine, J. Ahlberg, H. Li and R. Forchheimer, "Very low bit rate facial texture coding", International Workshop on Synthetic-Natural Hybrid Coding and Three Dimensional Imaging (IWSNHC3DI'97), pp. 237-240, Rhodes, Greece, September 1997.

[10] K. Nishino, Y. Sato and K. Ikeuchi, "Eigen-texture method: Appearance compression based on 3D model", Proceedings of the 1999 IEEE International Conference on Computer Vision and Pattern Recognition (CVPR'99), Fort Collins, CO, 1999.

[11] R. Chellappa, C. Wilson and A. Sirohey, "Human and machine recognition of faces: a survey", Proceedings of the IEEE, 83(5):705-740, 1995.

[12] L. Yin and A. Basu, "Generating realistic facial expressions with wrinkles for model based coding", Technical Report, ftp://ftp.cs.ualberta.ca/pub/lijun/TechReport99_faceexpr.ps.gz, Department of Computing Science, University of Alberta, 1999.

[13] L. Yin and A. Basu, "Integrating active face tracking with model based coding", Pattern Recognition Letters, 20(6):651-657, 1999.

[14] L. Yin and A. Basu, "Realistic animation using extended adaptive mesh for model based coding", 2nd International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR'99), pp. 189-201, Springer-Verlag Lecture Notes in Computer Science, York, UK, July 1999.