Talking Heads: Facial Animation
Kinematics-Based Synthesis of Realistic Talking Faces


AUDIBLE-VISIBLE SYNTHESIS

    The kinematics-based method of animating talking faces entails three principal steps:
data acquisition, analysis, and animation.

Data Acquisition
   Two types of data are currently required: time-varying facial motion and a static representation of the 3D head plus its video texture map. These are shown in Figure 1.

Figure 1. Basic head data for the two measurement systems:
  a. Marker positions for recording head and facial motion data;
  b. Original full-head mesh from a static 3D scan.

Facial Motion And Acoustics
   Three-dimensional position data were recorded optoelectronically (OPTOTRAK) for 18 orofacial locations marked with infrared-emitting diodes (ireds; see Figure 1a) during recitations of excerpts from a Japanese children's story (Momotarou) by a male Japanese speaker. Position measures were digitized at 60 Hz along with simultaneous recording of the speech acoustics at 10 kHz. Since head motion is normally large relative to facial motion, its effects on the absolute position of the facial measures must be removed. Therefore, rigid body head motion was also measured using five ireds attached to a lightweight appliance worn on the head (see Figure 1a). A quaternion method [13] was used to decompose the head motion into its six rotations and translations and to calculate the independent motion of the facial markers [for processing details, see 14].
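   To illustrate this step, the following is a minimal sketch (Python/NumPy; all names are ours) of removing rigid head motion from the facial markers. It uses an SVD-based least-squares rigid fit in place of the quaternion method of [13]; both recover the same rotation and translation, which is then inverted to express the facial markers in a head-fixed frame.

    import numpy as np

    def rigid_fit(ref, cur):
        """Least-squares rigid transform (R, t) with cur ~= R @ ref + t.

        ref, cur: (M, 3) corresponding head-marker positions. An SVD (Kabsch)
        solution is used here instead of the quaternion method of [13].
        """
        ref_c, cur_c = ref.mean(axis=0), cur.mean(axis=0)
        H = (ref - ref_c).T @ (cur - cur_c)            # 3x3 cross-covariance
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        return R, cur_c - R @ ref_c

    def remove_head_motion(head_markers, face_markers):
        """Express facial markers in a head-fixed frame, frame by frame.

        head_markers: (T, 5, 3) rigid-body (head) marker trajectories
        face_markers: (T, 18, 3) facial marker trajectories
        """
        ref = head_markers[0]                          # first frame defines the head frame
        out = np.empty_like(face_markers)
        for i, (head, face) in enumerate(zip(head_markers, face_markers)):
            R, t = rigid_fit(ref, head)                # head pose at this frame
            out[i] = (face - t) @ R                    # row-wise R.T @ (p - t)
        return out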

Figure 2. Eight 3D faces extracted from full-head scans during sustained production of:
        a. the five Japanese vowels /a, i, u, e, o/; and
        b. three non-speech postures: open mouth, relaxed closed mouth, and clenched closed mouth.

Static Face Scans
   A set of eight full-head 3D scans and video texture maps covering a range of speech and non-speech orofacial configurations (see Figure 2) was obtained with a laser range scanner (Cyberware, Inc.). The set consisted of static configurations for the five Japanese vowels (/a, i, u, e, o/) and three non-speech configurations: mouth wide open, mouth closed with teeth clenched, and mouth closed but relaxed. Scan resolution was 512 x 512 pixels; the average resolution of each extracted face is somewhat less than 300 x 300, containing 71,100 nodes and 141,900 polygons. Feature contours for the eyes, nose, jaw, and lips (inner and outer contours) are identified for each scan, along with the 18 positions approximating the placement of the ired markers. Node coordinates are converted from the scanner's cylindrical coordinates (range as a function of angle and height) to Cartesian 3D (x, y, z).
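   For reference, the coordinate conversion can be sketched as follows (Python/NumPy; the choice of which Cartesian axis is vertical is an assumption and depends on the scanner's convention):

    import numpy as np

    def cyl_to_cart(r, theta, h):
        """Convert cylindrical range-scan samples (r, theta, height) to Cartesian 3D.

        r: radial distance from the scan axis, theta: longitude in radians,
        h: height along the axis. Here y is taken as the vertical axis.
        """
        return np.stack([r * np.sin(theta), h, r * np.cos(theta)], axis=-1)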

Analysis
   The analysis techniques outlined below entail a generic mesh and its feature contours for mesh initialization, field morphing for mesh adaptation, and multilinear techniques for extracting control parameters from the scanned face data.

Figure 3. Mesh adaptation entails matching feature contours of the generic mesh (a) with features for each scanned face (b). Generic mesh nodes are adjusted to match position measurement locations (c), and then the texture map is applied.

Face And Lip Mesh Adaptation
   A generic mesh for the face (exclusive of the lips) consisting of only N = 576 nodes and 844 polygons is used to reduce the computational complexity of the original 3D scans. As can be seen in Figure 3a, nodes are most heavily concentrated periorally, along the nose, and especially around the eyes, but are fairly sparsely distributed elsewhere. The feature contours for the eyes, nose, jaw, and outer lip contour are identified on the mesh (Figure 3b). For each of the eight face scans, the mesh is aligned along the feature contours and its nodes are adjusted to match the 18 approximated marker positions. The remaining mesh nodes are then adjusted through field morphing [15], and the texture map is reattached (Figures 3b, c).
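   A minimal sketch of this adaptation step is given below, with one simplification: instead of the line-segment field morphing of [15], each feature point's displacement is spread over the mesh with inverse-distance weights. All names and parameters are illustrative.

    import numpy as np

    def adapt_mesh(generic_nodes, src_features, dst_features, power=2.0, eps=1e-9):
        """Warp generic mesh nodes so feature points move onto their scan targets.

        generic_nodes: (N, 3) generic mesh node coordinates
        src_features : (M, 3) feature/marker locations on the generic mesh
        dst_features : (M, 3) corresponding locations on the scanned face
        Inverse-distance weighting is used here as a simplified stand-in for
        the field morphing of [15].
        """
        disp = dst_features - src_features                 # (M, 3) feature displacements
        d = np.linalg.norm(generic_nodes[:, None, :] - src_features[None, :, :], axis=2)
        w = 1.0 / (d ** power + eps)                       # inverse-distance weights (N, M)
        w /= w.sum(axis=1, keepdims=True)                  # normalize per mesh node
        return generic_nodes + w @ disp                    # blend displacements onto nodes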
    Each adapted facial mesh is expressed as a column vector f of length 3N, containing the x, y, and z values of each 3D node. Since K = 8 facial meshes were made, the ensemble of adapted mesh nodes is arranged in matrix form as

F = [f1, f2, ... fK].    (1)

The "mean face" µ is then defined as the average value of each row of F, and subtracted from each column of F generating

FO = [fo1, fo2, ... foK].    (2)

the matrix of facial deformations from the mean face. Any facial shape can now be expressed by the sum

f = fo + µf.    (3)


   The outer and inner lip contours specified in each face scan are used to constrain a lip mesh consisting of 600 nodes and 1,100 polygons. Each contour consists of 40 nodes on the lip mesh. A third lip contour is linearly interpolated midway between the two original contours on the scanned surface. The lip mesh is then numerically generated using cubic spline interpolation of the orthogonal triplets of control points taken from the three contours. Currently, the lip mesh is attached to the face mesh at the border of the outer lip contour and is passively deformed by the deformation of the face mesh; it is therefore not included in the estimation of the mean face or in the subsequent principal component analysis (PCA).
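   A sketch of this lip-surface construction under the stated assumptions (Python/SciPy; the number of interpolated rows is an illustrative choice, not necessarily the value used for the 600-node mesh):

    import numpy as np
    from scipy.interpolate import CubicSpline

    def build_lip_surface(outer, inner, n_rows=15):
        """Interpolate lip-surface nodes between the outer and inner lip contours.

        outer, inner: (40, 3) contour nodes taken from a face scan. A middle
        contour is placed halfway between them, and each column of three
        control points is interpolated with a cubic spline.
        """
        mid = 0.5 * (outer + inner)                        # interpolated middle contour
        u = np.linspace(0.0, 1.0, n_rows)                  # sampling parameter across the lip
        surface = np.empty((n_rows, outer.shape[0], 3))
        for j in range(outer.shape[0]):                    # one spline per contour column
            ctrl = np.stack([outer[j], mid[j], inner[j]])  # (3, 3) control points
            surface[:, j, :] = CubicSpline([0.0, 0.5, 1.0], ctrl)(u)
        return surface                                     # (n_rows, 40, 3) node grid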

Facial PCA
   The principal components of F can be found by applying singular value decomposition (SVD) to the covariance matrix

Cf = FO FOt,    (4)
yielding
Cf = U S Ut.    (5)

U is a unitary matrix whose columns contain the eigenvectors of Cf, normalized to unit length. S is a diagonal matrix whose diagonal entries are the respective eigenvalues.

    Since the ensemble consists of eight facial shapes, only the first seven eigenvalues are greater than zero, and consequently only the first seven columns of U are meaningful. In fact, the first five eigenvectors account for more than 99% of the variance observed in the data. (Each eigenvalue in S denotes the variance accounted for by the respective eigenvector; thus the sum of all eigenvalues is the total variance.)
    The first seven columns of U are the principal components that can be used to express any facial shape as

fo = U7 a    (6)

where U7 is the matrix formed by the first seven columns of U, and a is the vector of principal component coefficients determined by

a = U7t fo.    (7)

Since U7 is fixed, facial deformations can be represented by the seven coefficients contained in a. Thus, for the eight shapes derived from the 3D scans,

a = [a1, a2, ... aK].    (8)
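   The whole decomposition can be expressed compactly in code. The following sketch (Python/NumPy, with variable names chosen to mirror Eqs. 1-8) takes the SVD of FO directly rather than forming the 3N x 3N covariance matrix; the left singular vectors are the eigenvectors of Cf and the squared singular values are its eigenvalues.

    import numpy as np

    def facial_pca(F, n_comp=7):
        """PCA of the adapted face meshes, mirroring Eqs. 1-8.

        F: (3N, K) matrix with one adapted mesh per column (here K = 8).
        Returns the mean face, the first n_comp eigenvectors (U7), the
        coefficients of the K training shapes, and the variance proportions.
        """
        mu_f = F.mean(axis=1, keepdims=True)          # the "mean face"
        F0 = F - mu_f                                 # deformations from the mean face (Eq. 2)
        # Left singular vectors of F0 are the eigenvectors of Cf = F0 F0^T (Eqs. 4-5).
        U, s, _ = np.linalg.svd(F0, full_matrices=False)
        eigvals = s ** 2                              # eigenvalues of Cf
        U7 = U[:, :n_comp]                            # retained principal components (Eq. 6)
        a = U7.T @ F0                                 # coefficients of the K shapes (Eqs. 7-8)
        return mu_f, U7, a, eigvals / eigvals.sum()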

Linear Estimator
   To drive the facial animation ultimately from time-varying marker data, the 18 marker locations must first be related to the rest of the mesh nodes for each of the eight adapted facial meshes. This is done by calculating a linear estimator, whose reliability should be good given that the number of marker locations (18) is substantially larger than the number of eigenvectors (7) needed to recover the variance.
    For each face scan, the 54 positions (18 markers x 3 axes) are expressed as a column vector p. Since the values in p are a subset of the values of f, they can be extracted and arranged in the matrix

P = [p1, p2, ... pK].    (9)

Removal of the mean position gives

PO = [po1, po2, ... poK].    (10)

PO and a were then used to determine a minimum mean squared error (MMSE) estimator A such that

a = A PO,        (11)

where

A = a POt (PO POt)-1.    (12)
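   A sketch of Eq. 12 in NumPy follows. One practical note: with only K = 8 scans, the 54 x 54 matrix PO POt is rank deficient, so a pseudoinverse is used here in place of the matrix inverse.

    import numpy as np

    def fit_linear_estimator(a, P0):
        """MMSE estimator A mapping marker deviations to PCA coefficients (Eq. 12).

        a : (7, K) coefficient matrix from the facial PCA
        P0: (54, K) mean-removed marker positions (18 markers x 3 axes)
        """
        # Pseudoinverse instead of inv: P0 @ P0.T has rank <= K and is singular.
        return a @ P0.T @ np.linalg.pinv(P0 @ P0.T)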

Animating Facial Motion
   Once the linear estimator has been determined from the eight facial scans, it can be applied to the position data measured by the OPTOTRAK either on a sample-by-sample basis or through interpolation of via point arrays.
   Any vector p can be used to estimate the complete facial shape as follows:

f = µf + fo    (13)

f = µf + U7 a    (14)

f = µf + U7 A po.    (15)

   The natural head motion can be restored using the rigid body components derived during data processing. Otherwise, the head can be fixed at any orientation desired.
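   Putting Eqs. 13-15 together, one frame of animation could be reconstructed as in the following sketch (names are ours; R and t stand for the rigid head rotation and translation recovered during data processing):

    import numpy as np

    def animate_frame(p, p_mean, mu_f, U7, A, R=None, t=None):
        """Recover a full facial mesh from one frame of marker data (Eqs. 13-15).

        p     : (54,) marker positions for this frame, head motion removed
        p_mean: (54,) mean marker positions from the training scans
        mu_f  : (3N,) mean face;  U7: (3N, 7);  A: (7, 54)
        R, t  : optional rigid head rotation/translation to restore head motion
        """
        p0 = p - p_mean                               # deviation from the mean marker positions
        f = mu_f + U7 @ (A @ p0)                      # Eq. 15
        nodes = f.reshape(-1, 3)                      # back to (N, 3) node coordinates
        if R is not None:                             # re-apply the measured head motion
            nodes = nodes @ R.T + t
        return nodes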

Direct Animation from Position
   Since the marker data were obtained at 60 Hz, matching the field rate of the North American/Japanese (NTSC) video standard, animation sequences can be generated simply by constructing one video field from the marker values at each time sample. Of course, the position data can be decimated to fit any desired animation rate such as 25 fps (European video), though rates at or below 15 fps (typical for QuickTime movies) are close to the threshold for the visual enhancement effect on speech [16].
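   Such decimation amounts to resampling the 60 Hz marker trajectories onto a new time base, as in this minimal sketch (linear interpolation is used purely for illustration):

    import numpy as np

    def resample_markers(markers, src_rate=60.0, dst_rate=25.0):
        """Resample (T, 18, 3) marker data from src_rate to dst_rate frames per second."""
        T = markers.shape[0]
        t_src = np.arange(T) / src_rate
        t_dst = np.arange(0.0, t_src[-1], 1.0 / dst_rate)
        flat = markers.reshape(T, -1)
        out = np.column_stack([np.interp(t_dst, t_src, flat[:, k])
                               for k in range(flat.shape[1])])
        return out.reshape(len(t_dst), *markers.shape[1:])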

Interpolation of Via Point Arrays
   An alternative and, from our point of view, more promising approach is to use via point analysis to extract arrays of position values at a slower, user-configurable sampling rate. The via point arrays are extracted using a 5th-order spline function. This function is numerically equivalent to a kinematic smoothness function that minimizes jerk (rate of change of acceleration) [17] and is well suited to describing control of biological movements. First applied to planar arm movements by Flash and Hogan [18], the minimum jerk function has proved useful in predicting point-to-point movements as well as a range of via-point movements (analogous to key frames in animation) such as handwriting [10] and speech articulation [19]. Formally, the minimum jerk criterion provides unique solutions to trajectory control from knowledge of only the initial, final, and via-point positions and the movement duration. The function is given here for the case of planar movements where X, Y are Cartesian coordinates and tf is the movement duration:

C = ½ ∫ [ (d³X/dt³)² + (d³Y/dt³)² ] dt,   integrated over 0 ≤ t ≤ tf.    (16)


   Once extracted, the via point arrays represent key orofacial configurations that are used to recover continuous facial deformations through interpolation of the 5th-order spline function. The quality of the recovery with respect to the original motion of the face is controlled by the error criterion (in this case, maximum distance error) set by the user. A weaker error constraint results in fewer via point arrays and hence greater data reduction. Figure 4 shows position-time series for two dimensions of facial motion, the extracted via point arrays, and the position paths recovered through interpolation of the via points for the first 1.5 seconds of a sentence utterance. Figure 5 gives a flavor of the data reduction and subsequent interpolation of facial configurations between extracted via point arrays.
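   As a rough illustration of the recovery step, the sketch below chains closed-form minimum jerk segments between successive via point arrays. It assumes zero velocity and acceleration at each via point, which is a simplification of the full via-point solution of [17, 18]; the names and frame rate are illustrative.

    import numpy as np

    def min_jerk_segment(x0, x1, n_samples):
        """Minimum jerk profile from x0 to x1 with zero endpoint velocity/acceleration."""
        tau = np.linspace(0.0, 1.0, n_samples)[:, None]
        s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5     # smooth 0 -> 1 blend
        return x0[None, :] + s * (x1 - x0)[None, :]

    def interpolate_via_points(via_points, durations, fps=60):
        """Recover a continuous trajectory from via point arrays.

        via_points: list of (D,) arrays (key orofacial configurations)
        durations : list of segment durations in seconds (len(via_points) - 1)
        """
        pieces = []
        for x0, x1, T in zip(via_points[:-1], via_points[1:], durations):
            n = max(2, int(round(T * fps)) + 1)
            pieces.append(min_jerk_segment(np.asarray(x0), np.asarray(x1), n)[:-1])
        pieces.append(np.asarray(via_points[-1])[None, :])
        return np.concatenate(pieces, axis=0)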
 
