AUDIBLE-VISIBLE SYNTHESIS
The kinematics-based method of animating
talking faces entails three principal steps:
data acquisition, analysis, and animation.
Data Acquisition
Two types of data are currently required: time-varying facial motion and static representations of the 3D head plus a video texture map. These are shown in Figure 1.
Figure 1. Basic head data for the two measurement systems: a. Marker positions for recording head and facial motion data; b. Original full-head mesh from a static 3D scan.
Facial Motion And Acoustics
Three-dimensional position data were recorded optoelectronically (OPTOTRAK) for 18 orofacial locations (infrared-emitting diode, or ired, positions are shown in Figure 1a) during recitations of excerpts from a Japanese children's story (Momotarou) by a male Japanese speaker. Position measures were digitized at 60 Hz along with simultaneous recording of the speech acoustics at 10 kHz. Since head motion is normally large relative to facial motion, its effects on the absolute positions of the facial measures must be removed. Therefore, rigid body head motion was also measured using 5 ireds attached to a lightweight appliance worn on the head (see Figure 1a). A quaternion method [13] was used to decompose the head motion into its six components (three rotations and three translations) and to calculate the independent motion of the facial markers (for processing details, see [14]).
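The head-motion correction step can be sketched as follows. The paper uses a quaternion method [13]; the sketch below substitutes an SVD-based rigid alignment (Kabsch/Procrustes), which recovers the same rotation and translation. Array names, shapes, and the head-fixed reference frame are assumptions for illustration only.

```python
import numpy as np

def rigid_fit(ref, obs):
    """Rotation R and translation t such that obs ~= ref @ R.T + t.
    ref, obs: (5, 3) head-marker positions (reference frame vs. current sample).
    SVD-based stand-in for the quaternion method of [13]."""
    ref_c, obs_c = ref - ref.mean(0), obs - obs.mean(0)
    U, _, Vt = np.linalg.svd(ref_c.T @ obs_c)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflection
    R = Vt.T @ D @ U.T
    t = obs.mean(0) - R @ ref.mean(0)
    return R, t

def remove_head_motion(head_ref, head_frames, face_frames):
    """Express facial markers in a head-fixed frame, sample by sample.
    head_frames: (T, 5, 3); face_frames: (T, 18, 3); returns (T, 18, 3)."""
    out = np.empty_like(face_frames)
    for i, (head, face) in enumerate(zip(head_frames, face_frames)):
        R, t = rigid_fit(head_ref, head)
        out[i] = (face - t) @ R   # equivalent to applying R.T to each (face - t) marker
    return out
```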
Figure 2. Eight 3D faces extracted from full-head scans during sustained production of:
a. five Japanese vowels - /a, i, u, e, o/ and
b. three non-speech postures - open mouth, relaxed closed mouth, and clenched closed mouth.
Static Face Scans
A set of eight full-head 3D scans and video texture maps covering a range of speech and non-speech orofacial configurations (see Figure 2) was obtained with a laser range scanner (Cyberware, Inc.). The set consisted of static configurations for the five Japanese vowels (/a, i, u, e, o/) and three non-speech configurations: mouth wide open, mouth closed with teeth clenched, and mouth closed but relaxed. Scan resolution was 512 x 512 pixels. The average resolution of each extracted face is somewhat less than 300 x 300, containing 71,100 nodes and 141,900 polygons. Feature contours for the eyes, nose, jaw, and lips (inner and outer contours) are identified for each scan, along with the 18 positions approximating the placement of the ired markers. Node coordinates are converted from the scanner's cylindrical coordinates (r, theta) to Cartesian 3D (x, y, z).
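The coordinate conversion itself is straightforward; a minimal sketch follows, assuming each node is stored as radius r, angle theta (radians), and height h along the scanner axis, with y taken as the vertical axis (the axis convention is an assumption, not stated in the text).

```python
import numpy as np

def cylindrical_to_cartesian(r, theta, h):
    """Map cylindrical scan samples to 3D Cartesian coordinates.
    Assumed convention: theta sweeps around the vertical (y) axis."""
    x = r * np.sin(theta)
    z = r * np.cos(theta)
    y = h
    return np.stack([x, y, z], axis=-1)
```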
Analysis
The analysis techniques outlined below entail feature-contour alignment for mesh initialization, field morphing for mesh adaptation, and multilinear techniques for extracting control parameters from the scanned face data.
Figure 3. Mesh adaptation entails matching feature
contours of the generic mesh
(a) with features for
each scanned face (b). Generic mesh nodes are adjusted
to match position measurement locations (c)
and then the texture map is applied.
Face And Lip Mesh Adaptation
A generic mesh for the face (exclusive of the
lips) consisting of only N = 576 nodes and 844
polygons is used to reduce the computational complexity
of the original 3D scans. As can be seen in
Figure 3a, nodes are most heavily concentrated periorally,
along the nose, and especially around the
eyes, but are fairly sparsely distributed elsewhere.
The feature contours for the eyes, nose, jaw, and outer lip are identified on the mesh (Figure 3b).
For each of the eight face scans, the mesh is
lined up along the feature contours and nodes are
adjusted to match the 18 approximated marker positions.
The remaining mesh nodes are then adjusted
through field morphing [15] and the texture map is
reattached (Figures 3b,c).
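How the unconstrained nodes follow the feature-matched nodes can be sketched with a simple scattered-data displacement interpolation. The inverse-distance weighting below is only a stand-in for the feature-line field morphing of [15], and all names and parameters are illustrative assumptions.

```python
import numpy as np

def propagate_displacements(nodes, ctrl_idx, ctrl_targets, power=2.0, eps=1e-9):
    """Move every generic-mesh node by an inverse-distance-weighted blend of the
    displacements of the control nodes (feature-contour and marker nodes).
    nodes: (N, 3) generic mesh; ctrl_idx: indices of the control nodes;
    ctrl_targets: (M, 3) positions those control nodes must reach on the scan."""
    disp = ctrl_targets - nodes[ctrl_idx]                         # (M, 3) known displacements
    d = np.linalg.norm(nodes[:, None, :] - nodes[ctrl_idx][None, :, :], axis=-1)
    w = 1.0 / (d + eps) ** power                                  # (N, M) influence weights
    w /= w.sum(axis=1, keepdims=True)
    adapted = nodes + w @ disp                                    # blended displacement field
    adapted[ctrl_idx] = ctrl_targets                              # control nodes land exactly
    return adapted
```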
Each adapted facial mesh is expressed as a column vector f containing 3N elements, the x, y, and z values of each of the N nodes. Since K = 8 facial meshes were made, the ensemble of adapted mesh nodes is arranged in matrix form as
F = [f_1, f_2, ... f_K]. (1)
The "mean face" µ_f is then defined as the average value of each row of F and subtracted from each column of F, generating
F_0 = [f_01, f_02, ... f_0K], (2)
the matrix of facial deformations from the mean face. Any facial shape can now be expressed as the sum
f = f_0 + µ_f. (3)
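Equations (1)-(3) amount to stacking the adapted meshes column-wise and removing the row means. A minimal NumPy sketch (variable names are assumptions) follows.

```python
import numpy as np

def build_deformation_matrix(meshes):
    """meshes: list of K = 8 adapted face meshes, each an (N, 3) array of node coordinates.
    Returns F (3N, K), the deformation matrix F0, and the mean face mu_f."""
    F = np.column_stack([m.reshape(-1) for m in meshes])   # Eq. (1)
    mu_f = F.mean(axis=1, keepdims=True)                   # mean face
    F0 = F - mu_f                                          # deformations from the mean, Eq. (2)
    return F, F0, mu_f

# Any face is then recovered as f = f0 + mu_f, Eq. (3).
```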
The outer and inner lip contours specified in
each face scan are used to constrain a lip mesh consisting
of 600 nodes and 1100 polygons. Each
contour consists of 40 nodes on the lip mesh. A
third lip contour is linearly interpolated midway
between the two original contours on the scanned
surface. The lip mesh is then numerically generated
using cubic spline interpolation of the orthogonal
triplets of control points from the three contours.
Currently, the lip mesh is attached to the face mesh at the border of the outer lip contour and is passively deformed by the deformation of the face mesh; it is therefore not included in the estimation of the mean face or in the subsequent principal component analysis (PCA).
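A minimal sketch of the lip-surface construction, assuming the 40-node outer and inner contours are in correspondence and that the 600-node mesh is a 40 x 15 grid (the grid split is an assumption consistent with the stated node count). Here the midway contour defaults to a simple average, whereas the paper samples it on the scanned surface.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def build_lip_mesh(outer, inner, mid=None, rows=15):
    """outer, inner: (40, 3) corresponding outer/inner lip contour nodes from one scan.
    mid: optional (40, 3) midway contour; if None, the average stands in for the
    contour interpolated on the scanned surface.
    Returns a (40, rows, 3) lip surface grid (40 x 15 = 600 nodes)."""
    if mid is None:
        mid = 0.5 * (outer + inner)
    u = np.array([0.0, 0.5, 1.0])                  # parameter across the lip
    u_new = np.linspace(0.0, 1.0, rows)
    grid = np.empty((outer.shape[0], rows, 3))
    for k in range(outer.shape[0]):
        ctrl = np.vstack([outer[k], mid[k], inner[k]])   # outer-mid-inner triplet
        grid[k] = CubicSpline(u, ctrl, axis=0)(u_new)    # spline across the triplet
    return grid
```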
Facial PCA
The principal components of F can be found by applying singular value decomposition (SVD) to the covariance matrix
C_f = F_0 F_0^t, (4)
yielding
C_f = U S U^t. (5)
U is a unitary matrix whose columns contain the eigenvectors of C_f, normalized to unit length. S is a diagonal matrix whose diagonal entries are the respective eigenvalues.
Since the ensemble consists of eight facial shapes, only the first seven eigenvalues are larger than zero, and consequently only the first seven columns of U are meaningful. In fact, the first five eigenvectors account for more than 99% of the variance observed in the data. (Each eigenvalue of S denotes the variance accounted for by the respective eigenvector; thus the sum of all eigenvalues is the total variance.)
The first seven columns of U are the principal components that can be used to express any facial shape as
f_0 = U_7 a, (6)
where U_7 is the matrix formed by the first seven columns of U, and a is the vector of principal component coefficients determined by
a = U_7^t f_0. (7)
Since U_7 is fixed, facial deformations can be represented by the seven coefficients contained in a. Thus, for the eight shapes derived from the 3D scans,
a = [a_1, a_2, ... a_K]. (8)
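Equations (4)-(8) in NumPy terms, as a minimal sketch (names are assumptions): with only eight shapes, applying SVD directly to F_0 gives the same principal components as eigendecomposing the covariance matrix, without forming the large 3N x 3N matrix.

```python
import numpy as np

def facial_pca(F0, n_components=7):
    """F0: (3N, K) matrix of mean-removed face shapes, Eq. (2).
    Returns U7 (3N, 7), the eigenvalues of C_f, the coefficients a (7, K),
    and the fraction of variance captured by the retained components."""
    # F0 = U * diag(s) * Vt, so C_f = F0 @ F0.T = U @ diag(s**2) @ U.T  (Eqs. 4-5)
    U, s, _ = np.linalg.svd(F0, full_matrices=False)
    eigvals = s ** 2                          # diagonal of S in Eq. (5)
    U7 = U[:, :n_components]                  # first seven principal components
    a = U7.T @ F0                             # coefficients for all K shapes, Eqs. (7)-(8)
    var_explained = eigvals[:n_components].sum() / eigvals.sum()
    return U7, eigvals, a, var_explained
```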
Linear Estimator
In order ultimately to drive the facial animation
from time-varying marker data, it is first necessary
to relate the 18 marker locations with the rest of the
mesh nodes for each of the eight adapted facial
meshes. This is done by calculating a linear estimator,
whose reliability is likely to be good given that
the number of marker locations (18) is substantially
larger than the number of eigenvectors (7) needed
to recover the variance.
For each face scan, the 54 (18 markers x 3 axes) positions were expressed by a column vector p. Since the values in p are a subset of the values of f, they can be extracted and arranged in the matrix
P = [p_1, p_2, ... p_K]. (9)
Removal of the mean position gives
P_0 = [p_01, p_02, ... p_0K]. (10)
P_0 and a were then used to determine a minimum mean squared error (MMSE) estimator:
a = A P_0, (11)
A = a P_0^t (P_0 P_0^t)^-1. (12)
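Equation (12) in code form, as a minimal sketch. Since P_0 P_0^t has rank at most K - 1 here, the sketch uses the Moore-Penrose pseudoinverse in place of the plain inverse; that substitution is an implementation choice, not something stated in the text.

```python
import numpy as np

def mmse_estimator(a, P0):
    """a: (7, K) PCA coefficients, Eq. (8); P0: (54, K) mean-removed marker positions.
    Returns A (7, 54) such that a ~= A @ P0, Eqs. (11)-(12)."""
    # Pseudoinverse handles the rank-deficient 54 x 54 Gram matrix P0 @ P0.T.
    return a @ P0.T @ np.linalg.pinv(P0 @ P0.T)

def estimate_coefficients(A, p0):
    """Map one mean-removed marker vector p0 (54,) to PCA coefficients, Eq. (11)."""
    return A @ p0
```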
Animating Facial Motion
Once the linear estimator is determined from the eight facial scans, it can be applied to the position data measured by the OPTOTRAK on a sample-by-sample basis or through interpolation of via point arrays.
Any vector p can be used to estimate the complete facial shape as follows:
f = µ_f + f_0 (13)
f = µ_f + U_7 a (14)
f = µ_f + U_7 A p_0. (15)
The natural head motion can be restored using
the rigid body components derived during data
processing. Otherwise, the head can be fixed at any
orientation desired.
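Putting Eqs. (13)-(15) together for a single OPTOTRAK sample, a minimal sketch follows; the optional rigid transform for restoring head motion and all variable names are assumptions for illustration.

```python
import numpy as np

def reconstruct_face(p, mu_p, mu_f, U7, A, R=None, t=None):
    """p: (54,) marker positions for one sample; mu_p: (54,) mean marker vector;
    mu_f: (3N,) mean face; U7: (3N, 7); A: (7, 54).
    R, t: optional rigid head pose for this sample (give both or neither).
    Returns the full face as an (N, 3) array, per Eq. (15)."""
    p0 = p - mu_p                         # mean-removed marker vector
    f = mu_f + U7 @ (A @ p0)              # Eq. (15): f = mu_f + U7 A p0
    nodes = f.reshape(-1, 3)
    if R is not None:                     # optionally restore the measured head motion
        nodes = nodes @ R.T + t
    return nodes
```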
Direct Animation from Position
Since the marker data were obtained at 60 Hz and the North American/Japanese (NTSC) video standard was used, the animation sequences can be generated simply by configuring one video field from the marker values at each time sample. Of course, the
position data can be decimated to fit any desired
animation rate such as 25 fps (European video),
though rates at or below 15 fps (typical for QuickTime movies)
are close to the threshold for the visual enhancement effect on speech [16].
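Matching the 60 Hz marker stream to another animation rate can be done by simple temporal resampling; a minimal sketch follows (25 fps has no integer relation to 60 Hz, so plain decimation only works for rates such as 30 or 15 fps, and linear interpolation is used here as an assumed stand-in).

```python
import numpy as np

def resample_markers(frames, src_hz=60.0, dst_fps=25.0):
    """frames: (T, 18, 3) marker data sampled at src_hz.
    Returns the data linearly interpolated onto a dst_fps time base."""
    T = frames.shape[0]
    t_src = np.arange(T) / src_hz
    t_dst = np.arange(0.0, t_src[-1], 1.0 / dst_fps)
    flat = frames.reshape(T, -1)
    out = np.stack([np.interp(t_dst, t_src, flat[:, j]) for j in range(flat.shape[1])], axis=1)
    return out.reshape(len(t_dst), 18, 3)
```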
Interpolation of Via Point Arrays
An alternative and, from our point of view, more
promising approach is to use via point analysis to
extract arrays of position values at a slower, user-configurable
sampling rate. The via point arrays are
extracted using a 5th-order spline function. This
function is numerically equivalent to a kinematic
smoothness function that minimizes jerk (rate of
change of acceleration) [17] and is well suited to
describing control of biological movements. First
applied to planar arm movements by Flash and
Hogan [18], the minimum jerk function has proved
useful in predicting point-to-point movements as
well as a range of via-point movements (analogous
to key frames in animation) such as handwriting
[10] and speech articulation [19]. Formally, the
minimum jerk criterion provides unique solutions to
trajectory control from knowledge of only the initial,
final, and via-point positions and the movement
duration. The function is given here for the case of planar movements, where X and Y are Cartesian coordinates and t_f is the movement duration:
C = (1/2) ∫_0^t_f [ (d³X/dt³)² + (d³Y/dt³)² ] dt. (16)
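The familiar closed-form solution for a single minimum-jerk segment with zero velocity and acceleration at both ends is a quintic in normalized time; a minimal sketch follows. This rest-to-rest case is a simplification of the full via-point solution of [17, 18], and the function names are assumptions.

```python
import numpy as np

def min_jerk_segment(x0, xf, tf, n=60):
    """Rest-to-rest minimum-jerk trajectory from x0 to xf over duration tf.
    x0, xf: (D,) start/end positions. Returns (times, positions) of shapes (n,), (n, D)."""
    x0, xf = np.asarray(x0, float), np.asarray(xf, float)
    tau = np.linspace(0.0, 1.0, n)[:, None]            # normalized time t / tf
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5          # quintic blend; zero vel/acc at both ends
    return tau.ravel() * tf, x0 + (xf - x0) * s
```

Chaining such segments between successive via points yields a smooth recovered path, although it comes to rest at every via point; the full via-point solution allows nonzero velocity there.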
Once extracted, the via point arrays represent key orofacial configurations that are used to recover continuous facial deformations through interpolation with the 5th-order spline function. The quality of
the recovery with respect to the original motion of
the face is controlled by the error criterion (in this
case, maximum distance error) set by the user. A
weaker error constraint results in fewer via point
arrays and hence greater data reduction.
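One plausible greedy scheme for the extraction step (not necessarily the authors' algorithm) is sketched below: start from the endpoints, reconstruct the trajectory, and keep inserting a via point at the sample of largest distance error until the user's error criterion is met. Linear interpolation stands in for the 5th-order spline reconstruction, and all names are assumptions.

```python
import numpy as np

def extract_via_points(signal, t, max_err):
    """signal: (T, D) marker trajectories; t: (T,) time stamps.
    Greedily inserts via points until the maximum distance error falls below max_err.
    Returns the sorted indices of the retained via point array."""
    idx = [0, len(signal) - 1]                         # always keep the endpoints
    while True:
        recon = np.stack(
            [np.interp(t, t[idx], signal[idx, d]) for d in range(signal.shape[1])], axis=1)
        err = np.linalg.norm(signal - recon, axis=1)   # per-sample distance error
        worst = int(err.argmax())
        if err[worst] <= max_err:
            return np.array(idx)
        idx = sorted(idx + [worst])                    # add a via point at the worst sample
```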
Figure 4
shows position-time series for two dimensions of
facial motion, the extracted via point arrays, and the
position paths recovered through interpolation of
the via points for the first 1.5 seconds of a sentence
utterance.
Figure 5 gives a flavor of the data reduction
and subsequent interpolation of facial configurations
between extracted via point arrays.