ISSUES OF REALISM
As can be seen in animation sequences derived
using the statistical model presented here, the
synthesized face bears a striking resemblance to the
subject whose facial motion drives the model. The
temporal match between the synthetic and the original
behavior is essentially perfect, and the spatial
deformations are on the whole faithfully recreated.
The static shape of the upper lip is not quite right,
which exacerbates small estimation errors,
particularly at the attachment points for the lip mesh.
No complex audiovisual synchronization is required,
whether the animation is paired with the original
audio signal or with acoustics synthesized from the
same facial motion parameters. Also, since rigid body head
motion is controlled independently, faces can be
presented at any orientation and with any degree of
natural or unnatural motion. Finally, the faces can
be synthesized from only five parameters, linearly
derived from the kinematic data, which are known to
be highly correlated with the underlying muscle
activity (EMG), the position and shape of the tongue,
and the speech acoustics. On the basis of these features,
have cosmetic and communicative realism been achieved?
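The linear derivation of a small set of control parameters from kinematic data can be sketched as a principal-component reduction. The following is a minimal illustration with synthetic data; the array sizes, variable names, and the use of SVD-based PCA are assumptions for this sketch, not the authors' actual pipeline.

```python
import numpy as np

# Illustrative sketch: reduce 3D marker kinematics to five linear
# components, mirroring the idea of five parameters linearly derived
# from the kinematic data. Dimensions here are invented.
rng = np.random.default_rng(0)
frames, n_coords = 200, 18          # e.g., 6 markers x 3 coordinates
kinematics = rng.standard_normal((frames, n_coords))

# Center the data, then take the top five principal components.
centered = kinematics - kinematics.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
basis = vt[:5]                      # 5 x n_coords linear basis
params = centered @ basis.T         # frames x 5 parameter tracks

# The same linear basis reconstructs an approximation of the motion,
# which is what lets so few parameters drive the face model.
reconstruction = params @ basis + kinematics.mean(axis=0)
print(params.shape)
```

Because the mapping is linear in both directions, driving the face reduces to matrix multiplication per frame, which is cheap at animation rates.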
Cosmetic Realism
Cosmetically, this model generates recognizable
faces superior to caricature-style models such as
those currently being developed for multimedia
applications within the telecommunications industry.
Its video and spatiotemporal realism also make our
data-driven model better than those derived from
Parke's FACS model [24]. For example, Massaro
and Cohen [25] have extended Parke's FACS
model [24] to audiovisual speech from text input
for English. In addition to their cartoon-like quality,
such models are controlled by static parameters that
are themselves caricatures of anatomical and
physiological structures. Benoît and colleagues have
adapted the same model for French text-to-audiovisual
synthesis by adding a 3D lip model
whose parameters were statistically derived from
static images for one speaker [26, 27]. The lip mesh
used here is a heavily re-engineered descendant of
the French lip model.
In terms of video-image quality, there are two
types of model that surpass ours in cosmetic realism.
Among other things, these models can represent
hair, eyes, teeth, and even parts of the torso, all
of which are missing from our current model. One
type extends the muscle-based facial motion models
developed by Waters and Terzopoulos [28, 29].
These models use video texture maps and the deformation
of sparse 3D polygon meshes to synthesize realistic
facial motion [30]. However, like their
predecessor, the Parke model, these models do not
use time-varying physiological measures either to
verify or to parametrize the dynamics and subsequent
behavior of the model system. At best, stylized
estimates of skeleto-muscular and facial structure
have been derived from static measures such as
computed tomography [e.g., 29, 31, 32].
Several models of this type have been adapted
for synthesis of facial motion associated with speech
[e.g., 9, 33]. Lucero et al. have extensively reworked
the structures controlling the model's dynamics,
e.g., by implementing more realistic parameters
constraining muscle force generation. The
resulting model is now stable enough to be driven
by the continuous muscle activity signals (EMG)
recorded contralaterally to the same sort of
movement data used to drive our current model.
Although much improved, the animation is computationally
expensive and has yet to be synchronized with the acoustics.
The second type of model achieving substantially better
cosmetic realism than ours is the
Video Rewrite system
developed by Bregler and colleagues
[8]. Video Rewrite concatenates audiovisual
triphones into synthetic sequences allowing the
speech of one person to be audiovisually dubbed
onto the background image of another person. This
is a very compelling system with possibly only one
cosmetic drawback: because only the portion of
the face containing the mouth and chin is dubbed,
the motion of the cheeks in the background image
may conflict visually with the motion of the mouth
and chin in the dubbed portion. As discussed above, we
have consistently found high correlations between
the motion of the chin and the cheeks for all of the
speakers examined thus far. An example of this is
shown graphically in Figure 4 where, even though
the range of vertical position for a location on the
upper cheek is only about 1 mm, the time-series
pattern matches very closely that of the chin.
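The chin-cheek relationship described above is a straightforward time-series correlation. The sketch below uses synthetic signals (the cheek modeled as a small-amplitude, noisy copy of the chin, consistent with the roughly 1 mm range noted above); the signals and scale factors are invented for illustration only.

```python
import numpy as np

# Synthetic stand-ins: a chin trajectory (a few mm) and a cheek
# trajectory that follows it at much smaller amplitude, plus noise.
t = np.linspace(0.0, 2.0, 400)
chin = 5.0 * np.sin(2 * np.pi * 3 * t) + 2.0 * np.sin(2 * np.pi * 7 * t)
cheek = 0.08 * chin + 0.02 * np.random.default_rng(1).standard_normal(t.size)

# Pearson correlation is insensitive to amplitude scaling, so even a
# ~1 mm cheek excursion can correlate strongly with the chin pattern.
r = np.corrcoef(chin, cheek)[0, 1]
print(round(r, 3))
```

The point of the sketch is that correlation measures the match of temporal pattern, not magnitude, which is why a small cheek excursion can still track the chin almost perfectly.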
Communicative Realism
The extent to which the model is
communicatively realistic is currently being tested in perception
and functional MRI studies using model generated
animations. Minimally, we expect the results of the
perception studies to be at least as good as those of the
Parke model derivatives. Psychometric tests using
such models [e.g., 34, 35, 36] have shown that
audiovisual presentations in noisy acoustic conditions
enhance speech intelligibility in a manner comparable to
that observed by Sumby and Pollack [37] for
natural faces. However, the cause of the enhancement
is not known. Indeed, recreating the general
spatial (amplitude) and temporal (synchrony) properties
of the audible-visible behavior, as done in
cartoon animation, may be enough to enhance the
intelligibility of the acoustic signal somewhat, simply
because the viewing listener is given visual information
about the framing (prosody, syllable
structure) that entrains the auditory system to detect
phonetic content (consonant and vowel segments).
A potential advantage of our data-driven model is
the measurable cross-modal correlation between the
acoustics and the facial motion data driving the
model. Thus, we hope to determine the extent to
which the increased intelligibility of audible-visible
stimuli (over audible alone) is due to the presence
of visible information specific to visual phonetic
and/or higher (e.g., lexical, syntactic) processing
levels, rather than simply the synchronization of
audible and visible stimuli.
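One common way to quantify the cross-modal correlation mentioned above is to compare a frame-rate acoustic amplitude envelope against a facial-motion parameter. The sketch below uses synthetic signals and a simple RMS-envelope-plus-Pearson pipeline; this analysis choice is an assumption for illustration, not necessarily the authors' exact method.

```python
import numpy as np

# Synthetic example: audio whose loudness follows a mouth-opening
# trajectory, demonstrating a measurable acoustics-motion correlation.
rng = np.random.default_rng(2)
frames, samples_per_frame = 100, 160
motion = np.abs(np.sin(np.linspace(0, 6 * np.pi, frames)))  # mouth opening

# Audio amplitude is modulated by the motion signal (plus noise).
audio = np.repeat(motion, samples_per_frame) * rng.standard_normal(
    frames * samples_per_frame
)

# RMS energy per video frame aligns the audio with the motion data.
rms = np.sqrt((audio.reshape(frames, samples_per_frame) ** 2).mean(axis=1))
r = np.corrcoef(motion, rms)[0, 1]
print(round(r, 2))
```

A correlation computed this way gives a concrete number for how much visible information about the acoustics the face carries, which is the quantity the proposed perception studies would manipulate.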