Talking Heads: Facial Animation
Kinematics-Based Synthesis of Realistic Talking Faces


ISSUES OF REALISM

    As can be seen in animation sequences derived using the statistical model presented here, the synthesized face bears a striking resemblance to the subject whose facial motion drives the model. The temporal match between the synthetic and the original behavior is essentially perfect, and the spatial deformations are on the whole faithfully recreated. The static shape of the upper lip is not quite right, which exacerbates small estimation errors, particularly at the attachment points for the lip mesh. No complex audiovisual synchronization is required, whether the original audio signal or the acoustics synthesized from the same facial motion parameters are used. Also, since rigid-body head motion is controlled independently, faces can be presented at any orientation and with any degree of natural or unnatural motion. Finally, the faces can be synthesized from only five parameters, linearly derived from the kinematic data, which are known to be highly correlated with the underlying muscle activity (EMG), the position and shape of the tongue, and the speech acoustics. On the basis of these features, have cosmetic and communicative realism been achieved?
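    To make the linear parametrization concrete, the sketch below shows one standard way such parameters could be derived. It is only an illustration, assuming principal component analysis (PCA) over flattened 3D marker trajectories; it is not necessarily the procedure used here, and the array shapes, sizes, and names are hypothetical.

    import numpy as np

    def linear_face_parameters(markers, n_params=5):
        # markers: (n_frames, n_markers * 3) flattened 3D positions per frame.
        # Returns per-frame parameters plus the linear basis and mean, so that
        # markers ~= mean + params @ basis.
        mean = markers.mean(axis=0)
        centered = markers - mean
        # SVD yields the principal components (rows of vt), ordered by variance.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[:n_params]            # (n_params, n_markers * 3)
        params = centered @ basis.T      # (n_frames, n_params)
        return params, basis, mean

    # Example: 1000 frames of 18 markers tracked in 3D (stand-in random data).
    markers = np.random.randn(1000, 18 * 3)
    params, basis, mean = linear_face_parameters(markers)
    frames = mean + params @ basis       # low-dimensional reconstruction

    Under this sketch, the five retained components would play the role of the five control parameters described above, deforming the face mesh frame by frame.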

Cosmetic Realism
   Cosmetically, this model generates recognizable faces superior to the caricature-style models currently being developed for multimedia applications within the telecommunications industry. Its video-image and spatiotemporal realism also make our data-driven model better than those derived from Parke's parametric model [24]. For example, Massaro and Cohen [25] have extended the Parke model [24] to audiovisual speech synthesized from English text input. In addition to their cartoon-like quality, such models are controlled by static parameters that are themselves caricatures of anatomical and physiological structures. Benoît and colleagues have adapted the same model for French text-to-audiovisual synthesis by adding a 3D lip model whose parameters were statistically derived from static images of one speaker [26, 27]. The lip mesh used here is a heavily re-engineered descendant of that French lip model.
    In terms of video-image quality, there are two types of model that surpass ours in cosmetic realism. Among other things, these models can represent hair, eyes, teeth, and even parts of the torso, all of which are missing from our current model. The first type extends the muscle-based facial motion models developed by Waters and Terzopoulos [28, 29]. These models use video texture maps and the deformation of sparse 3D polygon meshes to synthesize realistic facial motion [30]. However, like their predecessor, the Parke model, these models do not use time-varying physiological measures either to verify or to parametrize the dynamics and subsequent behavior of the model system. At best, stylized estimates of skeleto-muscular and facial structure have been derived from static measures such as computed tomography [e.g., 29, 31, 32].
    Several models of this type have been adapted for the synthesis of facial motion associated with speech [e.g., 9, 33]. Lucero et al. have extensively reworked the structures controlling the model's dynamics, e.g., by implementing more realistic constraints on muscle force generation. The resulting model is now stable enough to be driven by continuous muscle activity signals (EMG) recorded contralaterally to the same sort of movement data used to drive our current model. Although much improved, the animation is computationally expensive and has yet to be synchronized with the acoustics.
    The second type of model achieving substantially better cosmetic realism than ours is the Video Rewrite system developed by Bregler and colleagues [8]. Video Rewrite concatenates audiovisual triphones into synthetic sequences, allowing the speech of one person to be audiovisually dubbed onto the background image of another. It is a very compelling system with possibly only one cosmetic drawback: because only the portion of the face containing the mouth and chin is dubbed, there may be a visual conflict between the motion of the cheeks in the background image and that of the dubbed mouth and chin. As discussed above, we have consistently found high correlations between the motion of the chin and the cheeks for all of the speakers examined thus far. An example is shown graphically in Figure 4 where, even though the range of vertical position for a location on the upper cheek is only about 1 mm, its time-series pattern closely matches that of the chin.
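    As a rough illustration of the chin-cheek coupling just described, the sketch below computes a Pearson correlation between two vertical-position time series. The signals are synthetic stand-ins, not our measurement data; they are constructed so that the cheek trace spans roughly 1 mm while following the chin's temporal pattern, as in Figure 4.

    import numpy as np

    def pearson_r(x, y):
        # Correlation between two equal-length time series.
        x = x - x.mean()
        y = y - y.mean()
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

    # Stand-in signals: 5 s sampled at 120 Hz. The cheek follows the chin's
    # opening/closing pattern at much smaller amplitude, plus sensor noise.
    t = np.linspace(0.0, 5.0, 600)
    chin = 8.0 * np.sin(2 * np.pi * 3.0 * t)                # mm, vertical
    cheek = (0.5 * np.sin(2 * np.pi * 3.0 * t)
             + 0.05 * np.random.randn(t.size))              # ~1 mm range

    print(f"chin-cheek correlation: {pearson_r(chin, cheek):.2f}")  # near 1.0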

Communicative Realism
   The extent to which the model is communicatively realistic is currently being tested in perception and functional MRI studies using model-generated animations. Minimally, we expect the results of the perception studies to be as good as those obtained with the Parke model derivatives. Psychometric tests using such models [e.g., 34, 35, 36] have shown that audiovisual presentation in noisy acoustic conditions enhances speech intelligibility, along the lines of what Sumby and Pollack [37] observed for natural faces. However, the cause of the enhancement is not known. Indeed, recreating the general spatial (amplitude) and temporal (synchrony) properties of the audible-visible behavior, as is done in cartoon animation, may be enough to enhance the intelligibility of the acoustic signal somewhat, simply because the viewing listener is given visual information about the framing (prosody, syllable structure) that entrains the auditory system to detect phonetic content (consonant and vowel segments). A potential advantage of our data-driven model is the measurable cross-modal correlation between the acoustics and the facial motion data driving the model. Thus, we hope to determine the extent to which the increased intelligibility of audible-visible stimuli (over audible alone) is due to the presence of visible information specific to visual phonetic and/or higher (e.g., lexical, syntactic) processing levels, rather than simply to the synchronization of audible and visible stimuli.
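    One simple way to quantify such a cross-modal correlation is sketched below: the acoustics are reduced to a frame-synchronous RMS amplitude envelope and correlated with a single motion parameter. The sampling rates, durations, and variable names are illustrative assumptions, not those of our system.

    import numpy as np

    def rms_envelope(audio, sr, frame_rate):
        # Frame-synchronous RMS amplitude of the audio signal.
        hop = sr // frame_rate
        n_frames = len(audio) // hop
        frames = audio[: n_frames * hop].reshape(n_frames, hop)
        return np.sqrt((frames ** 2).mean(axis=1))

    sr, frame_rate = 16000, 100               # audio rate, motion rate (Hz)
    audio = np.random.randn(2 * sr)           # stand-in for 2 s of speech
    motion = np.random.randn(2 * frame_rate)  # stand-in motion parameter

    envelope = rms_envelope(audio, sr, frame_rate)
    r = np.corrcoef(envelope, motion)[0, 1]   # cross-modal correlation
    print(f"audio-motion correlation: {r:.2f}")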
 
