Talking Heads: Facial Animation


Kinematics-Based Synthesis of Realistic Talking Faces

(This section is based on a paper by Eric Vatikiotis-Bateson, Takaaki Kuratate, Mark Tiede, and Hani Yehia of the ATR Human Information Processing Research Laboratories, entitled "Kinematics-Based Synthesis of Realistic Talking Faces," which has been submitted for publication. The paper has been adapted for the Talking Heads website by Philip Rubin and Eric Vatikiotis-Bateson.)

A new method is described for animating talking faces that are both cosmetically and communicatively realistic. The animations can be driven directly from a small set of time-varying positions measured on the face at the video field rate, or at lower rates by interpolating key-frame configurations derived by via point analysis. This method of animation offers distinct benefits for both industrial and behavioral research applications, because its kinematic control parameters are easily obtained and are highly correlated with the measurable acoustic and neuromuscular events associated with speech production.

INTRODUCTION

During spoken communication, speakers' faces convey many kinds of relevant information, not the least of which are visible, time-varying correlates of the activity of the vocal tract that shapes the speech acoustics [1, 2]. Contrary to popular belief, and to the common practice of speech researchers and engineers tackling the problem of audiovisual synthesis and recognition, the visible correlates of speech are not limited to the small area encompassing the lips, oral aperture, and chin [for overview, see 3]. Rather, the entire face - certainly everything below the eyes - contributes information about the speech signal [4, 5]. Nor are the visible correlates of speech restricted to a small set of phonetic elements defined by the shape and position of the most visible articulators: the lips and, less directly, jaw height. Instead, the correlation appears to be much more continuous throughout the production of speech [6].
To the extent that visible acoustic correlates can be computed, they are available to machine recognizers. Determining whether human perceivers make use of, or even detect, such information is a long-range goal of this research. Ancillary to that goal is the need to animate synthetic talking faces that contain, at minimum, the same audible-visible correlates observed in human orofacial motion. We term this 'communicative realism,' bearing in mind that initially our focus is limited to the visible-acoustic or phonetic aspects of facial motion [7], not the more comprehensive domains of facial expressions of emotion and the paralinguistic gestures accompanying prosody and discourse. A second criterion, useful for both industrial applications and behavioral research, is that the animated faces should be cosmetically realistic. With very few exceptions [e.g., 8, 9], talking faces have been cartoon caricatures that do not look like real people.

In what follows, a new system is described for animating talking faces that are both cosmetically and communicatively realistic. Animations are driven by a small set of positions on the face measured at the video field rate. Lower bit rates (< 100 bps) can be achieved by interpolating key-frame configurations of the measured positions derived by via point analysis [10]. The animations can then be synchronized with the natural acoustic signal or with a highly intelligible acoustic signal synthesized from facial motion parameters [11]. Many cosmetic details of the full facial model are not yet implemented (e.g., teeth, eyes, hair). Unlike other systems, however, the model is driven by measurable parameters whose correlations with acoustic, articulatory, and physiological levels of observation have been examined [1, 12]. This makes the system extremely useful for audiovisual research and applications development, and allows it to serve as a common platform for integrating the full range of human orofacial behaviors - expressions of emotion, communicative gestures, and speech - which, in reality, tend to occur simultaneously.
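To make the key-frame pathway concrete, the sketch below illustrates the reconstruction step in Python: sparse key-frame marker configurations, standing in for the via points that via point analysis [10] would extract from measured motion, are interpolated back up to the 60 Hz video field rate. The marker count, key-frame values, and the use of cubic splines are illustrative assumptions for this sketch, not details of the authors' system.

    import numpy as np
    from scipy.interpolate import CubicSpline

    FIELD_RATE_HZ = 60.0   # video field rate at which face positions are measured
    N_MARKERS = 12         # assumed size of the small set of facial positions

    # Key-frame times (s) and one (x, y, z) triple per marker at each key
    # frame -> shape (n_keyframes, N_MARKERS, 3).  In the actual system these
    # would come from via point analysis [10]; here they are invented.
    key_times = np.array([0.00, 0.18, 0.35, 0.61, 0.80])
    key_frames = np.random.default_rng(0).normal(
        size=(key_times.size, N_MARKERS, 3))

    # Cubic splines as a simple stand-in for smooth via-point interpolation.
    spline = CubicSpline(key_times, key_frames, axis=0)

    # Resample at the field rate; each row would drive one frame of the face
    # model, to be synchronized with the natural or synthesized acoustics.
    t = np.arange(key_times[0], key_times[-1], 1.0 / FIELD_RATE_HZ)
    trajectories = spline(t)   # shape (len(t), N_MARKERS, 3)
    print(trajectories.shape)  # (48, 12, 3) for this example

Because only the sparse key frames need to be stored or transmitted, with the full field-rate trajectories recovered by interpolation, the data rate depends on key-frame density and quantization rather than on the field rate itself.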
 

Audible-Visible Synthesis

Extensions to the Basic Model

Issues of Realism

Conclusion

References


 
