Talking Heads: Facial Animation
Kinematics-Based Synthesis of Realistic Talking Faces


EXTENSIONS TO THE BASIC MODEL

    In this section, the facial animation model is extended to other areas of our research in audiovisual speech production.

Acoustic Synthesis From Faces
   In addition to driving facial animations, the facial motion data can be used to synthesize the speech acoustics through their correlation with the amplitude and spectral properties of the speech signal. The multilinear techniques used to determine these correlations are described in detail elsewhere [7, 11, 20]. Briefly, even fewer position locations (11-12) than the 18 used here are sufficient to generate intelligible acoustics entirely from the face. What is crucial to the synthesis, however, is that points from the chin, the lips, and the cheeks be used.
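
To make the mapping concrete, here is a minimal sketch of a frame-by-frame linear estimator from facial marker positions to acoustic parameters. It is not the multilinear procedure of [7, 11, 20]; the marker count, parameter count, and the use of an ordinary least-squares fit are assumptions for illustration only.

```python
# Sketch: least-squares linear map from facial marker positions to
# frame-synchronous acoustic parameters (e.g., amplitude and spectral terms).
# Shapes and names are illustrative assumptions, not the authors' pipeline.
import numpy as np

def fit_face_to_acoustics(face, acoustics):
    """face: (T, 3 * n_markers) marker coordinates per frame;
    acoustics: (T, n_params) acoustic parameters per frame.
    Returns W such that acoustics ~ [face, 1] @ W."""
    X = np.hstack([face, np.ones((face.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, acoustics, rcond=None)
    return W

def predict_acoustics(face, W):
    X = np.hstack([face, np.ones((face.shape[0], 1))])
    return X @ W

# Illustrative use with stand-in data: 18 markers in 3-D, 12 acoustic parameters
rng = np.random.default_rng(0)
face = rng.normal(size=(500, 18 * 3))
acoustics = rng.normal(size=(500, 12))
W = fit_face_to_acoustics(face, acoustics)
predicted = predict_acoustics(face, W)
```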

Recovery Of Tongue Positions
   The importance of the cheek region can also be seen in the recovery of vocal tract configurations from the facial motion data described in the same studies. By aligning vocal tract (midsagittal lips, tongue, and jaw) and facial data collected on different occasions from the same speaker for the same utterances, tongue position could be estimated from the facial motion at better than 83% reliability. Particularly surprising was that the tongue tip could be recovered at about 90% reliability. Removing the cheek positions from the estimation substantially reduced the strength of the correlation with the tongue, further demonstrating that the visible correlates of speech are not restricted to the lips and chin. It should be noted that, from the standpoint of causality, estimation of vocal tract motion from facial motion is actually an inversion; the 'forward' estimation is that of the face from the vocal tract, which has been done for an English speaker at better than 95% reliability overall [7].
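
The reported reliabilities correspond to how much of the tongue's movement variance a linear estimator can recover from the face. The sketch below fits such an estimator, scores it with a variance-accounted-for measure, and compares fits with and without an assumed block of cheek-marker columns; all data shapes and marker groupings are invented for illustration.

```python
# Sketch: recover tongue positions from facial motion with a linear estimator
# and compare the fit with and without the (assumed) cheek-marker columns.
import numpy as np

def fit_linear(X, Y):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def variance_accounted_for(Y, Y_hat):
    """Proportion of variance in Y captured by the estimate (0..1)."""
    return 1.0 - np.sum((Y - Y_hat) ** 2) / np.sum((Y - Y.mean(0)) ** 2)

def evaluate(face, tongue, keep_cols):
    X = face[:, keep_cols]
    W = fit_linear(X, tongue)
    Y_hat = np.hstack([X, np.ones((X.shape[0], 1))]) @ W
    return variance_accounted_for(tongue, Y_hat)

# Illustrative comparison on stand-in data (18 markers, midsagittal x/y)
rng = np.random.default_rng(1)
face = rng.normal(size=(400, 18 * 2))
tongue = rng.normal(size=(400, 4))          # e.g., tongue body and tip, x/y
all_cols = np.arange(face.shape[1])
no_cheek = all_cols[:24]                    # assumption: last columns are cheeks
print(evaluate(face, tongue, all_cols), evaluate(face, tongue, no_cheek))
```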

Synthesis Of The Tongue Tip
   The ability to recover the tongue tip motion from the face also suggests that a synthetic tongue tip could be realistically parametrized by the same facial motion data currently being used to animate the face. This will be implemented soon along with upper and lower dental arches.

Access To The Physiology
   This speaker and four other speakers of French and English have been recorded for similar tasks using unilateral arrays of 11-12 position sensors, but with the addition of hooked-wire EMG electrodes inserted into 8-9 orofacial muscles on the opposite side of the face. These studies, which are part of a long-range study of speech motor control [21-23], have shown that facial motion can be estimated from muscle EMG. Using simple autoregressive models (second-order AR) and a short delay (< 20 ms), facial motions can be estimated at better than 80% reliability [e.g., 11]. In fact, these same data are used to drive the muscle-based model of Lucero and colleagues described below [9]. Taken together, the high correlations among facial and vocal tract kinematics and orofacial muscle EMG suggest a single scheme of neuromotor control for the production of audiovisual behavior [for discussion, see 5].
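
One plausible reading of such an estimator is an ARX-style model in which each marker coordinate is predicted from its own two previous samples plus delayed, rectified EMG envelopes. The sketch below follows that reading; the two-sample delay (standing in for "< 20 ms"), the muscle count, and all signal shapes are assumptions, not the published analysis.

```python
# Sketch: second-order autoregressive estimator of facial motion from delayed
# EMG envelopes. Names, shapes, and the delay are illustrative assumptions.
import numpy as np

def fit_ar2_with_emg(y, emg, delay=2):
    """y: (T,) one marker coordinate; emg: (T, n_muscles) rectified envelopes."""
    rows, targets = [], []
    for t in range(max(2, delay), len(y)):
        rows.append(np.concatenate(([y[t - 1], y[t - 2]], emg[t - delay], [1.0])))
        targets.append(y[t])
    w, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets), rcond=None)
    return w

def predict_ar2_with_emg(w, y_init, emg, delay=2):
    """Roll the fitted model forward from two initial marker samples."""
    y = np.zeros(emg.shape[0])
    y[:2] = y_init
    for t in range(max(2, delay), len(y)):
        x = np.concatenate(([y[t - 1], y[t - 2]], emg[t - delay], [1.0]))
        y[t] = x @ w
    return y

# Illustrative use with stand-in signals: 8 muscles, one marker coordinate
rng = np.random.default_rng(2)
emg = np.abs(rng.normal(size=(1000, 8)))
marker = np.cumsum(rng.normal(size=1000)) * 0.01
w = fit_ar2_with_emg(marker, emg)
reconstructed = predict_ar2_with_emg(w, marker[:2], emg)
```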

Text-To-Audible-Visible Speech
   As an extension of the via point analysis technique, the facial animation model can be driven concatenatively from text input using a codebook of phoneme-specific via point arrays [10, 19]. Triphone-sized via point arrays are being extracted from recited sentence data such as those analyzed here, as well as from much larger sets of semi-spontaneous utterances. Preliminary tests have shown that the extracted sets of via point arrays can be used to specify target configurations from text strings. The primary appeal of the via point technique is that its minimum-jerk criterion is well suited to describing biological movements, which exhibit inherent smoothness.
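
To illustrate the smoothness that the criterion enforces, the sketch below generates facial parameter trajectories through a sequence of via points using piecewise minimum-jerk interpolation (zero velocity and acceleration at each via point). This is a simplified stand-in for the full via-point optimization of [10, 19], and the target configurations are invented.

```python
# Sketch: piecewise minimum-jerk trajectories through a sequence of via points.
# A simplified illustration; targets and parameter counts are invented.
import numpy as np

def minimum_jerk_segment(p0, p1, n_frames):
    """Classic minimum-jerk blend between two configurations over n_frames."""
    s = np.linspace(0.0, 1.0, n_frames)[:, None]
    blend = 10 * s**3 - 15 * s**4 + 6 * s**5    # zero vel./accel. at endpoints
    return p0 + (p1 - p0) * blend

def trajectory_through_via_points(via_points, frames_per_segment):
    """via_points: (n_points, n_params) successive target configurations."""
    segments = [
        minimum_jerk_segment(via_points[i], via_points[i + 1], frames_per_segment)
        for i in range(len(via_points) - 1)
    ]
    return np.vstack(segments)

# Illustrative use: three target configurations of six facial parameters
via = np.array([[0.0] * 6,
                [1.0, 0.5, 0.2, 0.0, 0.3, 0.1],
                [0.2] * 6])
trajectory = trajectory_through_via_points(via, frames_per_segment=30)
```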
 
