EXTENSIONS TO THE BASIC MODEL
In this section, the scope of the facial
animation model is expanded to other areas of our research
in audiovisual speech production.
Acoustic Synthesis From Faces
In addition to driving facial animations, the facial motion data can be used to synthesize speech acoustics through their correlation with the amplitude and spectral properties of the acoustic signal. The multilinear techniques used to determine these correlations are described in detail elsewhere [7, 11, 20]. Briefly, even fewer position locations (11-12) than the number used here (18) are sufficient to generate intelligible acoustics entirely from the face. What is crucial to the synthesis, however, is that points from the chin, the lips, and the cheek be used.
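For illustration, a minimal sketch of this kind of estimator in Python, assuming the multilinear mapping of [7, 11, 20] is approximated by an ordinary least-squares regression from flattened marker trajectories to a vector of acoustic parameters; the array shapes, parameter counts, and random data below are hypothetical placeholders, not those of the cited studies.

```python
import numpy as np

def fit_face_to_acoustics(face, acoustics):
    """Least-squares map from flattened marker positions to acoustic parameters.

    face      : (n_frames, 3 * n_markers) marker coordinates per frame
    acoustics : (n_frames, n_params), e.g. RMS amplitude plus spectral coefficients
    """
    X = np.hstack([face, np.ones((face.shape[0], 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(X, acoustics, rcond=None)
    return W

def estimate_acoustics(face, W):
    """Apply the fitted map to new facial motion data."""
    X = np.hstack([face, np.ones((face.shape[0], 1))])
    return X @ W

# Hypothetical data: 18 markers in 3-D over 1000 frames, 13 acoustic parameters.
rng = np.random.default_rng(0)
face = rng.standard_normal((1000, 18 * 3))
acoustics = rng.standard_normal((1000, 13))
W = fit_face_to_acoustics(face, acoustics)
estimated = estimate_acoustics(face, W)
```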
Recovery Of Tongue Positions
The importance of the cheek region can also be
seen in the recovery of vocal tract configurations
from the facial motion data described in the same
studies. When vocal tract (midsagittal lips, tongue, and jaw) and facial data collected on different occasions from the same speaker for the same utterances were aligned, tongue position could be estimated from the facial motion at better than 83% reliability.
Particularly surprising was that the tongue tip could be
recovered at about 90% reliability. Removal of the
cheek positions from the estimation substantially
reduced the strength of the correlation with the
tongue, further demonstrating that the visible correlates of
speech are not restricted to the lips and
chin. It should be noted that from the standpoint of
causality, estimation of vocal tract motion from
facial motion is actually an inversion. The 'forward'
estimation is that of the face from the vocal tract
and has been done for an English speaker at better
than 95% overall [7].
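A sketch of the 'inverse' estimate under the same linear-regression simplification, with reliability taken as the per-coordinate correlation between estimated and measured tongue trajectories; dropping the cheek columns before fitting mimics the comparison described above. All names, column indices, and synthetic data are hypothetical.

```python
import numpy as np

def fit_inverse_map(face, tongue):
    """Least-squares 'inverse' estimate of tongue positions from facial motion."""
    X = np.hstack([face, np.ones((face.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, tongue, rcond=None)
    return W

def reliability(face, tongue, W):
    """Per-coordinate correlation between estimated and measured tongue positions."""
    est = np.hstack([face, np.ones((face.shape[0], 1))]) @ W
    return np.array([np.corrcoef(est[:, i], tongue[:, i])[0, 1]
                     for i in range(tongue.shape[1])])

# Hypothetical aligned data: 18 facial markers (x, y) and 4 tongue points (x, y).
rng = np.random.default_rng(1)
face = rng.standard_normal((1000, 36))
cheek_cols = list(range(24, 32))                  # hypothetical cheek-marker columns
tongue = face[:, 20:30] @ rng.standard_normal((10, 8)) \
         + 0.2 * rng.standard_normal((1000, 8))   # toy dependence incl. cheek columns

r_full = reliability(face, tongue, fit_inverse_map(face, tongue))
face_no_cheek = np.delete(face, cheek_cols, axis=1)
r_trim = reliability(face_no_cheek, tongue, fit_inverse_map(face_no_cheek, tongue))
print(r_full.mean(), r_trim.mean())               # correlation drops without the cheek
```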
Synthesis Of The Tongue Tip
The ability to recover the tongue tip motion
from the face also suggests that a synthetic tongue
tip could be realistically parametrized by the same
facial motion data currently being used to animate
the face. This will be implemented soon, along with upper and lower dental arches.
Access To The Physiology
This speaker and four other speakers of French
and English have been recorded for similar tasks
using unilateral arrays of 11-12 position sensors,
but with the addition of hooked-wire EMG electrodes inserted into 8-9 orofacial muscles on the opposite
side of the face. These studies, which are part of a
long-range study of speech motor control [21-23],
have shown that facial motion can be estimated
from muscle EMG. Using simple autoregressive
models (second-order AR) and a short delay (< 20
ms), facial motions can be estimated at better than
80% reliability [e.g., 11]. In fact, these same data
are used to drive the muscle-based model of Lucero
and colleagues described below [9]. Taken together,
the high correlations among facial and vocal tract
kinematics and orofacial muscle EMG suggest a
single scheme of neuromotor control for the production of
audiovisual behavior [for discussion, see 5].
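A sketch of the second-order AR estimation mentioned above, assuming each facial coordinate is regressed on its own two previous samples plus delayed EMG envelopes and fit by least squares; the frame rate, delay, and synthetic signals are illustrative assumptions rather than the models of [9, 11, 21-23].

```python
import numpy as np

def fit_ar2_emg(face_coord, emg, delay):
    """Fit a second-order AR model of one facial coordinate driven by delayed EMG.

    face_coord : (n_frames,) single marker coordinate
    emg        : (n_frames, n_muscles) rectified, smoothed EMG envelopes
    delay      : EMG-to-motion delay in frames (corresponding to < 20 ms)
    """
    t = np.arange(max(2, delay), len(face_coord))
    X = np.column_stack([face_coord[t - 1], face_coord[t - 2], emg[t - delay]])
    coefs, *_ = np.linalg.lstsq(X, face_coord[t], rcond=None)
    return coefs

def simulate_ar2_emg(coefs, init, emg, delay):
    """Run the fitted model forward from two initial face samples."""
    out = np.zeros(len(emg))
    out[:2] = init
    for t in range(max(2, delay), len(emg)):
        out[t] = (coefs[0] * out[t - 1] + coefs[1] * out[t - 2]
                  + emg[t - delay] @ coefs[2:])
    return out

# Hypothetical data: one facial coordinate, 9 muscles, 2 ms frames -> 16 ms delay.
rng = np.random.default_rng(2)
emg = np.abs(rng.standard_normal((2000, 9)))
face_coord = np.convolve(emg.sum(axis=1), np.ones(25) / 25, mode="same")
coefs = fit_ar2_emg(face_coord, emg, delay=8)
estimated = simulate_ar2_emg(coefs, face_coord[:2], emg, delay=8)
```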
Text-To-Audible-Visible Speech
As an extension of the via point analysis
technique, the facial animation model can be driven
concatenatively from text input using a codebook of phoneme-specific via point arrays [10, 19].
Triphone-sized via point arrays are being extracted from recited sentence data, such as those used in the current analysis, as well as from much larger sets of semi-spontaneous utterances. Preliminary tests have shown that the extracted sets of via point arrays may be used to specify target configurations from text strings. The primary appeal of the via point technique is the suitability of its minimum jerk criterion for describing biological movements, which exhibit inherent smoothness.
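A simplified sketch of how via point arrays might be concatenated into smooth trajectories: quintic minimum-jerk segments with zero velocity and acceleration at each via point, rather than the full via-point optimization of [10, 19]; the phoneme-specific targets below are hypothetical placeholders.

```python
import numpy as np

def min_jerk_segment(x0, x1, n_samples):
    """Minimum-jerk interpolation between two target configurations.

    Quintic polynomial with zero velocity and acceleration at both ends;
    x0 and x1 may be vectors of marker coordinates.
    """
    tau = np.linspace(0.0, 1.0, n_samples)[:, None]
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return (1 - s) * np.asarray(x0) + s * np.asarray(x1)

def trajectory_through_via_points(via_points, n_samples_per_segment):
    """Concatenate minimum-jerk segments through a sequence of via-point arrays."""
    segments = [min_jerk_segment(a, b, n_samples_per_segment)[:-1]  # drop shared endpoint
                for a, b in zip(via_points[:-1], via_points[1:])]
    segments.append(np.asarray(via_points[-1])[None, :])            # keep final target
    return np.vstack(segments)

# Hypothetical phoneme-specific targets for 18 markers in 3-D (54 coordinates each).
rng = np.random.default_rng(3)
via = [rng.standard_normal(54) * 0.1 for _ in range(5)]
traj = trajectory_through_via_points(via, n_samples_per_segment=30)  # (4*29 + 1, 54)
```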