Talking Heads: Speech Production
Measuring and Modeling Speech Production
A Model of the Human Vocal Tract
Although there is great interest in studying the speech production process, some of the methods discussed in the previous section place practical limits on the amount of data that can be gathered and analyzed. In addition, speakers cannot exercise the degree of control over their articulators needed for certain studies of the contributions of individual articulators. Paralleling the method for studying speech production and perception that uses speech synthesized from acoustic parameters as a fundamental tool, we use an articulatory synthesis (ASY) system at Haskins Laboratories that synthesizes speech through control of articulatory instead of acoustic variables (Mermelstein, 1973; Rubin et al., 1981). ASY is designed for studying the linguistically and perceptually significant aspects of articulatory events. It allows the quick modification of a limited set of key parameters that control the positions of the major articulators: the lips, jaw, tongue body, tongue tip, velum, and hyoid bone position (which sets larynx height and pharynx width). The particular set of parameters provides a description of vocal-tract shape, adequate for research purposes, that incorporates both individual articulatory control and linkages among articulators. Additional input parameters include excitation (sound source) and movement timing information. An important aspect of this model's design is that speech sounds for use in perceptual tests can be generated through controlled variations in timing or position parameters. Another very important aspect is that the synthesis procedure is fast enough to make interactive on-line research practical.
Figure 7 shows a midsagittal view of the ASY vocal tract, with the six key articulators labeled. These articulators can be grouped into two major categories: those whose movements are independent of the movements of other articulators (the jaw, velum, and hyoid bone); and those whose movements are dependent on the movements of other articulators (the tongue body, tongue tip, and lips). The articulators of this second group normally move when the jaw moves. In addition, the tongue tip can move relative to the tongue body. Individual gestures can thus be separated into components arising from the combined movement of several articulators. For example, the lip-closing gesture in the production of the utterance /aba/ is a function of the movement of the jaw and the lips. Movements of the jaw and velum have one degree of freedom, while all others have two degrees of freedom. Movement of the velum has two effects: it alters the shape of the oral branch of the vocal tract and, in addition, it changes the size of the coupling port to the fixed nasal tract.
Figure 7. Vocal tract outline with key parameters.
An overview of the steps involved in articulatory synthesis using our model is provided in Figure 8. Once the articulator positions have been specified (see below), the midsagittal outline is determined and can be displayed. Cross-sectional areas are calculated by superposing a grid structure fixed to the maxilla on this outline and computing the points of intersection of the outline and the grid lines. The resolution of this grid is variable, within certain limits. In general, grid lines are set 0.25 cm apart where parallel and at 5° intervals where radial. Sagittal cross-dimensions are calculated and converted to cross-sectional areas, using different formulas for estimating the shape in the pharyngeal, oral and labial regions. (Improvements to the third-dimensional representation of tract shape are expected to be made, based upon data from magnetic resonance imaging (MRI) of the vocal tract [Baer et al., 1991].) The area values are then smoothed and approximated by a sequence of uniform tubes of fixed length (0.875 cm). The number of area values is variable because the overall length of the tract varies with both hyoid height and degree of lip protrusion.
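The last step of this procedure, approximating the smoothed area function by uniform 0.875 cm tubes, can be sketched as follows. This is a minimal illustration rather than the ASY implementation: the function names, the piecewise-linear reading of the area function, the midpoint sampling, and the rounding rule for the number of sections are all assumptions.

```python
SECTION_LENGTH_CM = 0.875  # fixed tube length quoted in the text


def area_at(x, positions, areas):
    """Linearly interpolate the area function at distance x (cm from the glottis)."""
    if x <= positions[0]:
        return areas[0]
    samples = list(zip(positions, areas))
    for (x0, a0), (x1, a1) in zip(samples, samples[1:]):
        if x <= x1:
            return a0 + (a1 - a0) * (x - x0) / (x1 - x0)
    return areas[-1]


def to_uniform_tubes(positions, areas, section_length=SECTION_LENGTH_CM):
    """Approximate an area function by a sequence of equal-length tubes.

    Each tube takes the area interpolated at its midpoint. The number of
    tubes follows from the overall tract length, so it changes when the
    tract lengthens or shortens (assumed rule: nearest whole section).
    """
    tract_length = positions[-1] - positions[0]
    n_sections = max(1, round(tract_length / section_length))
    start = positions[0]
    return [
        area_at(start + (i + 0.5) * section_length, positions, areas)
        for i in range(n_sections)
    ]
```

With these assumptions, a 17.5 cm tract yields 20 sections and a 19.25 cm tract (for example, with protruded lips) yields 22, which mirrors the statement that the number of area values varies with tract length.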
Once the area values have been obtained, the acoustic transfer function is calculated using a technique based on the model of Kelly and Lochbaum (1962), which specifies frequency-independent propagation losses within sections, and reflections at section boundaries. Nonideal terminations at the glottis, lips, and nostrils are accurately modeled. The effects of other conditions, such as yielding vocal-tract walls, are handled more approximately, by introducing lumped-parameter elements at the glottis and within the nasal section.
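In a tube model of the Kelly and Lochbaum type, the reflection at each section boundary follows from the areas on either side of it. The sketch below assumes the common pressure-wave convention r = (A_k - A_{k+1}) / (A_k + A_{k+1}); the sign convention, loss factors, and termination models actually used in ASY are not given here and are not reproduced. The function names and the speed of sound (35,000 cm/s) are assumptions.

```python
def reflection_coefficients(areas):
    """Pressure-wave reflection coefficients at the junctions between tubes.

    areas: cross-sectional areas (cm^2) of adjacent uniform sections,
    ordered from glottis to lips. Returns one coefficient per junction.
    """
    if any(a <= 0 for a in areas):
        # A complete closure (area 0) needs special handling that this
        # sketch does not attempt.
        raise ValueError("areas must be positive")
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


def implied_sample_rate(section_length_cm, speed_of_sound_cm_s=35000.0):
    """Sample rate at which one sample period equals the round-trip delay
    of a single section (a common arrangement, assumed here)."""
    return speed_of_sound_cm_s / (2.0 * section_length_cm)
```

A uniform tube produces no internal reflections (all coefficients are zero), while a narrowing from 4 cm² to 1 cm² gives a coefficient of 0.6. Under the stated assumptions, a 0.875 cm section length corresponds to a 20 kHz sample rate.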
In the interest of computational efficiency, and because the synthesizer was designed as a research tool to provide rapid feedback about changes in articulatory configuration, a number of compromises have been made in the details of the model. For example, acoustic excitation of the vocal tract transfer function is most commonly specified as an acoustic waveform, rather than provided by simulating the physiological and aerodynamic factors of phonation. In this approach, control over the glottal waveshape is limited to two parameters: the open quotient (the duty cycle, i.e., the fraction of each pitch period during which the glottis is open) and the speed quotient (the ratio of rise time to fall time during the open portion) (Rosenberg, 1971). The fricative source is simulated by inserting shaped random noise anterior to the place of maximum constriction in the vocal tract. Acoustic output is obtained by supplying the glottal or fricative excitation as input to the appropriate acoustic transfer function, implemented as a digital filter. The accuracy of the phonatory model can be improved by replacing the present approach for acoustic calculation with a fully aerodynamic simulation that explicitly accounts for the propagation of sound along the tract (McGowan, 1987, 1988). This approach provides a number of benefits, including more accurate simulation of voiced and fricative excitation, interaction of source and tract effects, and more accurate modeling of side branches and losses. However, it can result in slower overall calculation times. Because of such practical considerations, a choice of methods for calculating acoustic output has been implemented in the model.
Specification of the particular values for the key articulators can be provided in a number of ways. In the simplest approach, a display of the midsagittal outline of the vocal tract can be manipulated directly on screen by moving one or more of the key articulators. The tract is then redrawn and areas and spectral values are calculated. This method of manual graphical modification continues until an appropriate shape is achieved. This shape can then be used for static synthesis of a vowel sound. Alternatively, individual shapes can be deposited in a table that will later be used as the basis for a "script" that specifies the kinematics of a particular utterance.
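The two-parameter glottal source can be illustrated with one of the trigonometric pulse shapes described by Rosenberg (1971). This sketch assumes that the open quotient is the open fraction of the pitch period and that the speed quotient is rise time divided by fall time; whether ASY uses this exact pulse shape is not stated, and the function and parameter names are hypothetical.

```python
import math


def rosenberg_pulse(n_samples, open_quotient=0.6, speed_quotient=2.0):
    """One pitch period of a Rosenberg-style glottal flow pulse.

    n_samples      -- samples in the pitch period
    open_quotient  -- fraction of the period during which the glottis is open
    speed_quotient -- rise time divided by fall time within the open phase
    """
    if not 0.0 < open_quotient <= 1.0 or speed_quotient <= 0.0:
        raise ValueError("need 0 < open_quotient <= 1 and speed_quotient > 0")
    open_len = open_quotient * n_samples
    rise = open_len * speed_quotient / (1.0 + speed_quotient)
    fall = open_len - rise
    pulse = []
    for n in range(n_samples):
        if n < rise:
            # Opening phase: raised-cosine rise from 0 to 1.
            value = 0.5 * (1.0 - math.cos(math.pi * n / rise))
        elif n < open_len:
            # Closing phase: quarter-cosine fall from 1 toward 0.
            value = math.cos(0.5 * math.pi * (n - rise) / fall)
        else:
            # Closed phase: no flow.
            value = 0.0
        pulse.append(value)
    return pulse
```

For a 100-sample period with the defaults, the flow rises over 40 samples, falls over 20, and is zero for the remaining 40. Raising the speed quotient skews the pulse toward a slower opening and a more abrupt closure.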
Using the method of kinematic specification, the complex acoustic effects of simple articulatory changes can be illustrated. Figure 9 shows, at the top, midsagittal vocal tract outlines for the major transition points (key frames) in the simulation of the articulations of four utterances: /banana/, /bandana/, /badnana/, /baddata/. In this contrived example, the four utterances are articulated in nearly the same way; the only parameter that varies is the timing of velar movement. The panel at the bottom of Figure 9 shows the size of the velar port opening over time and the contrasting patterns of velar timing for the four utterances. For utterance /banana/, the velum is closed at the start, opens rapidly, and stays open throughout the rest of the utterance. In /bandana/ the pattern of velar opening is similar, except that the velum closes and opens rapidly in the middle of the utterance, during the movement from /n/ to /d/. In /badnana/ the velum stays relatively closed at the beginning of the utterance, opens during the movement from /d/ to /n/, and stays open throughout the rest of the utterance. Finally, in /baddata/ the velum stays relatively closed throughout the utterance. All of these utterances have the same general form: C V CC V C V, in which the initial consonant is /b/, and the vowel pattern is / æ, æ, uh /. With the exception of the velum, all other articulators move in the same manner across the different tokens. Note that the simple change in timing of velar opening in these four tokens results in considerable differences in the identities of the consonants that occur in the middle and near the end of the utterances. Simple changes in articulatory timing can also result in complex acoustic changes. Pseudo-spectrograms are shown for the four utterances. These displays were produced by picking the peaks of the transfer functions for each utterance, and indicating formant amplitude by height of the peak bars. Although these displays show only the coarsest detail, a wide variety of acoustic differences can be seen in the four utterances.
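The key-frame scheme behind this example can be sketched as piecewise-linear interpolation of a single parameter track. The frame times and port-opening values below are invented for illustration (they are not read from Figure 9), time and opening are both normalized to the range 0 to 1, and `interpolate_track` and `VELUM_SCRIPTS` are hypothetical names. The interpolation rule that ASY actually applies between key frames is not described here.

```python
def interpolate_track(key_frames, t):
    """Linearly interpolate a parameter between (time, value) key frames.

    key_frames must be sorted by time; values are held constant before
    the first frame and after the last one.
    """
    t0, v0 = key_frames[0]
    if t <= t0:
        return v0
    for t1, v1 in key_frames[1:]:
        if t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        t0, v0 = t1, v1
    return v0


# Illustrative velar-port scripts: 0.0 = closed, 1.0 = fully open.
# Only the velum timing differs between utterances; every other
# articulator would share one common script.
VELUM_SCRIPTS = {
    "banana": [(0.0, 0.0), (0.15, 0.0), (0.25, 1.0), (1.0, 1.0)],
    "bandana": [(0.0, 0.0), (0.15, 0.0), (0.25, 1.0), (0.40, 1.0),
                (0.45, 0.0), (0.50, 0.0), (0.55, 1.0), (1.0, 1.0)],
    "badnana": [(0.0, 0.0), (0.40, 0.0), (0.50, 1.0), (1.0, 1.0)],
    "baddata": [(0.0, 0.0), (1.0, 0.0)],
}
```

Sampling each track at the synthesis frame rate would give the velar port size for every frame, which in turn sets the coupling to the nasal tract.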
This example illustrates one method, albeit a very schematized one, that can be used in this articulatory synthesis model to provide kinematic specifications. In real speech, the actual production of such utterances is more variable and detailed. If desired, an attempt can be made to simulate these details by varying the model's input parameters pitch pulse by pitch pulse. Alternatively, specifications for vocal tract shape and kinematic trajectories can be calculated by an underlying dynamic model (see Section V, below). In general, this latter approach is the technique most commonly used in our present simulations.
As mentioned above, the vocal tract model needs improvement in a number of ways. In addition to the enhancements that have been made in the aeroacoustic simulation, changes are being made in the articulatory representation. The choice of particular key articulators described above has proven to be too limited. For example, additional tongue shape parameters are needed to simulate tongue bunching and for more accurate control of shape in the pharyngeal region. Also, it would be desirable to be able to fit the ASY model to a variety of head shapes, including those of females, infants, and, potentially, nonhuman primates. For this reason, a configurable version of ASY (CASY) is being developed that has increased flexibility in the internal linkages of the model, the potential for adding new parameters, and a method for fitting the shape to actual X-ray or MRI data and adjusting internal fixed parameters. Finally, research is underway to provide a more complete three-dimensional representation of tract shape. A guiding principle of this approach is that actual physiological measurements and comparisons provide the basis for improvements in both the static and dynamic aspects of the simulation of speech production. In addition to our own interest in more accurate physiological modeling for use in our model of speech production, other researchers have focused on a variety of areas including the control and dimensionality of jaw motion (Flanagan et al., 1990; Ostry & Munhall, 1994; Vatikiotis-Bateson & Ostry, 1995), and the modeling of soft tissue structures, such as the tongue (Wilhelms-Tricarico, 1995) and the lips (Abry & Boë, 1986; Badin et al., 1994; Benoît et al., 1994; Guiard-Marigny et al., in press).