Talking Heads: Speech Production

Measuring and Modeling Speech Production

Gestural Modeling

Modeling the speech production process requires a detailed consideration not only of the static anatomical and physiological aspects of the system, but also of how it changes over time. The speech articulators are continually in motion, producing a varying acoustic stream: the perceiver is sensitive to both the local details of the resulting acoustic pattern and the global characteristics of change (Remez et al., 1981). In general, a greater emphasis has been placed on studying the static rather than the time-varying aspects of speech events. However, there has been a long-standing interest in the underlying gestural basis of speech production and its relationship to perception (Liberman et al., 1967; Mattingly & Liberman, 1969; Liberman & Mattingly, 1985; Fowler, 1995). Some of the techniques described earlier are useful for examining the kinematics of the speech articulators. Theoretical approaches to studying action systems have pointed out the necessity and desirability of examining the dynamical system that underlies kinematic patterns (Bernstein, 1967; Fowler, 1977, 1984; Turvey, 1977; Kelso et al., 1986a; Tuller & Kelso, 1995).

   Over the past several years a computational model has been developed at Haskins Laboratories that combines these intersecting concerns in the form of a tool for representing and testing a variety of theoretical hypotheses about the dynamics of speech gestures and their coordination (Browman et al., 1984; Browman & Goldstein, 1985). This approach merges a phonological model based on gestural structures (Browman & Goldstein, 1986, 1989, 1990, 1992) with an approach called task dynamics (see below) that characterizes speech gestures as coordinated patterns of goal-directed articulator movements. At the heart of both of these approaches is the notion of a gesture, which is considered in this context to be the formation of a constriction in the vocal tract by the organized activity of an articulator or set of articulators. The choice of gestural primitives is based upon observations of functional units in actual production. These models attempt to reconcile the linguistic hypothesis that speech involves an underlying sequence of "abstract, context-independent units, with the empirical observation of context-dependent interleaving of articulatory movements" (Saltzman & Munhall, 1989). The focus is on discovering the regularities of gestural patterning and how they can be specified (see also Perrier et al., 1991; Kröger et al., 1995; Shirai, 1993).

   The computational model has three major components. First, a gesturally based phonological component (the linguistic-gestural model) provides, for a given utterance, a "gestural score" that consists of specifications for dynamic parameters for the set of speech gestures corresponding to the input phonetic string (Browman et al., 1986) and a temporal activation interval for each gesture, indicating its onset and offset times. These intervals are computed from the gesture’s dynamic parameters in combination with a set of phasing principles that serves to specify the temporal patterning among the gestural set (Browman & Goldstein, 1990). Second, the task dynamic model computes coordinated articulator movements from the gestural score in terms appropriate for our particular vocal tract model. Third, the vocal tract model (ASY, described above) computes the speech waveform from these articulatory movements. An example of a gestural score, for the utterance [phAm], can be seen in Figure 10. Shown are the periods of gestural activation (the filled boxes) and trajectories generated during simulations (the solid lines) for the four tract variables (see below) that are controlled in the production of this utterance: velic aperture, tongue body constriction degree, lip aperture, and glottal aperture.
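   To make the notion of a gestural score concrete, the following is a minimal sketch in Python of one way such a score could be represented: a list of gestures, each with a tract variable, a few dynamic parameters, and an activation interval. The class name `Gesture`, the helper `active_gestures`, and every numeric value are hypothetical and chosen only for illustration; they are not taken from the Haskins model or from Figure 10.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gesture:
    """One entry of a gestural score (hypothetical representation)."""
    tract_variable: str   # e.g. "lip_aperture"
    target: float         # constriction target, arbitrary units (invented)
    stiffness: float      # dynamic parameter, arbitrary units (invented)
    onset_ms: float       # start of the activation interval
    offset_ms: float      # end of the activation interval

    def is_active(self, t_ms: float) -> bool:
        # A gesture is active from its onset up to, but not including, its offset.
        return self.onset_ms <= t_ms < self.offset_ms


def active_gestures(score: List[Gesture], t_ms: float) -> List[str]:
    """Return the tract variables of all gestures active at time t_ms."""
    return [g.tract_variable for g in score if g.is_active(t_ms)]


# Illustrative score loosely patterned on a stop-vowel-nasal utterance.
# All timings and parameters are invented, not read off Figure 10.
example_score = [
    Gesture("lip_aperture", target=0.0, stiffness=400.0, onset_ms=0, offset_ms=120),
    Gesture("glottal_aperture", target=5.0, stiffness=400.0, onset_ms=20, offset_ms=180),
    Gesture("tongue_body_constriction_degree", target=8.0, stiffness=100.0,
            onset_ms=60, offset_ms=330),
    Gesture("velic_aperture", target=3.0, stiffness=400.0, onset_ms=250, offset_ms=420),
    Gesture("lip_aperture", target=0.0, stiffness=400.0, onset_ms=300, offset_ms=420),
]

# Which gestures overlap (are coproduced) at 100 ms?
coproduced_at_100 = active_gestures(example_score, 100.0)
```

   Keeping the activation intervals as explicit data mirrors the way the score supplies timing to the rest of the model; in the actual system these intervals are derived from phasing principles rather than entered by hand.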

   The task dynamic model used in this computational system has proved useful for describing the sensorimotor control and coordination of skilled activities of the limbs, as well as the speech articulators (Kelso et al., 1985; Saltzman, 1986; Saltzman & Kelso, 1987; Saltzman et al., 1987; Saltzman et al., 1988a, 1988b; Saltzman & Munhall, 1989; Fowler & Saltzman, 1993). For a given gesture, the goal is specified in terms of independent task dimensions, called tract variables. Each tract variable is associated with the specific set of articulators whose movements determine the value of that variable. For example, one such tract variable is Lip Aperture (LA), corresponding to the vertical distance between the two lips. Three articulators can contribute to changing LA: the jaw, the upper lip, and the lower lip. The standard set of tract variables in the computational model, and their associated articulators, can be seen in Figure 11. Recently, this set has been extended by the incorporation of aerodynamic and laryngeal components to make a more realistic model of source control (McGowan & Saltzman, 1995). Tract variables and articulators compose two sets of coordinates for gestural control in the model. In addition, each gesture is associated with its own activation coordinate, whose value reflects the strength with which the associated gesture "attempts" to shape vocal tract movements at any given point in time. Invariant gestural units are posited in the form of context-independent sets of dynamical parameters (e.g., lip protrusion target, stiffness, and damping coefficients), and are associated with corresponding subsets of all three coordinate systems. Thus, the tract-variable and model articulator coordinates of each unit specify, respectively, the particular vocal tract constriction (e.g., bilabial) and the articulatory synergy that is affected directly by the associated unit’s activation. Currently, the model offers an intrinsically dynamic account of interarticulator coordination within the time span of single and temporally overlapping (coproduced) gestures, under normal conditions as well as in response to mechanical perturbations delivered to the articulators.
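   The role of target, stiffness, and damping parameters can be illustrated with a small simulation. The sketch below assumes that a single tract variable (here, Lip Aperture) behaves as a unit-mass, critically damped spring that is pulled toward the gesture's target while the gesture is active and toward a rest value otherwise. The binary activation, the rest value, the function name `simulate_tract_variable`, and all numbers are simplifying assumptions made for illustration; the sketch does not model how the movement is distributed over the jaw and lips, and it is not the implementation used in the computational model described here.

```python
import math
from typing import List


def simulate_tract_variable(rest: float, target: float, stiffness: float,
                            onset: float, offset: float,
                            duration: float, dt: float = 0.001) -> List[float]:
    """Simulate one tract variable as a critically damped spring (assumed form).

    While the gesture is active (onset <= t < offset) the attractor sits at
    `target`; otherwise it sits at `rest`. Times are in seconds.
    Returns the value of the tract variable at each time step.
    """
    # Critical damping for unit mass: approach the target without oscillating.
    damping = 2.0 * math.sqrt(stiffness)
    z, v = rest, 0.0
    trajectory = []
    steps = int(round(duration / dt))
    for i in range(steps):
        t = i * dt
        trajectory.append(z)
        # Binary activation (assumption): the gesture is fully on or fully off.
        goal = target if onset <= t < offset else rest
        accel = -stiffness * (z - goal) - damping * v
        # Semi-implicit Euler step: update velocity first, then position.
        v += accel * dt
        z += v * dt
    return trajectory


# Hypothetical bilabial closing gesture: lips start 10 units apart,
# close toward 0 while active from 0.1 s to 0.4 s, then reopen.
lip_aperture = simulate_tract_variable(rest=10.0, target=0.0, stiffness=400.0,
                                       onset=0.1, offset=0.4, duration=0.8)
```

   Because the parameters are fixed per gesture and only the activation interval changes, the same parameter set yields different trajectories depending on where the tract variable is when the gesture becomes active, which is one way to read the claim that invariant units give rise to context-dependent movement.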

   At the present stage of development, the task-dynamic model does not provide a dynamic account of intergestural timing patterns even for simple speech sequences. Current simulations rely on explicit gestural scores to provide the timing patterns for gestural activation intervals in simulated utterances. While such explicitness facilitates research by enabling us to model and test our current hypotheses about linguistically significant gestural coordination, an approach in which temporally ordered activation patterns are derived as implicit consequences of an intrinsic serial dynamics would provide an important step in modeling processes of intergestural relative timing. Recent computational modeling of connectionist dynamical systems has investigated the control of sequences (e.g., Grossberg, 1986; Tank & Hopfield, 1987; Jordan, 1989, 1990; Kawato, 1989, 1991; Kawato et al., 1990). Such serial dynamics are well suited for orchestrating the temporal activation patterns of gestural units in a dynamical model of speech production.
