Linguistic Gestural Model

The Linguistic Gestural Model represents the linguistic phonetic structure of any English utterance as an organized pattern of gestures; from such a pattern, movements of simulated vocal tract articulators can be computed, an acoustic waveform can be synthesized from those movements, and the result can be played to listeners. Thus, it is possible to generate stimuli for experiments with known gestural and (model) articulatory properties.

There are three major components in the computational model.

(1) The linguistic gestural model takes an arbitrary ARPAbet string as input and computes a gestural score: a representation of the gestures that are employed in that utterance, and of how they are arrayed in time. Each gesture is an abstract characterization of coordinated, task-directed vocal-tract actions (phonetic gestures); it is represented as a set of parameter values for a "task" dynamic control regime (Saltzman, 1986) and an activation interval during which that regime molds the behavior of the vocal tract (a schematic sketch of this representation follows the list below).

(2) The task dynamic model (Saltzman, 1986; Saltzman & Kelso, 1987) calculates the coordinated articulator motions that unfold over time in response to the active set of dynamic control regimes. Within the model, context effects on a gesture's movements emerge automatically from the interaction (or blending) of concurrently active (invariant) gestural regimes.

(3) The vocal tract model (ASY) computes the speech waveform from the articulator movements, using a software articulatory synthesizer (Rubin, Baer, & Mermelstein, 1981).
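To make the representations in (1) and (2) concrete, the following Python sketch shows one way a gesture and a gestural score might be encoded, and one simple way concurrently active regimes could be blended. The tract-variable names, parameter values, and the averaging rule are assumptions chosen for illustration, not the model's actual descriptors or blending scheme.

    from dataclasses import dataclass

    @dataclass
    class Gesture:
        tract_variable: str   # e.g. "lip aperture" (hypothetical label)
        target: float         # rest position of the control regime
        stiffness: float      # sets the gesture's inherent time constant
        damping_ratio: float  # 1.0 = critically damped
        activation: tuple     # (onset_ms, offset_ms) of the activation interval

    # A toy gestural score: a bilabial closure gesture followed by an
    # overlapping vowel gesture (all values illustrative).
    score = [
        Gesture("lip aperture", target=-2.0, stiffness=8.0,
                damping_ratio=1.0, activation=(0.0, 120.0)),
        Gesture("tongue body constriction degree", target=10.0, stiffness=4.0,
                damping_ratio=1.0, activation=(60.0, 360.0)),
    ]

    def blended_target(gestures, tract_variable, t):
        # Average the targets of all gestures active at time t that control
        # the same tract variable; if only one regime is active, its own
        # target is returned unchanged.
        active = [g.target for g in gestures
                  if g.tract_variable == tract_variable
                  and g.activation[0] <= t <= g.activation[1]]
        return sum(active) / len(active) if active else None

During the interval where the two activations overlap (60-120 ms here), the articulators serving both tract variables are driven concurrently; it is this overlap of active regimes that, in the task dynamic model, gives rise to contextual effects on a gesture's movements.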

The linguistic gestural model includes (i) a lexicon that lists the sets of gestures corresponding to the syllables of English, represented in terms of gestural descriptors, (ii) an interpreter that allows rules to be written for assigning dynamic parameters to the gestures (based on their descriptors) and for organizing the gestures into gestural scores, and (iii) programs that apply the actions specified by the rules. We have based our lexicon on the set of English demisyllables prepared by Lovins, Macchi, & Fujimura (1979) and so, in principle, can generate any utterance in English.
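As a rough illustration of (i) and (ii), a lexical entry could pair a demisyllable with a set of gestural descriptors, and an interpreter rule could map those descriptors onto dynamic parameters. The descriptor names, the entry for "ba", and the parameter values below are hypothetical, intended only to show the shape such an entry and rule might take.

    # Hypothetical lexical entry for the demisyllable "ba" (descriptor names
    # and values are illustrative, not the model's actual inventory).
    lexicon = {
        "ba": [
            {"tract_variable": "lip aperture", "constriction": "closed"},
            {"tract_variable": "tongue body constriction degree",
             "constriction": "wide"},
        ],
    }

    def assign_dynamics(descriptor, vowel_stiffness=4.0):
        # A toy dynamic rule: derive control-regime parameters from a
        # gestural descriptor.  Closing (consonantal) gestures get twice the
        # vowel stiffness, echoing the dynamic rules described below.
        is_consonantal = descriptor["constriction"] == "closed"
        return {
            "stiffness": 2 * vowel_stiffness if is_consonantal else vowel_stiffness,
            "damping_ratio": 1.0,                       # critically damped
            "target": 0.0 if is_consonantal else 10.0,  # placeholder targets
        }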

The rules are of two types, dynamic and phasing. The dynamic rules specify the values of the control regime parameters for damping, stiffness, and rest position (target) for each gesture. Gestural stiffnesses are based on analyses of articulatory data. Typically, consonants are assigned twice the stiffness of vowels. These stiffnesses are then decreased for primary stressed syllables, and increased for reduced syllables. At present, all gestures are critically damped (except for glottal opening-and-closing gestures, which are tentatively considered to be undamped). The targets are determined empirically on the basis of perceptual evaluation of the vocal tract model's output.
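Treating each gesture as a damped mass-spring system makes these rules easy to state numerically: once a stiffness is chosen, critical damping fixes the damping coefficient, and the stress adjustments simply scale the stiffness. The specific numbers and scaling factors below are assumptions for illustration only.

    import math

    def control_regime(target, stiffness, mass=1.0, damping_ratio=1.0):
        # Parameters of a second-order regime  m*x'' + b*x' + k*(x - x0) = 0.
        # Critical damping (ratio 1.0) gives b = 2*sqrt(m*k), so the
        # articulator approaches its target without oscillating.
        b = 2.0 * damping_ratio * math.sqrt(mass * stiffness)
        return {"target": target, "stiffness": stiffness, "damping": b}

    # Illustrative stiffness assignment following the rules above: consonants
    # get twice the vowel stiffness; stress lowers stiffness (slower, longer
    # movements) and reduction raises it.  The 0.7 and 1.4 factors are
    # assumed, not taken from the model.
    vowel_k = 4.0
    consonant_k = 2 * vowel_k
    stressed_vowel_k = 0.7 * vowel_k
    reduced_vowel_k = 1.4 * vowel_k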

The phasing rules use the concept of phase (Kelso & Tuller, 1987) to specify the relationships among individual gestures. They draw on the data analyses described in Browman & Goldstein (1987). Our strategy has been to develop rules for phasing oral consonant and vowel gestures with respect to one another; velic and glottal gestures are then phased relative to the oral gestures. The rules are based on articulatory analyses of simple utterances with bilabial consonants and vowels only, but they generalize successfully to a wide variety of other utterances. Specifically, a sequence of oral consonantal gestures is phased so that the target offset of one gesture coincides with the onset of the following gesture; (full) vowel gestures begin when the target of the first consonant in a preceding consonant cluster is achieved. Given the phasing among gestures and the inherent time constant of each gesture (set by its stiffness), activation intervals are computed for each gesture: the beginning and ending points in time during which the control regime influences the vocal tract articulators.
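One way to picture the computation of activation intervals: a gesture's stiffness fixes its natural period, a phase value picks out a point within that period, and gestures are aligned so that the designated phase points coincide. The phase angles and the undamped-period formula in the sketch below are assumptions used only to show the arithmetic, not the model's actual phasing values.

    import math

    def natural_period(stiffness, mass=1.0):
        # Undamped natural period T = 2*pi*sqrt(m/k): a stiffer gesture has a
        # shorter period, hence faster movement and a shorter activation.
        return 2 * math.pi * math.sqrt(mass / stiffness)

    def phase_to_time(phase_deg, stiffness):
        # Convert a phase angle within a gesture's cycle to time since onset.
        return (phase_deg / 360.0) * natural_period(stiffness)

    # Toy sequencing for a C1 C2 V string: C2 begins at C1's target offset,
    # and the vowel begins when C1's target is achieved (phase angles assumed).
    c_k, v_k = 8.0, 4.0
    c1_onset = 0.0
    c1_target_achieved = c1_onset + phase_to_time(240.0, c_k)
    c2_onset = c1_onset + phase_to_time(290.0, c_k)   # C1 target offset
    v_onset = c1_target_achieved                      # vowel phased to C1 target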
