Talking Heads: Speech Production

Measuring and Modeling Speech Production

State of the Art

   Connectionist models are also being applied to the mapping between acoustics and articulation. Investigators have long attempted to derive information about underlying articulation directly from the acoustic signal (Schroeder, 1967; Atal et al., 1978; Wakita, 1979; Levinson & Schmidt, 1983; Kuc et al., 1985; Shirai & Kobayashi, 1986; Sondhi & Schroeter, 1987; Larar et al., 1988; Boë et al., 1992; Schroeter & Sondhi, 1992; Beautemps et al., 1995; Badin et al., 1995). A variety of connectionist or neural network methods are being used to explore this mapping (e.g., Kawato, 1989; Jordan, 1989, 1990; Rahim & Goodyear, 1990; Bailly et al., 1990; Bailly et al., 1991; Shirai & Kobayashi, 1991; Papçun et al., 1992). One example is the work of Rahim and Goodyear (1990), in which a multilayer perceptron was trained to map the power spectra of vowels and consonants onto the parameters of a vocal tract model whose shape is specified by the areas of a fixed number of acoustic tube sections. For each member of the training set, the input acoustic data consisted of 34 samples of the log power spectrum (from 100 Hz to 4000 Hz). In an analysis-by-synthesis approach, a first-order gradient descent procedure adjusted the area values to minimize the spectral error between the target and synthesized spectra. This mapping technique provided an efficient method for deriving vocal tract synthesis values directly from acoustic data.
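
   The analysis-by-synthesis step can be illustrated with a minimal sketch. It is not Rahim and Goodyear's implementation: the synthesizer below is a simplified lossless-tube stand-in (reflection coefficients from adjacent section areas, converted to an all-pole filter by the step-up recursion), the gradient is estimated by finite differences, and the step size is halved whenever a step fails to reduce the error. The function names, the six-section tube, the sign convention for the reflection coefficients, and the 34 evenly spaced frequency bins (which are not mapped to 100-4000 Hz) are all assumptions made for illustration.

```python
import cmath
import math

def areas_to_allpole(areas):
    """Tube section areas -> all-pole filter coefficients (step-up recursion).
    Assumed convention: k = (A[i+1] - A[i]) / (A[i+1] + A[i])."""
    a = [1.0]
    for i in range(len(areas) - 1):
        k = (areas[i + 1] - areas[i]) / (areas[i + 1] + areas[i])
        prev = a + [0.0]
        order = len(prev) - 1
        a = [prev[j] + k * prev[order - j] for j in range(order + 1)]
    return a

def log_spectrum(areas, n_bins=34):
    """Log power spectrum (dB) of the all-pole filter at n_bins frequencies."""
    a = areas_to_allpole(areas)
    spectrum = []
    for b in range(n_bins):
        w = math.pi * (b + 0.5) / n_bins
        response = sum(c * cmath.exp(-1j * w * n) for n, c in enumerate(a))
        spectrum.append(-20.0 * math.log10(abs(response)))
    return spectrum

def spectral_error(log_areas, target):
    """Mean squared dB difference between synthesized and target spectra."""
    spec = log_spectrum([math.exp(v) for v in log_areas], len(target))
    return sum((s - t) ** 2 for s, t in zip(spec, target)) / len(target)

def fit_areas(target, n_sections=6, steps=200, lr=0.05, eps=1e-4):
    """First-order descent on log-areas (keeps areas positive).
    Returns (areas, initial_error, final_error)."""
    log_areas = [0.0] * n_sections          # start from a uniform tube
    err = initial = spectral_error(log_areas, target)
    for _ in range(steps):
        grad = []
        for i in range(n_sections):         # finite-difference gradient
            bumped = list(log_areas)
            bumped[i] += eps
            grad.append((spectral_error(bumped, target) - err) / eps)
        trial = [v - lr * g for v, g in zip(log_areas, grad)]
        trial_err = spectral_error(trial, target)
        if trial_err < err:                 # accept only improving steps
            log_areas, err = trial, trial_err
            lr *= 1.2
        else:
            lr *= 0.5
    return [math.exp(v) for v in log_areas], initial, err

if __name__ == "__main__":
    target = log_spectrum([1.0, 2.0, 0.5, 3.0, 1.5, 4.0])
    areas, before, after = fit_areas(target)
    print("spectral error before/after:", before, after)
    print("recovered areas:", areas)
```

   In this toy synthesizer the spectrum depends only on the ratios of adjacent areas, so the recovered area function is determined at best up to an overall scale. This is one small instance of the non-uniqueness that makes the acoustic-to-articulatory mapping difficult.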

   In a related approach, Bailly and colleagues (Bailly et al., 1990; Bailly et al., 1991) proposed a method of control for an articulatory synthesis model (Maeda, 1979) based on the optimization approach for motor skill learning developed by Jordan (1988, 1989, 1990) (see also Rahim et al., 1993). Their modified version of Jordan’s sequential network is combined with constraints arising from the kinematic properties of the biological system being controlled and from the phonological task being simulated. These constraints restrict the possible solutions that the feedforward multilayer perceptron can use to model the mapping between the production of vocalic gestures and the trajectories of the first three formants. An additional feature of the approach is that it can generalize its movement patterns by interpolating new trajectories from those it has already learned.

   Similarly, at Haskins a number of approaches are being used to model the development of the connection between articulation and acoustics. The interest is in studying how a dynamically changing vocal tract can have its space of potential movements constrained by "learning" the relationship between acoustics and the tract variables and/or gestural scores – a process that is possibly similar to the exploratory activities that infants engage in. Examples include work by McGowan (1994, 1995), who has used genetic algorithms in conjunction with the task-dynamic model to recover articulatory movements from formant frequency trajectories, and by Hogden and colleagues (Hogden et al., 1993; Hogden et al., in press), who use continuity constraints in the process of recovering the relative positions of simulated articulators from speech signals generated by articulatory synthesis.
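
   The genetic-algorithm idea can be sketched in a few lines. This is only an illustration of the search strategy, not McGowan's method: the task-dynamic model is replaced by a hypothetical two-parameter map (`toy_formants`) from made-up "tract variables" to two formant frequencies, and a single formant frame is matched rather than a trajectory. All names, constants, and the population settings are assumptions.

```python
import random

def toy_formants(params):
    """Hypothetical smooth map from two 'tract variables' in [0, 1]
    to two formant frequencies in Hz (a stand-in for a real synthesizer)."""
    opening, place = params
    f1 = 300.0 + 500.0 * opening
    f2 = 900.0 + 1400.0 * place - 300.0 * opening
    return (f1, f2)

def error(params, target):
    """Squared formant mismatch between a candidate and the target."""
    return sum((f - t) ** 2 for f, t in zip(toy_formants(params), target))

def recover(target, pop_size=30, generations=40, seed=0):
    """Evolve candidate articulations toward the target formants.
    Returns (best_params, best_error_per_generation)."""
    rng = random.Random(seed)
    pop = [[rng.random(), rng.random()] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda p: error(p, target))
        history.append(error(pop[0], target))
        elite = pop[: pop_size // 3]        # survivors, kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            mom, dad = rng.sample(elite, 2)
            # uniform crossover: each gene comes from one parent
            child = [rng.choice(pair) for pair in zip(mom, dad)]
            # Gaussian mutation, clipped to the valid range
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, 0.05)))
                     for g in child]
            children.append(child)
        pop = elite + children
    pop.sort(key=lambda p: error(p, target))
    history.append(error(pop[0], target))
    return pop[0], history

if __name__ == "__main__":
    best, history = recover((650.0, 1500.0))
    print("recovered parameters:", best)
    print("error first/last generation:", history[0], history[-1])
```

   Because the elite candidates are carried over unchanged, the best error never increases from one generation to the next. A trajectory-based version would score each candidate over a sequence of frames, and could add a penalty on frame-to-frame jumps in the spirit of the continuity constraints mentioned above.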

   Finally, connectionist models may be particularly well suited to examining directly the dynamical properties of the musculoskeletal system. Previous efforts to characterize the mapping between motor commands to muscles and the resulting behavior of speech articulators have been severely hampered in at least three ways. First, the muscles associated with speech articulation generally are either small and highly interconnected with one another – e.g., tongue muscles – or they are hard to monitor safely – e.g., masseter, the large jaw raising muscle. Thus, it is difficult to ascertain the muscle sources of electromyographic (EMG) records, which are themselves very complex. Second, despite the use of signal conditioning and numerical techniques such as signal rectification and integration, smoothing (low-pass filtering), and ensemble averaging of multiple trials, identification of "key" events has been restricted to visually observable landmarks in the signal, such as the onset or peak of EMG activity. Third, interpretation of this restricted set of events has relied primarily on statistical analysis of highly variable mean values, which must then be reliably correlated with other arbitrarily chosen, discrete events in the articulator movement behavior.
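
   The conventional conditioning chain described above (rectification, smoothing, ensemble averaging, then landmark picking) can be sketched as follows. The moving-average filter, the window length, and the half-peak onset criterion are assumptions chosen for brevity; real EMG work would normally use a properly designed low-pass filter and a carefully justified onset rule.

```python
def rectify(signal):
    """Full-wave rectification: absolute value of each sample."""
    return [abs(s) for s in signal]

def moving_average(signal, window=3):
    """Centered moving average as a crude low-pass filter.
    Near the edges, only the samples that exist are averaged."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def ensemble_average(trials):
    """Sample-by-sample mean across time-aligned trials."""
    n = len(trials)
    return [sum(samples) / n for samples in zip(*trials)]

def emg_envelope(trials, window=3):
    """Rectify and smooth each trial, then average across trials."""
    return ensemble_average(
        [moving_average(rectify(t), window) for t in trials]
    )

def onset_index(envelope, fraction=0.5):
    """First sample at or above a fraction of the peak (assumed criterion)."""
    threshold = fraction * max(envelope)
    for i, value in enumerate(envelope):
        if value >= threshold:
            return i
    return None

def peak_index(envelope):
    """Index of the largest envelope sample."""
    return max(range(len(envelope)), key=lambda i: envelope[i])

if __name__ == "__main__":
    trials = [[0, -1, 2, -3, 0], [0, 1, -2, 3, 0]]
    env = emg_envelope(trials)
    print("envelope:", env)
    print("onset:", onset_index(env), "peak:", peak_index(env))
```

   The sketch makes the limitation concrete: the whole time series is reduced to one or two indices (`onset_index`, `peak_index`), and everything else in the signal is discarded before any comparison with articulator movement is made.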

   In training a neural network to obtain a forward dynamics model of the articulators, for example, the entire EMG signal can be used as the "motor command input," not just those events that stand out visually on a display screen. Hirayama and colleagues (Hirayama et al., 1992, 1993, 1994; Vatikiotis-Bateson et al., 1991) have used real physiological data – articulator movements and EMG from muscle activity – to develop a preliminary model of speech production based on the articulatory system’s dynamical properties. A neural network, using the EMG data, learned the forward dynamics model of the articulators. After training, a recurrent network predicted articulator trajectories using the EMG signals as the motor command input, and perturbation simulations were used to assess the properties of the acquired model. Such modeling implicitly assumes a causal link between muscle and movement behavior, as opposed to the null hypothesis of traditional statistical analysis. Because the goal of the network is to formalize or "learn" that link, any degree of correlation between muscle and articulator behavior is useful in determining the proper coefficients or "weights" of the model equation. In a different but related approach, Wada and colleagues (Wada & Kawato, 1995; Wada et al., 1995) demonstrate the tight coupling between pattern formation and pattern recognition. They have developed a computational theory of movement pattern recognition, based upon a theory for optimal movement pattern generation, with examples from cursive handwriting and the estimation of phonetic timing in natural speech.
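
   The forward-dynamics idea (learn acceleration from current state plus the full EMG signal, then run the learned model recurrently) can be sketched with a deliberately simplified stand-in. This is not the network of Hirayama and colleagues: a single linear unit fitted by least squares replaces the neural network, the "articulator" is a hypothetical damped spring, and the EMG is synthetic. Every name and constant below is an assumption made for illustration.

```python
import math

def synthetic_emg(n):
    """Made-up rectified 'EMG' drive, one value per time step."""
    return [abs(math.sin(0.07 * t)) + 0.5 * abs(math.sin(0.23 * t))
            for t in range(n)]

def simulate(emg, dt=0.01, k=40.0, b=6.0, g=25.0):
    """Hypothetical 'true' articulator: damped spring driven by EMG.
    Returns position, velocity, and acceleration at each step."""
    x, v = 0.0, 0.0
    xs, vs, accs = [], [], []
    for u in emg:
        acc = -k * x - b * v + g * u
        xs.append(x)
        vs.append(v)
        accs.append(acc)
        v += acc * dt
        x += v * dt
    return xs, vs, accs

def solve3(m, y):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    a = [row[:] + [yi] for row, yi in zip(m, y)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    w = [0.0, 0.0, 0.0]
    for r in reversed(range(3)):
        tail = sum(a[r][c] * w[c] for c in range(r + 1, 3))
        w[r] = (a[r][3] - tail) / a[r][r]
    return w

def fit_forward_model(xs, vs, emg, accs):
    """Least-squares weights for acc ~ w0*x + w1*v + w2*emg."""
    feats = list(zip(xs, vs, emg))
    gram = [[sum(f[i] * f[j] for f in feats) for j in range(3)]
            for i in range(3)]
    rhs = [sum(f[i] * a for f, a in zip(feats, accs)) for i in range(3)]
    return solve3(gram, rhs)

def rollout(weights, emg, dt=0.01):
    """Run the learned model recurrently: its own predicted state is fed
    back at each step, and only the EMG comes from outside."""
    x, v = 0.0, 0.0
    xs = []
    for u in emg:
        xs.append(x)
        acc = weights[0] * x + weights[1] * v + weights[2] * u
        v += acc * dt
        x += v * dt
    return xs

if __name__ == "__main__":
    emg = synthetic_emg(500)
    xs, vs, accs = simulate(emg)
    weights = fit_forward_model(xs, vs, emg, accs)
    predicted = rollout(weights, emg)
    print("learned weights:", weights)
    print("max trajectory error:",
          max(abs(p - x) for p, x in zip(predicted, xs)))
```

   Note that every sample of the EMG record contributes to the fit, not just an onset or a peak. A perturbation simulation in this setting would amount to calling `rollout` with a modified `emg` sequence and comparing the resulting trajectory with the unperturbed one.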
