Talking Heads: Speech Production
|
Measuring and Analyzing Speech Production
Experimental and analytic techniques
A recurrent belief is that what the listener extracts from the speech signal may be information about the production process itself, a process that, despite small differences in detail, is anatomically and physiologically constrained in the same way for the entire species (Liberman et al., 1967; Mattingly & Liberman, 1969; Fowler et al., 1980; Browman et al., 1984; Liberman & Mattingly, 1985; Mattingly & Liberman, 1988; Browman & Goldstein, 1992; Fowler, 1995). If the invariant aspects of production can be detected using analytic techniques, they could be used to determine the cognitive and neurophysiological underpinnings of speech behavior as well as the articulatory-to-acoustic mapping. Furthermore, it seems increasingly likely that the anatomical and physiological constraints on speech production, while allowing extremely complex coordinated behaviors and acoustic output, may be quite similar in form to the constraints on other biological behaviors. For example, the attempt to model the motion of the speech articulators may benefit from efforts to model other movements, such as locomotion, arm movement, and posture control.
Articulatory research of the past two decades has sought to describe the correspondence between phonetic units and the spatiotemporal behavior of the various speech articulators, including the attendant muscle activity and the regulation of air supply and airflow through the larynx and, to a lesser extent, throughout the vocal tract. Whatever the actual subject of study, experiments have generally been designed to examine the behavior of either a single articulator or a set of articulators over time and across different perturbing contexts (e.g., different vowel and consonant combinations, different speaking rates and intonation patterns, or experimentally controlled mechanical perturbations to the articulators). The experimental observables in such studies include muscular activity, articulator motion and configuration, and the resultant acoustics. The specific articulatory structures and combinations observed have been determined primarily by technical limitations on data acquisition and analysis. Recent developments have facilitated collecting data simultaneously from many sources, as outlined in Figure 4 (see also Borden & Harris, 1984; Baken, 1987; Fujimura, 1988, 1990; Hardcastle & Marchal, 1990; Kent et al., 1991; Bell-Berti & Raphael, 1995). Although by no means exhaustive, Figure 4 provides an overview of the various transduction devices and techniques that have been used to investigate vocal tract structures during speech production.
Only a few of the structures involved in speech and nonspeech vocal production are readily accessible to non-invasive external view. Motion of the lips and jaw can be transduced optoelectronically (Sonoda & Wanishi, 1982; Harrington et al., 1995; Vatikiotis-Bateson & Ostry, 1995) or using strain gauges (Abbs & Gilbert, 1973). Measurements of the lips, jaw, and other facial features can also be made from video or film analysis; examples in non-human vocal production can be found in the work of Hauser and colleagues (Hauser et al., 1993; Hauser & Schön Ybarra, 1994), where frame-by-frame video analysis was used to explore the role of mandibular position and lip configuration in rhesus monkey call productions. In human speech production, a number of other non-invasive techniques have been used.
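To make the marker-based approach concrete, the sketch below shows one way such flesh-point data might be reduced to simple articulatory measures (lip aperture and jaw velocity). It is a minimal illustration only: the sampling rate, marker trajectories, and variable names are hypothetical stand-ins, not values or conventions from any of the systems cited above.

```python
# Minimal sketch (hypothetical data): reducing optoelectronically tracked
# 3D marker positions for the lips and jaw to articulatory measures.
import numpy as np

FS = 500.0                 # assumed sampling rate of the position-sensing device (Hz)
n = 1000                   # 2 s of fabricated data

t = np.arange(n) / FS
# Fabricated marker trajectories (metres), stand-ins for real recordings:
upper_lip = np.column_stack([np.zeros(n), np.zeros(n),  0.010 + 0.002 * np.sin(2 * np.pi * 3 * t)])
lower_lip = np.column_stack([np.zeros(n), np.zeros(n), -0.010 - 0.004 * np.sin(2 * np.pi * 3 * t)])
jaw       = np.column_stack([np.zeros(n), np.zeros(n), -0.030 - 0.005 * np.sin(2 * np.pi * 3 * t)])

# Lip aperture: Euclidean distance between the upper- and lower-lip markers.
aperture = np.linalg.norm(upper_lip - lower_lip, axis=1)

# Jaw displacement and velocity along the vertical (z) axis.
jaw_z = jaw[:, 2]
jaw_velocity = np.gradient(jaw_z, 1.0 / FS)

print(f"aperture range: {aperture.min() * 1000:.1f}-{aperture.max() * 1000:.1f} mm")
print(f"peak jaw speed: {np.abs(jaw_velocity).max() * 1000:.1f} mm/s")
```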
Air flow at the external boundaries of the nasal and oral cavities may be recorded using a Rothenberg mask (Rothenberg, 1977). Glottal waveforms can be sensed externally using an electroglottograph (Fourcin, 1974, 1981; Kelman, 1981; Rothenberg, 1981) or an accelerometer (Sugimoto & Hiki, 1962; Askenfelt et al., 1980) strapped onto the neck in the vicinity of the thyroid cartilage. Lung volume can be measured using a spirometer (Beckett, 1971) or a body plethysmograph (Hixon, 1972), while the contributions of the ribcage and abdominal cavities to lung volume change can be evaluated using magnetometers (Mead et al., 1967), mercury strain gauges (Baken & Matz, 1973), and inductive plethysmographs (Sackner, 1980).
The other techniques shown in the figure are all invasive to some extent and require the cooperation of the subject, particularly with the placement of transduction devices and sensors. Using a flexible fiber-optic endoscope, the larynx can be illuminated for video and photoglottographic recording of the laryngeal structures and transglottal area (Sawashima et al., 1970; Fujimura, 1977; Sawashima, 1977; Fujimura et al., 1979). It is also possible to place miniature pressure transducers above and below the glottis for measurement of supra- and subglottal pressure (Cranen & Boves, 1985; Gelfer et al., 1987).
Observation of the tongue, the most versatile and complex speech articulator, has been the most difficult of all. Optoelectronics, electromagnetic inductance, ultrasound, and X-ray imaging are the methods currently available to transduce various aspects of tongue movement. For example, photoglossometry is an optoelectronic technique that measures, by reflection, the distance between the tongue surface and points on the hard palate (Chuang & Wang, 1975). Electropalatography (Hardcastle, 1972; Fletcher et al., 1975; Recasens, 1984; Hardcastle et al., 1991) measures the pattern of contact between the tongue and points on the hard palate. The X-ray microbeam system tracks the sagittal position of radio-opaque pellets on the surfaces of the tongue, lips, and jaw (Kiritani et al., 1975; Nadler & Abbs, 1988; Westbury, 1994). Electromagnetic techniques (e.g., magnetometers) can be used to recover similar information through transduction of field fluctuations at multiple points on the various articulator surfaces (Hixon, 1971a, 1971b; Schönle et al., 1987; Perkell et al., 1988, 1992; Tuller et al., 1990; Löfqvist et al., 1993; Löfqvist & Gracco, 1995). Ultrasound has been used to acquire point-specific tongue data as well (Keller & Ostry, 1983; Kaburagi & Honda, 1994), but is used primarily to provide dynamic views of the tongue surface and other soft-tissue structures (Morrish et al., 1985; Stone et al., 1988). While no system has yet surpassed the high-resolution, sagittal view of the entire vocal apparatus provided by cineradiography (Perkell, 1969; Subtelny et al., 1972; Wood, 1979), the recent development of magnetic resonance imaging (MRI) is very promising (Baer et al., 1991; Moore, 1992; Tiede, 1993; Dang et al., 1994; Rubin et al., 1995). Although MRI has so far been limited to imaging the static vocal tract, as scan rates and image-enhancement techniques improve, MRI could provide highly detailed, three-dimensional images of the vocal tract and surrounding structures during speech. The ability to record synchronous combinations of signals compensates substantially for the individual limitations of the various time-varying measurement techniques.
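As a minimal illustration of what combining such recordings can involve in practice, the following sketch resamples two hypothetical channels, acquired at different rates, onto a common time base so they can be compared sample by sample. The channel names, sampling rates, and signals are assumptions made for the example, not properties of any particular system mentioned above.

```python
# Minimal sketch (assumed setup): aligning signals recorded by different
# devices at different sampling rates onto one shared time base.
import numpy as np

def resample_to(t_common, t_signal, signal):
    """Linearly interpolate a signal onto a shared time vector (seconds)."""
    return np.interp(t_common, t_signal, signal)

# Fabricated example channels (stand-ins for real recordings):
fs_egg, fs_ema = 10_000.0, 500.0          # assumed electroglottograph and articulatory rates
dur = 1.0
t_egg = np.arange(int(dur * fs_egg)) / fs_egg
t_ema = np.arange(int(dur * fs_ema)) / fs_ema
egg = np.sin(2 * np.pi * 120 * t_egg)     # stand-in for a glottal waveform
ema = 0.01 * np.sin(2 * np.pi * 4 * t_ema)  # stand-in for a tongue-marker trace

# Common analysis grid at 1 kHz:
t_common = np.arange(int(dur * 1000)) / 1000.0
egg_c = resample_to(t_common, t_egg, egg)
ema_c = resample_to(t_common, t_ema, ema)
print(egg_c.shape, ema_c.shape)           # both (1000,): now directly comparable
```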
There is a basic tradeoff between techniques that make rapid and accurate flesh-point measurements of vocal tract structures (e.g., tiny pellets placed on the tongue surface or mandible in the case of the X-ray microbeam, or markers placed on the lips and jaw in the case of optoelectronic position-sensing devices) and those that provide more global views of the vocal tract but with poorer spatiotemporal resolution, such as ultrasound or MRI. For example, ultrasound has scan times fast enough (approximately 35 msec) to track the relevant motions of the tongue body, but it cannot capture tongue-tip motion during production of /d, t, n/, where the tongue tip may contact the maxillary arch for as little as 15 msec. Ultrasound transduction of tongue-tip gestures is further complicated by the requirement that there be only one air-tissue boundary between the externally mounted transmitter/transducer and the articulator surface of interest; when the tongue tip is raised to the maxillary arch, an additional air cavity usually appears between the underside of the tongue and the floor of the mouth. Still, when ultrasound is combined with fast transduction systems such as the X-ray microbeam or magnetometer, a much better picture of the tongue's activity emerges (e.g., Stone, 1990, 1991). Together, the wide variety of transduction devices currently available makes it possible to assess the dynamic interaction of laryngeal and supralaryngeal structures at biomechanical, neurophysiological, and acoustic levels of analysis.
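The temporal-resolution limit noted above (a roughly 35 msec scan interval against a roughly 15 msec tongue-tip contact) can be made explicit with a small simulation. The sketch below treats each ultrasound scan as an instantaneous snapshot, which is a simplification, and all other details are assumptions introduced only for illustration.

```python
# Illustrative sketch: how often a ~15 ms tongue-tip contact falls entirely
# between successive ultrasound scans taken every ~35 ms (scans idealized as
# instantaneous snapshots, an assumption made for this example).
import numpy as np

rng = np.random.default_rng(0)
scan_interval = 0.035      # ~35 ms between scans (from the text)
contact_dur = 0.015        # ~15 ms tongue-tip contact (from the text)

trials = 100_000
# Contact onset placed uniformly at random within the scan cycle:
onsets = rng.uniform(0.0, scan_interval, trials)
# The contact is missed entirely if it ends before the next scan:
missed = (onsets + contact_dur) < scan_interval
print(f"fraction of contacts caught by no scan: {missed.mean():.2f}")  # ~(35-15)/35 = 0.57
```

Under these simplified assumptions, more than half of such brief contacts would leave no trace in the ultrasound record, which is why combining ultrasound with faster point-tracking systems is attractive.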
A Paradigm for Speech Research