Talking Heads: Speech Production
|
Measuring and Analyzing Speech Production
Historical overview of the shifting articulatory and acoustic emphases

Before high-resolution data transduction and recording techniques became available in the second half of the 20th century, speech was ephemeral and its record impressionistic. The best a listener could do was to write down what was said in a notation that would allow a rough reconstruction of what transcribers thought they had heard. By the 1880s a standard orthography for phonetic transcription, the International Phonetic Alphabet (IPA), had emerged, which scholars hoped would be rich enough to transcribe utterances in any of the world's languages precisely. This development rested firmly on the notion of symbols for minimally distinctive sound segments (e.g., the phonemes /d/ and /t/ in hid vs. hit), augmented by diacritic marking of context-specific phonetic differences; but the method was still susceptible to the biases or misperceptions of the listener/transcriber. Early methods of in vivo investigation were restricted to what could be seen (e.g., movement of the lips and jaw), felt (e.g., vibration of the larynx, gross tongue position), or learned from practiced introspection of articulator position during production (e.g., Bell's Visible Speech, 1867). However, when the results of these methods were combined with those from anatomical and mechanical studies of cadavers, a great deal was correctly surmised about the relation between vocal tract shape and the resultant acoustics. This knowledge had practical applications, such as teaching the deaf to speak, and provided the basic scheme for modeling articulation: e.g., the vowel space, the dependence of pitch on pressure, the elastic tension of the vocal folds, and the height of the larynx.

By the end of the 19th century, the development of mechanical transduction techniques for slow-moving events such as rhythmic motion of the jaw, thorax (respiration), and other rhythmically entrained structures (e.g., finger- and foot-tapping during speech) made fairly detailed kinematic analysis possible (Sears, 1902; Stetson, 1905). Physiological studies began as well, e.g., transduction of neuromuscular events (Stetson, 1928). By modern standards, analysis of data from these studies was quite basic, e.g., measures of duration and observationally inferred estimates of articulator speed and impulse force. However, during this period many interesting claims and comprehensive hypotheses were advanced concerning the organization and control of learned voluntary behaviors (see Boring, 1950, for a review). The culmination of this epoch, in which the basic research paradigm entailed primarily the inference of articulatory events during production of minimally contrastive phonetic events (e.g., pod vs. pot), was the development of X-ray photography. Finally, the configuration of the entire vocal tract could be captured, first statically and later dynamically (cineradiography), during speech production. Unfortunately, the events revealed by analysis of these data were very coarse-grained and difficult to quantify.

The rapid development of acoustic recording techniques and the sudden awareness of the dangers of X-ray exposure in the late 1920s helped drive the shift in interest from speech articulation to acoustics. During the 1930s and 1940s, magnetic recording, display, and analysis systems such as wire recorders, the oscilloscope, and the sound spectrograph (Koenig et al., 1946) made it possible to study speech acoustic events in greater detail and revealed phoneme-specific information in the acoustic patterns.
In particular, vowel formants and consonant-dependent formant transitions were recognized as key components of phoneme identity, and their patterning alone was shown to be sufficient for synthesis of acceptable and distinct syllables such as /ba/, /da/, /ga/ (Cooper et al., 1951; Liberman et al., 1959). Given the ability to synthesize speech from acoustic patterns, it seemed possible to conduct meaningful research in the acoustic domain alone, regardless of the underlying articulatory configurations. Another contributor to the shift in focus from articulation to acoustics was the emergence of distinctive feature theory. In this view, heavily influenced by information theory, phonemes were composed of distinctive features whose binary values (+/-) enabled minimal phonemic and, therefore, informational contrasts (Jakobson et al., 1963). Thus, voicing pairs such as /p,b/, /t,d/, and /k,g/ are distinguished by the value of the voicing feature alone. Although feature detection was not restricted to the acoustic domain, the distinctive feature's role as the information-bearing element in the process of perceiving speaker intention required that the underlying message be encoded in the acoustic properties of the signal. This led quite naturally to the assumption that the medium of communicative interaction was strictly acoustic. Following these developments was an intense effort to decompose acoustic signals into minimally contrastive cues to phonetic intent (e.g., Lisker, 1957; Abramson & Lisker, 1965; Liberman et al., 1967; Liberman & Studdert-Kennedy, 1978; Repp, 1983, 1988; O'Shaughnessy, 1987).

However, despite demonstration of the perceptibility of a variety of acoustic cues, there were a number of problems; taken together, they suggest that consideration of the variable speech acoustics alone may not reveal how listeners arrive at the same percepts. One problem was that speech acoustics are highly variable both within and across speakers: not only do different speakers sound different, but no two utterances by the same speaker are acoustically identical. Another problem was the difficulty of finding acoustic cues that persist across the range of phonetic contexts. Thus, the relatively long lag between the release of a voiceless stop (/p,t,k/) and the onset of voicing for a following vowel (as in /pa/) clearly distinguishes the stop from its voiced counterpart (/b,d,g/) if the syllable is stressed; in other contexts, this voice onset time (VOT) is a weaker cue to voicing or may be absent entirely (e.g., before a pause). The search for persistent cues led to a third problem, namely, multiple cues to the same phonetic feature that overlap or trade off depending on the context, speaker, and so on. For example, other cues to consonant voicing are vowel quality and duration (Lisker & Abramson, 1967). The inability to identify unique cues for specific features suggested that the mapping between phonetic categories and acoustic distinctions might be many-to-many. Finally, recent research has shown that visual as well as acoustic information can be useful in speech perception (Benoît et al., 1992, 1994; Massaro et al., 1993; Sekiyama & Tohkura, 1993). Indeed, the two modalities appear to complement each other, in that some of the acoustically less stable cues (e.g., cues to place of articulation that reside, often very briefly, in the spectral fine structure) are visually the most consistently perceived, and vice versa for acoustically more stable cues such as nasality (for review, see Summerfield, 1987, 1991).
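To illustrate what treating VOT as a single acoustic cue involves, the following minimal sketch computes the release-to-voicing lag and makes a crude two-way voicing decision from it. It is illustrative only: the timing values and the 30 ms decision boundary are placeholder figures, not values taken from the studies cited above.

```python
# Illustrative sketch (placeholder values): voice onset time (VOT) used as a
# single acoustic cue to stop voicing, ignoring all other cues.

def voice_onset_time(release_time_s: float, voicing_onset_s: float) -> float:
    """Return VOT: the lag between stop release and the onset of voicing."""
    return voicing_onset_s - release_time_s

def classify_stop_voicing(vot_s: float, boundary_s: float = 0.030) -> str:
    """Crude two-way decision from VOT alone; the boundary is a placeholder."""
    return "voiceless (/p,t,k/-like)" if vot_s >= boundary_s else "voiced (/b,d,g/-like)"

if __name__ == "__main__":
    # e.g., a stressed /pa/: release at 0.100 s, voicing begins at 0.165 s
    vot = voice_onset_time(0.100, 0.165)
    print(f"VOT = {vot * 1000:.0f} ms ->", classify_stop_voicing(vot))

    # e.g., a /ba/: voicing begins almost immediately after release
    vot = voice_onset_time(0.200, 0.205)
    print(f"VOT = {vot * 1000:.0f} ms ->", classify_stop_voicing(vot))
```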
For articulation research, recognition of the mismatch between variable acoustic events and phonetic categories coincided with improvements in articulatory transduction techniques (e.g., better subject safety, greater measurement accuracy, and reduced cost) and with awareness of a growing corpus of physiological data from studies of other biological movement systems that suggested lawful constraints on the organization and production of biological behavior. Coupled with the notion that communicative intent must go through the production structure before any acoustic "encoding" can occur, this has led to a current emphasis on considering phonetic abstractions in terms of underlying articulatory behavior. Consider as an example production of the bilabial /b/ between two vowels (e.g., /aba/). Acoustically, no two productions are identical, even when produced by the same speaker. Yet the articulatory event is relatively simple: in each case the lips come together, stopping the airflow from the vocal tract, and are then released for the following vowel. Vocal fold vibration, necessary for the two vowels, may or may not stop during the closure period for the /b/. This depends on whether the lips remain closed long enough for the air pressure above the glottis to become the same as the pressure below the glottis, at which point the vocal folds cease to vibrate. Thus, at least some of the acoustic variability can be accounted for by observing the timing of lip closure and release relative to aerodynamic factors such as airflow and intraoral pressure.
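This aerodynamic account of devoicing during closure can be made concrete with a small, deliberately simplified sketch. It assumes, as placeholders rather than claims from the text, a fixed subglottal pressure, a steady rise in intraoral pressure while the lips are closed, and a minimum transglottal pressure drop below which the folds stop vibrating; all numbers are illustrative.

```python
# Illustrative sketch (assumed, simplified model): during a /b/ closure the
# oral (supraglottal) pressure rises toward the subglottal pressure, and
# vocal fold vibration is assumed to stop once the pressure drop across the
# glottis falls below a phonation threshold. All numbers are placeholders.

SUBGLOTTAL_PRESSURE = 8.0   # cm H2O, rough placeholder for conversational speech
PHONATION_THRESHOLD = 3.0   # cm H2O, minimum transglottal drop for voicing
RISE_PER_STEP = 1.5         # cm H2O of oral pressure buildup per 10 ms step

def voicing_during_closure(closure_ms: int, step_ms: int = 10) -> list[bool]:
    """Return, per time step of the closure, whether voicing continues."""
    oral_pressure = 0.0
    voiced = []
    for _ in range(closure_ms // step_ms):
        transglottal_drop = SUBGLOTTAL_PRESSURE - oral_pressure
        voiced.append(transglottal_drop > PHONATION_THRESHOLD)
        # closed lips trap the airflow, so oral pressure keeps building
        oral_pressure = min(SUBGLOTTAL_PRESSURE, oral_pressure + RISE_PER_STEP)
    return voiced

if __name__ == "__main__":
    print("short closure (40 ms): ", voicing_during_closure(40))   # voicing persists
    print("long closure (120 ms): ", voicing_during_closure(120))  # voicing dies out
```

In this toy model a short closure ends before the pressures equalize, so voicing continues throughout; a long closure lets the intraoral pressure catch up with the subglottal pressure and voicing ceases partway through, which is one articulatory source of the acoustic variability noted above.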
Another example is the difference between oral /d/ and nasal /n/. Both of these sounds are produced by placing the tip of the tongue against the alveolar (maxillary) ridge and/or the front teeth. Articulatorily, they are distinguished by whether or not the velar port is open; if it is open, the result is nasal. Acoustically, however, the difference is quite complex. In early articulatory synthesis, the degree of velar port opening was systematically varied to produce ordered steps along a continuum from an acoustically perceptible /da/ to /na/ (Rubin et al., 1981), simply by increasing the velar opening from completely closed to the degree of opening necessary for an acceptable /na/. To produce a similar effect with acoustic synthesis requires simultaneous control over a variety of parameters, including the frequencies and bandwidths of the three oral resonances as well as the frequencies and bandwidths of the nasal resonance and the nasal anti-resonance.
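The contrast in control dimensionality can be made concrete with a rough sketch. The numbers below are placeholders, not parameters from Rubin et al. (1981): the same /da/-to-/na/ continuum is expressed articulatorily as a single velar port opening value, and acoustically as a set of oral and nasal resonance parameters that must all covary at every step.

```python
# Illustrative sketch (assumed values): one articulatory control parameter
# versus many covarying acoustic control parameters for a /da/-/na/ continuum.

def articulatory_continuum(steps: int, max_opening_cm2: float = 0.5):
    """One control parameter: velar port area from fully closed to open."""
    return [{"velar_port_area_cm2": max_opening_cm2 * i / (steps - 1)}
            for i in range(steps)]

def acoustic_continuum(steps: int):
    """Many control parameters that must change together at every step.
    Endpoint values are placeholders, linearly interpolated for simplicity."""
    da = {"F1": 700, "B1": 90, "F2": 1200, "B2": 100, "F3": 2600, "B3": 150,
          "nasal_pole_Hz": 0, "nasal_pole_BW": 0,
          "nasal_zero_Hz": 0, "nasal_zero_BW": 0}
    na = {"F1": 500, "B1": 200, "F2": 1100, "B2": 150, "F3": 2500, "B3": 200,
          "nasal_pole_Hz": 250, "nasal_pole_BW": 100,
          "nasal_zero_Hz": 750, "nasal_zero_BW": 150}
    return [{k: da[k] + (na[k] - da[k]) * i / (steps - 1) for k in da}
            for i in range(steps)]

if __name__ == "__main__":
    steps = 7
    artic = articulatory_continuum(steps)
    acous = acoustic_continuum(steps)
    print("parameters per step, articulatory:", len(artic[0]))  # 1
    print("parameters per step, acoustic:    ", len(acous[0]))  # 10
```

The point is not the particular numbers but the shape of the control problem: the articulatory description varies a single quantity, while the acoustic description must keep ten quantities in the right mutual relationship at every step of the continuum.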
Another example of how simple variations in articulatory movement can result in complex changes in both acoustic and phonetic detail is provided in the section on modeling the human vocal tract.
A paradigm for speech research.