Speech Technology
Speech technology, such as speech synthesis, speech recognition, and speech coding, is an increasing presence in our world. Some of the time, it is an annoying presence, as it is when you can’t get a voice recognizer to recognize you. But many of the systems, like telephones themselves, work so well that we forget how recent an invention they are. Research at Haskins Laboratories has contributed to the advancement of speech technology.

When Haskins was founded in 1935, some things were already known about the acoustic properties of speech. The first electronic speech synthesizer, constructed by Homer Dudley at Bell Laboratories, was demonstrated at the World’s Fair in 1939. While it produced fairly convincing synthetic speech, it was not designed as a tool for increasing our knowledge of speech. The first synthesizer that made it easy to test acoustic properties was the Pattern Playback, developed at Haskins Laboratories by Frank Cooper in 1951. This device converts spectrographic pictures (also known as voiceprints) into sound, using either photographic copies of actual spectrograms or "synthetic" patterns painted by hand on a cellulose acetate base.
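
As a rough modern analogue of what the Pattern Playback did optically, a painted time-frequency pattern can be turned into sound by summing harmonics of a fixed fundamental (about 120 Hz in the original device), with each harmonic's amplitude following the darkness of the paint. The fragment below is only a software caricature of the idea, not a model of the actual machine; the pattern, frame length, and sample rate are illustrative.

```python
import numpy as np

def playback(pattern, f0=120.0, frame_s=0.02, fs=16000):
    """Convert a painted time-frequency pattern into sound by summing harmonics.

    pattern[h, t] is the painted intensity of harmonic h+1 of f0 during frame t,
    a crude software stand-in for the light the original device shone through a
    spectrogram.
    """
    n_harm, n_frames = pattern.shape
    samples_per_frame = int(frame_s * fs)
    t = np.arange(n_frames * samples_per_frame) / fs
    out = np.zeros_like(t)
    for h in range(n_harm):
        amp = np.repeat(pattern[h], samples_per_frame)   # hold each frame's painted value
        out += amp * np.sin(2 * np.pi * f0 * (h + 1) * t)
    return out / max(1e-9, np.max(np.abs(out)))

# A tiny hand-"painted" pattern: a single band rising from roughly 700 Hz to 1200 Hz.
pattern = np.zeros((12, 10))
for frame in range(10):
    pattern[5 + frame // 2, frame] = 1.0
audio = playback(pattern)
```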

As people studied more and more spectrograms of real speech, they began to recognize which spectrogram patterns were produced by which sound sequences. They then tried drawing just these simplified patterns for the Pattern Playback, and thus, in a sense, began investigating ways to code speech, that is, to reduce it to its information-carrying essentials. In a systematic experiment, Pierre Delattre’s expert knowledge was formulated as rules, and patterns were drawn strictly according to these rules, without reference to real speech spectrograms. The result, published in 1959, was the first formulation of rule-based synthesis, in which the synthesizer computes what patterns should be produced for a given sequence of speech sounds.
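
As a minimal sketch of what synthesis by rule means, one can store a target formant pattern for each phoneme and a rule for drawing transitions between neighboring targets, then compute the whole pattern for any phoneme sequence without consulting a real spectrogram. The table values, timing constants, and function names below are invented placeholders, not the published Haskins rules.

```python
# Hypothetical steady-state (F1, F2) targets in Hz, one entry per phoneme.
PHONEME_TARGETS = {
    "a": (730, 1090),
    "i": (270, 2290),
    "u": (300,  870),
    "d": (200, 1700),   # rough stand-in for a stop-consonant "locus"
}

def formant_track(phonemes, steady_ms=80, transition_ms=40, step_ms=5):
    """Compute a (time_ms, F1, F2) track purely by rule for a phoneme sequence."""
    track, t = [], 0
    for i, ph in enumerate(phonemes):
        f1, f2 = PHONEME_TARGETS[ph]
        # steady-state portion of this phoneme
        for _ in range(steady_ms // step_ms):
            track.append((t, f1, f2))
            t += step_ms
        # linear transition toward the next phoneme's targets
        if i + 1 < len(phonemes):
            n1, n2 = PHONEME_TARGETS[phonemes[i + 1]]
            steps = transition_ms // step_ms
            for k in range(1, steps + 1):
                a = k / steps
                track.append((t, f1 + a * (n1 - f1), f2 + a * (n2 - f2)))
                t += step_ms
    return track

for point in formant_track(["d", "a", "d"])[:5]:
    print(point)
```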

This is called phonemic synthesis by rule. Ignatius Mattingly developed the first algorithm to automatically compute the prosody of speech, the changes in pitch that we use to help signal the syntax of a sentence. To test his algorithm, he could not use the Pattern Playback, since fundamental frequency (pitch) cannot be varied on it. Instead, he incorporated his rules into the synthesizer of John Holmes at the Joint Speech Research Unit in England. Prosodic rules for British English were successfully demonstrated in 1966; an American English set of prosodic and phonemic rules followed in 1968. These rules were used in a Haskins text-to-speech synthesizer demonstrated in 1973. A later version of the rules developed by Delattre and Mattingly was used in Dennis Klatt’s formant synthesizer, developed at MIT, which gave rise to MITalk in 1979, and to KlattTalk and DECtalk in 1983. This formant synthesizer is still in use today in many text-to-speech applications.
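
A formant synthesizer of the kind described here shapes a voice-like source with resonant filters whose center frequencies and bandwidths follow the formant values computed by rule. The sketch below is a bare-bones cascade of standard second-order digital resonators driven by an impulse train; it is a didactic simplification, not Klatt's synthesizer, and the pitch, formant frequencies, and bandwidths are illustrative.

```python
import numpy as np

def resonator(x, freq, bw, fs):
    """Apply a second-order (two-pole) digital resonator to the signal x."""
    r = np.exp(-np.pi * bw / fs)
    b = 2.0 * r * np.cos(2.0 * np.pi * freq / fs)
    c = -r * r
    a = 1.0 - b - c                  # normalize for unit gain at DC
    y = np.zeros_like(x)
    y1 = y2 = 0.0
    for n in range(len(x)):
        y[n] = a * x[n] + b * y1 + c * y2
        y2, y1 = y1, y[n]
    return y

fs = 16000
f0 = 120.0                           # illustrative pitch in Hz
source = np.zeros(int(0.5 * fs))
source[::int(fs / f0)] = 1.0         # crude impulse-train stand-in for the glottal source

# Cascade the source through resonators at illustrative formant frequencies/bandwidths.
out = source
for freq, bw in [(730, 90), (1090, 110), (2440, 170)]:
    out = resonator(out, freq, bw, fs)
out = out / np.max(np.abs(out))      # normalize before writing to a sound file
```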

The acoustic approach to speech is very useful, but we also wanted to know more directly how the speech articulators create the sounds we hear as speech. In 1973, Paul Mermelstein developed a simple way to specify vocal tract shape; an articulatory synthesizer based on it and designed for perception studies was published in 1981 (Rubin et al.). Articulatory synthesis has continued to be studied at Haskins through ASY and CASY. An articulatory synthesizer is itself a model of the relation between the changing shape of the vocal tract and the changing acoustic waveform it produces. The model built into a synthesizer can be tested against vocal tract and acoustic measurements from real human speakers; in turn, the data obtained from humans can be explained by reference to a model that can be more directly controlled. At present, no commercial synthesizers are based on articulatory synthesis; the gaps in our knowledge mean that articulatory synthesizers are either intelligible only over a restrictively small vocabulary or not intelligible enough to be usable. But articulatory synthesis offers the possibility of modeling any speaker, of any age or sex and even with some kinds of speech disorders, and of doing so most directly, since the production process is modeled from direct imaging of the vocal tract. Once the basic foundations of articulatory synthesis are better worked out, the most flexible and useful speech synthesizers may well take this approach rather than the current ones.
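
One way to make the articulation-to-acoustics mapping concrete is to approximate the vocal tract as a chain of short uniform tube sections whose cross-sectional areas determine how traveling sound waves scatter at each junction (the Kelly-Lochbaum idea used in waveguide models). The sketch below is a toy illustration of that mapping, not ASY or CASY; the area function, boundary reflection values, and source are invented for the example.

```python
import numpy as np

def tube_synth(areas, excitation, glottal_refl=0.75, lip_refl=-0.85):
    """Toy Kelly-Lochbaum-style tube model: each section delays the traveling
    waves by one sample, and waves scatter wherever the area changes."""
    m = len(areas)
    # reflection coefficients at the junctions between adjacent sections
    k = [(areas[i] - areas[i + 1]) / (areas[i] + areas[i + 1]) for i in range(m - 1)]
    fwd = np.zeros(m)     # right-going (toward the lips) wave in each section
    bwd = np.zeros(m)     # left-going (toward the glottis) wave in each section
    out = np.zeros(len(excitation))
    for n, e in enumerate(excitation):
        new_fwd = np.zeros(m)
        new_bwd = np.zeros(m)
        # glottis end: inject the source and partially reflect the returning wave
        new_fwd[0] = e + glottal_refl * bwd[0]
        # scattering at each interior junction
        for i in range(m - 1):
            new_fwd[i + 1] = (1 + k[i]) * fwd[i] - k[i] * bwd[i + 1]
            new_bwd[i] = k[i] * fwd[i] + (1 - k[i]) * bwd[i + 1]
        # lip end: part of the wave radiates out (the output), part reflects back
        out[n] = (1 + lip_refl) * fwd[m - 1]
        new_bwd[m - 1] = lip_refl * fwd[m - 1]
        fwd, bwd = new_fwd, new_bwd
    return out

fs = 44100
# hypothetical area function (cm^2) from glottis to lips, vaguely vowel-like
areas = [0.6, 0.9, 1.5, 2.6, 3.2, 3.0, 2.2, 1.6, 1.1, 0.9,
         1.2, 1.9, 2.8, 3.6, 4.0, 3.8, 3.2, 2.4, 1.8, 1.4, 1.2, 1.0]
source = np.zeros(fs // 2)
source[::fs // 110] = 1.0             # crude 110 Hz impulse-train source
wave = tube_synth(areas, source)
```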

The research aimed at improving synthesis also led to many significant findings about speech perception. The formant transitions visible on spectrograms between vowels and consonants proved to be important acoustic cues to a consonant’s place of articulation, that is, to the location in the vocal tract of the main constriction, such as the lips for “p” or “b”, or the alveolar ridge behind the teeth for “t” or “d”. The time between the release of a stop consonant and the onset of voicing in the following vowel, known as Voice Onset Time (VOT), proved to be an important acoustic cue to whether or not the stop was voiced, for instance, to whether it was “b” or “p”. Knowledge of the acoustic cues for all the phonemes was used in early speech recognition algorithms. Current commercial speech recognition systems instead use statistical pattern-recognition approaches, which were found to give better recognition rates. However, different ways of incorporating knowledge about speech perception into recognizers continue to be developed in efforts to improve accuracy and reliability.
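
The VOT cue lends itself to a very simple decision rule: measure the lag from the stop release to the onset of voicing and compare it with a category boundary. In the sketch below, the function name, the assumption that the burst and voicing-onset times have already been measured, and the roughly 25 ms boundary (an approximate value for English bilabial stops) are all illustrative.

```python
def classify_stop_voicing(burst_time_s, voicing_onset_time_s, boundary_s=0.025):
    """Label a stop as voiced ("b"-like) or voiceless ("p"-like) from VOT alone.

    The ~25 ms boundary is only an approximate value for English bilabial stops;
    real perception also depends on speaking rate, place of articulation, and
    other cues such as the formant transitions.
    """
    vot = voicing_onset_time_s - burst_time_s
    return "voiced" if vot < boundary_s else "voiceless"

print(classify_stop_voicing(0.100, 0.112))   # 12 ms VOT -> voiced ("b"-like)
print(classify_stop_voicing(0.100, 0.165))   # 65 ms VOT -> voiceless ("p"-like)
```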

Such perception studies also contribute to coding methods, which are now widely used in cellphones and music players. Here the opposite trend is evident: early coding methods typically used only properties of the signal itself, for instance coding how much the signal had changed rather than coding each sample individually. More recent, and more effective, methods use knowledge about speech and auditory perception to reduce the speech signal to only the essential parts that we actually perceive.
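
The "code the change, not the sample" idea can be illustrated with simple differential (DPCM-style) coding: quantize the difference between each incoming sample and a running reconstruction, so that each code is a small integer needing fewer bits than a raw sample. The step size and test signal below are illustrative, and the sketch is generic rather than any particular coding standard.

```python
import numpy as np

def dpcm_encode(signal, step=0.01):
    """Code each sample as the quantized change since the previous
    reconstructed sample, rather than as the sample value itself."""
    codes, estimate = [], 0.0
    for x in signal:
        q = int(round((x - estimate) / step))
        codes.append(q)              # a small integer needs far fewer bits
        estimate += q * step         # track what the decoder will rebuild
    return codes

def dpcm_decode(codes, step=0.01):
    """Rebuild the signal by accumulating the quantized changes."""
    out, estimate = [], 0.0
    for q in codes:
        estimate += q * step
        out.append(estimate)
    return np.array(out)

t = np.linspace(0, 0.02, 320)                 # 20 ms at 16 kHz
x = 0.5 * np.sin(2 * np.pi * 200 * t)         # toy 200 Hz tone
rebuilt = dpcm_decode(dpcm_encode(x))
print("max reconstruction error:", np.max(np.abs(rebuilt - x)))
```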