CHI '95 Proceedings

Integrating Multiple Cues for Spoken Language Understanding

Karen Ward and David G. Novick

Oregon Graduate Institute of Science & Technology
20000 NW Walker Road,
Beaverton, Oregon 97006 USA, (503) 690-1121



Keywords: Spoken language interfaces


As spoken language interfaces for real-world systems become a practical possibility, it has become apparent that such interfaces will need to draw on a variety of cues from diverse sources to achieve a robustness and naturalness approaching that of human performance [1]. However, our knowledge of how these cues behave in the aggregate is still tantalizingly sketchy. We lack a strong theoretical basis for predicting which cues will prove useful in practice and for specifying how these cues should be combined to signal or cancel out potential interpretations of the communicative signal. In the research program summarized here, we propose to develop and test an initial theory of cue integration for spoken language interfaces. By establishing a principled basis for integrating knowledge sources for such interfaces, we believe that we can develop systems that perform better from a computer-human interaction standpoint.


Historically, spoken language understanding research developed from the speech recognition (SR) tradition with its emphasis on identifying words and phrases. Current systems have largely evolved through a series of heuristics-driven enhancements to existing speech recognizers (e.g., [3], [4], [9]). This evolution has led to what we term the "SR bias," an emphasis on "getting the words right" as the measurable goal of a spoken language component; system enhancements are viewed as "improving the performance of the recognizer."

We believe that this SR-centered approach does not offer a sound theoretical basis for understanding how cues from various sources contribute to or eliminate possible interpretations of an utterance and how these cues might be usefully exploited in systems. We believe that spoken language understanding is better viewed as a computer-human interaction problem. The goal should be to enable the system as a whole to respond reasonably even when the recognizer--or any other component--performs poorly. Our research program, therefore, calls for understanding how various cues contribute to system performance in the context of spoken language interfaces to task-oriented mixed-initiative systems. We furthermore join others (e.g., [2]) in asserting that such systems are best evaluated as interfaces and judged in terms of their success in supporting users in accomplishing tasks.

Recently we have seen an increase in research probing specific relationships between some of the knowledge sources used in spoken communication; a brief review may be found in [11]. In summary, however, we note that although several studies have shown relationships between pairs of various potential cues, none has attempted to study more complex interactions or to test the practical application of its findings in a working system. In this research program we are studying the interrelationships of four cues: lexicalization, pause, pitch change, and pitch accent.

Current systems rely primarily on lexicalization to signal speaker intention, with the context of the preceding utterance providing additional constraints (e.g., [13]). Pause length is a strong marker for syntactic structure in professionally read speech ([8], [10]). We lack computational models for understanding pause cues in spontaneous speech, however; existing systems simply ignore pause. Pitch changes offer additional cues about the speaker's intentions. Pierrehumbert and Hirschberg [7] proposed that phrasal tunes signal relationships between the propositional content and the mutual beliefs of the participants. More specifically, Nakajima and Allen [5] examined the relationship between fundamental frequency (F0) and discourse structure in spontaneous task-oriented dialogue and found that F0 values tend to signal topic shift and topic continuation across pause boundaries. Pitch accents mark salient material [7], which may be useful not only in interpreting the intention behind the utterance but also in locating critical content words for recognition purposes.
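As a rough illustration, the F0 finding of Nakajima and Allen [5] could be operationalized as a rule over mean pitch on either side of a pause. The window semantics, 20 Hz reset threshold, and 500 ms pause threshold below are illustrative assumptions, not values from their study:

```python
def classify_boundary(f0_before, f0_after, pause_ms,
                      reset_threshold=20.0, pause_threshold=500):
    """Guess the discourse relation across a pause from F0 cues.

    f0_before / f0_after: mean F0 (Hz) in short windows on either
    side of the pause.  All thresholds are illustrative placeholders.
    """
    reset = f0_after - f0_before
    if pause_ms >= pause_threshold and reset >= reset_threshold:
        return "topic-shift"         # long pause plus upward F0 reset
    if abs(reset) < reset_threshold:
        return "topic-continuation"  # F0 resumes near where it left off
    return "other"
```

A cue like this is cheap to compute from pitch-tracker output, which is what makes it attractive as one member of a set of redundant cues rather than a classifier in its own right.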

We furthermore expect these cues to be of practical use in the context of a spoken language interface in that they are available and relatively robust in existing systems. In a system expected to participate in real-time conversational interaction, it will be important to exploit low-level cues that are robust and fast to process so that slower and more complex analysis can be reserved for those inputs that require it.
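One way to realize this division of labor is a two-stage cascade in which cheap, robust cues settle the easy cases and slower analysis handles the rest. The stage functions below (`fast_cues`, `deep_parse`) and the confidence threshold are hypothetical stand-ins, not components of an existing system:

```python
def interpret(utterance, fast_cues, deep_parse, threshold=0.8):
    """Run cheap cue analysis first; fall back to deeper analysis
    only when the fast stage is not confident enough."""
    label, confidence = fast_cues(utterance)
    if confidence >= threshold:
        return label               # fast, robust path
    return deep_parse(utterance)   # slower, more complex analysis

# Toy stand-ins for the two stages:
fast = lambda u: ("acknowledgment", 0.9) if u == "right" else ("unknown", 0.1)
deep = lambda u: "statement"

print(interpret("right", fast, deep))            # resolved by the fast path
print(interpret("move the boxcar", fast, deep))  # falls back to deep parse
```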

We are investigating the relative contributions of these cues to the recognition of the acknowledgment speech act. Acknowledgments play an important role in mixed-initiative conversation in assuring conversants that the dialogue is on track [6]. It is important, then, for a spoken language understanding system to recognize and respond to acknowledgments appropriately.
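A minimal sketch of what cue integration for this task might look like, assuming a simple weighted vote. The weights and decision threshold are invented for illustration; how the cues should actually be combined is precisely the open question this research program addresses:

```python
# Invented weights for illustration; determining how cues should
# actually be combined is the open research question.
CUE_WEIGHTS = {"lexical": 0.4, "pause": 0.2,
               "pitch_change": 0.2, "pitch_accent": 0.2}

def acknowledgment_score(cues):
    """cues maps cue name -> evidence strength in [0, 1]."""
    return sum(w * cues.get(name, 0.0) for name, w in CUE_WEIGHTS.items())

def is_acknowledgment(cues, threshold=0.5):
    """Redundant cues let one weak or missing cue be outvoted."""
    return acknowledgment_score(cues) >= threshold
```

The point of the redundancy is robustness: a recognizer error that weakens the lexical cue need not defeat the classification if the prosodic cues agree.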

In the work completed to date, we examined prosodic characteristics of a word used in several distinct senses, one sense being to signal an acknowledgment. Our results indicate that intonation as reported by a pitch tracker can aid in disambiguating senses of homonyms such as different usages of the word "right." We did not find pitch change alone to be an adequate discriminator of word usage; if used as the sole cue, it correctly categorized 67% of the occurrences. The usefulness of this finding lies in considering local pitch change as one of several redundant cues. For example, the direction of pitch change could serve as a confirming cue when analyzing ambiguous or erroneous recognizer output. Details of this study may be found in [12].
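A sketch of how local pitch change might serve as such a confirming cue for "right". The endpoint-difference heuristic, the 5 Hz tolerance, and the sense mapping are illustrative assumptions, not the method of [12]:

```python
def pitch_direction(f0_track, tolerance=5.0):
    """Direction of local pitch change over a word's F0 samples (Hz).

    Uses a crude endpoint difference; a real system would smooth
    the pitch-tracker output first.
    """
    if len(f0_track) < 2:
        return "flat"
    delta = f0_track[-1] - f0_track[0]
    if delta > tolerance:
        return "rising"
    if delta < -tolerance:
        return "falling"
    return "flat"

def confirms_acknowledgment(f0_track):
    """Treat falling or flat intonation as consistent with an
    acknowledgment reading of "right", rising as evidence against.
    The mapping is illustrative; since pitch alone categorized only
    67% of occurrences correctly, this is a confirming cue, not a
    stand-alone classifier.
    """
    return pitch_direction(f0_track) != "rising"
```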

We are now expanding the prosodic study to encompass other acknowledgment acts and to account for the contribution of the four cues identified above in recognizing acknowledgments in mixed-initiative task-oriented dialogue. To assess the usefulness of our findings, we will implement them in the context of a working system with a spoken language interface. Assessment will be based on a comparison of two versions of the system, one that recognizes acknowledgments based on the four cues we studied and one that uses lexicalization alone. We plan to use a within-subject design, with each subject using both versions of the system to complete several tasks. Metrics will be slightly modified from those proposed by Goodine et al. [2].

A second experiment will probe details of the cue interrelationships. Subjects will complete several scheduling tasks using versions of the system in which one or more of the cues are ignored. Metrics, tasks and experimental design will be as in the previous experiment. An important result expected from this experiment will be estimates of the reliability of the various cues under realistic conditions.

Our larger goal is to develop robust spoken language interfaces. We believe that this task depends crucially on developing a sound theoretical basis for incorporating and combining cues from diverse sources. We judge our success in terms of the improvements seen in spoken language interfaces to task-oriented systems. From this basis we can design systems that respond effectively to the complex communicative event we call spoken language.


1. Cole, R. A., Hirschman, L., et al. (1992). "Workshop on Spoken Language Understanding," Oregon Graduate Institute Technical Report No. CS/E 92-014.

2. Goodine, D., Hirschman, L., Polifroni, J., Seneff, S., & Zue, V. (1992). "Evaluating Interactive Spoken Language Systems," Proceedings of the 1992 International Conference on Spoken Language Processing (ICSLP 92), pp. 197-200.

3. Howells, T., Friedman, D., & Fanty, M. (1992). "Broca, An Integrated Parser for Spoken Language," Proceedings of the 1992 International Conference on Spoken Language Processing (ICSLP 92), pp. 325-328.

4. Issar, S. & Ward, W. (1993). "CMU's Robust Spoken Language Understanding System," Eurospeech '93, pp. 2147-2150.

5. Nakajima, S. & Allen, J. (1993). "A Study on Prosody and Discourse Structure in Cooperative Dialogues," Rochester Technical Report No. TRAINS-TN93-2, Sept. 1993.

6. Novick, D. G. & Sutton, S. (1994). "An Empirical Model of Acknowledgment for Spoken-Language Systems," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 96-101.

7. Pierrehumbert, J. & Hirschberg, J. (1990). "The Meaning of Intonational Contours in the Interpretation of Discourse," in Intentions in Communication, P. Cohen, J. Morgan, & M. Pollack (Eds.), Chapter 14, pp. 271-311, Cambridge, MA: MIT Press.

8. Price, P., Ostendorf, M., Shattuck-Hufnagel, S., & Fong, C. (1991). "The Use of Prosody in Syntactic Disambiguation," in Proceedings of the Fourth DARPA Workshop on Speech and Natural Language, Patti Price (Ed.).

9. Seneff, S. (1992). "TINA: A Natural Language System for Spoken Language Applications," Computational Linguistics, Vol. 18, No. 1, pp. 61-86.

10. Wang, M. Q. & Hirschberg, J. (1992). "Automatic Classification of Intonational Phrase Boundaries," Computer Speech and Language, Vol. 6, pp. 175-196.

11. Ward, K. & Novick, D. G. (1994). "On the Need for a Theory of Integration of Knowledge Sources for Spoken Language Understanding," Proceedings of the AAAI-94 Workshop on the Integration of Natural Language and Speech Processing, July 1994, pp. 23-30.

12. Ward, K. & Novick, D. G. (1995). "Prosodic Cues to Word Usage," to appear in Proceedings of ICASSP-95.

13. Young, S. & Ward, W. (1993). "Semantic and Pragmatically Based Re-Recognition of Spontaneous Speech," Eurospeech '93, pp. 2243-2246.