



Karen Ward and David G. Novick
Spoken language interfaces
As spoken language interfaces for real-world systems
become a practical possibility, it has become apparent that
such interfaces will need to draw on a variety of cues from
diverse sources to achieve a robustness and naturalness
approaching that of human performance [1]. However, our
knowledge of how these cues behave in the aggregate is
still tantalizingly sketchy. We lack a strong theoretical basis
for predicting which cues will prove useful in practice and
for specifying how these cues should be combined to signal
or cancel out potential interpretations of the communicative
signal. In the research program summarized here, we propose
to develop and test an initial theory of cue integration
for spoken language interfaces. By establishing a principled
basis for integrating knowledge sources for such interfaces,
we believe that we can develop systems that perform better
from a computer-human interaction standpoint.
Historically, spoken language understanding research
developed from the speech recognition (SR) tradition with
its emphasis on identifying words and phrases. Current systems
have largely evolved through a series of heuristics-driven
enhancements to existing speech recognizers (e.g., [3], [4], [9]).
This evolution has led to what we term the
"SR bias," an emphasis on "getting the words right" as the
measurable goal of a spoken language component; system
enhancements are viewed as "improving the performance
of the recognizer."
We believe that this SR-centered approach does not offer a
sound theoretical basis for understanding how cues from
various sources contribute to or eliminate possible interpretations
of an utterance and how these cues might be usefully
exploited in systems. We believe that spoken language
understanding is better viewed as a computer-human interaction
problem. The goal should be to enable the system as
a whole to respond reasonably even when the recognizer, or
any other component, performs poorly. Our research
program, therefore, calls for understanding how various
cues contribute to system performance in the context of
spoken language interfaces to task-oriented mixed-initiative
systems. We furthermore join others (e.g., [2]) in
asserting that such systems are best evaluated as interfaces
and judged in terms of their success in supporting users in
accomplishing tasks.
Recently we have seen an increase in research probing specific
relationships between some of the knowledge sources
used in spoken communication; a brief review may be found in [11].
In summary, however, we note that although
several studies have shown relationships between pairs of
various potential cues, none have attempted to study more
complex interactions or to test the practical application of
their findings in a working system. In this research program
we are studying the interrelationships of four cues:
Current systems rely primarily on lexicalization to signal
speaker intention, with the context of the preceding utterance
providing additional constraints (e.g., [13]). Pause
length is a strong marker for syntactic structure in professionally
read speech ([8], [10]). However, we lack computational
models for understanding pause cues in spontaneous
speech; existing systems simply ignore pauses.
Pitch changes offer additional cues about the speaker's
intentions. Pierrehumbert and Hirschberg [7] proposed that
phrasal tunes signal relationships between the propositional
content and the mutual beliefs of the participants. More
specifically, Nakajima and Allen [5] examined the relationship
between fundamental frequency (F0) and discourse
structure in spontaneous task-oriented dialogue and found
that F0 values tend to signal topic shift and topic continuation
across pause boundaries. Pitch accents mark salient
material [7], which may be useful not only in interpreting
the intention behind the utterance but also in locating critical
content words for recognition purposes.
We furthermore expect these cues to be of practical use in
the context of a spoken language interface in that they are
available and relatively robust in existing systems. In a system
expected to participate in real-time conversational
interaction, it will be important to exploit low-level cues
that are robust and fast to process so that slower and more
complex analysis can be reserved for those inputs that
require it.
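
The idea of reserving slower analysis for the inputs that need it can be sketched as a simple cascade. This is a minimal illustration, not the design of the system described here: the function names, the confidence scale, the 0.8 threshold, and the toy cue detectors are all assumptions made for the example.

```python
def interpret(utterance, fast_cues, slow_analysis, threshold=0.8):
    """Try cheap cues first; fall back to slower analysis only when needed.

    fast_cues: functions mapping an utterance to (label, confidence in [0, 1]).
    slow_analysis: function mapping an utterance to a label.
    Returns (label, path), where path records which route produced the label.
    """
    best_label, best_conf = None, 0.0
    for cue in fast_cues:
        label, conf = cue(utterance)
        if conf > best_conf:
            best_label, best_conf = label, conf
    # Accept the fast answer only if some cue was confident enough.
    if best_label is not None and best_conf >= threshold:
        return best_label, "fast"
    return slow_analysis(utterance), "slow"


# Hypothetical stand-ins for real cue detectors (assumptions for this sketch).
def lexical_cue(utt):
    if utt["words"] == ["okay"]:
        return "acknowledgment", 0.9
    return "unknown", 0.2


def full_parse(utt):
    # Placeholder for a slower, more complete analysis.
    return "request"


print(interpret({"words": ["okay"]}, [lexical_cue], full_parse))
print(interpret({"words": ["show", "me", "flights"]}, [lexical_cue], full_parse))
```

In a cascade like this, the threshold controls how often the slow path runs, so it would need to be tuned against whatever latency and accuracy targets the interface has.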
We are investigating the relative contributions of these cues
to the recognition of the acknowledgment speech act.
Acknowledgments play an important role in mixed-initiative
conversation in assuring conversants that the dialogue
is on track [6]. It is important,
then, for a spoken language understanding system to recognize
and respond to acknowledgments appropriately.
In the work completed to date, we examined prosodic characteristics
of a word used in several distinct senses, one
sense being to signal an acknowledgment. Our results indicate
that intonation, as reported by a pitch tracker, can aid in
disambiguating the senses of homonyms, such as the different
usages of the word "right." We did not find pitch change
alone to be an adequate discriminator of word usage; if
used as the sole cue, it correctly categorized 67% of the
occurrences. The usefulness of this finding lies in considering
local pitch change as one of several redundant cues. For
example, the direction of pitch change could serve as a confirming
cue when analyzing ambiguous or erroneous recognizer output. Details
of this study may be found in [12].
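
As a rough illustration of using pitch direction as a confirming cue, the sketch below classifies the local pitch movement over a word and uses it to nudge the scores of competing word senses. It is not the method of the study in [12]: the halves-based direction estimate, the 5 Hz threshold, the boost value, the sense labels, and above all the mapping from pitch direction to sense are placeholders assumed for the example.

```python
from statistics import mean


def pitch_change_direction(f0_hz, min_delta_hz=5.0):
    """Classify local pitch movement over one word as 'rise', 'fall', or 'level'.

    f0_hz: F0 estimates in Hz from a pitch tracker; unvoiced frames are
    given as None or 0 and are skipped.
    """
    voiced = [f for f in f0_hz if f]
    if len(voiced) < 2:
        return "level"
    half = len(voiced) // 2
    # Compare mean F0 in the second half of the word with the first half.
    delta = mean(voiced[half:]) - mean(voiced[:half])
    if delta > min_delta_hz:
        return "rise"
    if delta < -min_delta_hz:
        return "fall"
    return "level"


def confirm_with_pitch(candidates, direction, expected, boost=0.1):
    """Pick a sense, letting pitch direction confirm close candidates.

    candidates: dict mapping sense -> score from the recognizer side.
    expected: dict mapping sense -> pitch direction assumed typical for it
              (a placeholder mapping, not a result reported in the text).
    """
    rescored = {
        sense: score + (boost if expected.get(sense) == direction else 0.0)
        for sense, score in candidates.items()
    }
    return max(rescored, key=rescored.get)


f0_track = [120.0, 118.0, None, 110.0, 100.0]      # made-up F0 values
direction = pitch_change_direction(f0_track)
candidates = {"acknowledgment": 0.48, "other": 0.50}  # made-up scores
expected = {"acknowledgment": "fall", "other": "level"}  # assumed mapping
print(direction, confirm_with_pitch(candidates, direction, expected))
```

Because the pitch cue only adds a small bonus, it can tip the balance between near-tied candidates without overriding a confident decision made on other evidence, which matches its role as one of several redundant cues.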
We are now expanding the prosodic study to encompass
other acknowledgment acts and to account for the contribution
of the four cues identified above in recognizing
acknowledgments in mixed-initiative task-oriented dialogue.
To assess the usefulness of our findings, we will
implement them in the context of a working system with a
spoken language interface. Assessment will be based on a
comparison of two versions of the system, one that recognizes
acknowledgments based on the four cues we studied and one that
uses only lexicalization. We plan to use a
within-subject design with each subject using both versions
of the system to complete several tasks. Metrics will be
slightly modified from those proposed by Goodine et al. [2]:
A second experiment will probe details of the cue interrelationships.
Subjects will complete several scheduling tasks
using versions of the system in which one or more of the
cues are ignored. Metrics, tasks and experimental design
will be as in the previous experiment. An important result
expected from this experiment will be estimates of the reliability
of the various cues under realistic conditions.
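
The set of system versions for such an experiment can be enumerated mechanically. The sketch below lists every variant in which one or more cues are ignored while at least one remains active. The cue names are our reading of the surrounding discussion and are assumptions for the example, as is the choice to keep at least one cue active.

```python
from itertools import combinations


def ablation_conditions(cues, max_ignored=None):
    """List system variants, each described by the cues it ignores and keeps."""
    cues = list(cues)
    if max_ignored is None:
        max_ignored = len(cues) - 1  # always keep at least one cue active
    conditions = []
    for k in range(1, max_ignored + 1):
        for ignored in combinations(cues, k):
            active = [c for c in cues if c not in ignored]
            conditions.append({"ignored": list(ignored), "active": active})
    return conditions


# Assumed cue labels, chosen for illustration only.
cues = ["lexicalization", "context", "pause", "pitch"]
for condition in ablation_conditions(cues, max_ignored=1):
    print(condition)
print(len(ablation_conditions(cues)))
```

With four cues the full enumeration grows quickly, so a within-subject study would likely have to select a subset of these conditions rather than run them all with every subject.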
Our larger goal is to develop robust spoken language interfaces.
We believe that this task depends crucially on developing
a sound theoretical basis for incorporating and
combining cues from diverse sources. We will judge our success
by the improvements seen in spoken language
interfaces to task-oriented systems. From this basis we can
design systems that respond effectively to the complex
communicative event we call spoken language.
12. Ward, K. & Novick, D. G. (1995). "Prosodic Cues to
Word Usage." To appear in ICASSP-95.