We want to build systems that are satisfying to talk to, that will feel attentive, supportive and responsive to users.
To do this we need to model 'real-time social skills', primarily the ability of a person to infer the other's needs, intentions, and feelings at the sub-second level. This isn't mind-reading, it just requires sensitivity to the non-verbal cues produced unconsciously while speaking.
The main challenge in building systems to do this is that of discovering the cues and rules that people use, since the details of interaction at this level are below conscious attention, and far from trivial to uncover. Indeed, most work in speech systems engineering, as in linguistics, ignores these issues. Thus the state of the art is, approximately, the sort of robotic formal interaction common in androids in science fiction movies, where the two conversants produce complete sentences and take turns rigidly.
So far we have built systems which produce back-channels (uh-huh etc.) with natural timing, which choose appropriate acknowledgements (right, yeah, good, etc.) based on the user's ephemeral emotions, which pace an explanation adaptively to the user's needs, and which control a simulated spaceship in response to the prosody of the user's advice. We are currently analyzing more phenomena and designing spoken dialog systems capable of other 'sensitive' and natural interactions.
| [clear-throat] | ai | hh-aaaah | iiyeah | okay | nuuuuu | ukay | uam | uumm | yeahh | |
| [click] | am | hhh | m-hm | okay-hh | nyaa-haao | um | uh | uun | yeahuuh | |
| [click]neeu | ao | hhh-uuuh | mm | ooa | nyeah | um-hm-uh-hm | uh-hn | uuuh | yegh | |
| [click]nuu | aoo | hhn | mm-hm | ookay | o-w | umm | uh-hn-uh-hn | uuuuuuu | yeh-yeah | |
| [click]ohh | aum | hmm | mm-mm | oooh | oa | ummum | uh-huh | wow | yei | |
| [click]yeah | eah | hmmmmm | mmm | ooooh | oh | unkay | uh-mm | yah-yeah | yo | |
| [inhale] | ehh | hn | myeah | oop-ep-oop | oh-eh | unununu | uh-uh | ye | yyeah | |
| achh | h-nmm | hn-hn | nn-hn | u-kay | oh-kay | uu | uh-uhmmm | yeah | | |
| ah | haah | huh | nn-nnn | u-uh | oh-okay | uuh | uhh | yeah-okay | | |
| ahh | hh | i | nu | u-uun | oh-yeah | uum | uhhh | yeah-yeah | | |
It seems that these items are in part compositional, in that each component sound brings a corresponding component of meaning.
Examples illustrating how each sound bears the same meaning across different contexts appear at the Conversational Grunts Homepage.
The model itself is described in Non-Lexical Conversational Sounds in American English.
Interestingly, sound-symbolism appears to be common in non-lexical utterances in Japanese also,
as reported in
The Relationship between Sound and Meaning in Japanese
Back-channel Grunts.
Recently we have found experimental evidence for the existence of sound-symbolism: Nasalization in Japanese Back-Channels bears Meaning, Nigel Ward and Masafumi Okamoto. International Congress of the Phonetic Sciences, 2003 final (pdf)
Regardless of how one chooses to model these items, labeling them consistently is a challenge. A simple set of Phonetic Labeling Guidelines is, however, almost always adequate. The issues involved in labeling are discussed further in: Issues in the Transcription of English Conversational Grunts. Nigel Ward. 1st SIGdial Workshop on Discourse and Dialogue. ACL. 2000. (abstract, postscript, pdf)
Getting multiple labelers to be able to identify the pragmatic functions of these items consistently is a challenge. Currently our best tagset is described in the Pragmatic Function Labeling Guidelines.
Performance on many practical tasks suffers because non-lexical items
present problems for speech recognition, speech synthesis, and dialog management,
so the need for better models is a real one:
The Challenge of Non-lexical Speech Sounds.
For example, the automatic number-giving that comes at the end of
directory assistance calls is at a fixed rate: too slow for some people, and too fast for others.
The rate can instead be adapted automatically, based on the user's speaking rate and response latency.
Automatic User-Adaptive Speaking Rate Selection for Information Delivery.
Nigel Ward and Satoshi Nakagawa. International Conference on Spoken Language
Processing 2002. (pdf)
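As a rough illustration of the idea of adapting delivery rate to the user's speaking rate and response latency, here is a minimal Python sketch. It is not the method of the paper cited above: the function name, the reference values, the rate bounds, and the way the two cues are combined (a geometric mean) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class UserObservation:
    """Hypothetical per-user measurements taken earlier in the call."""
    syllables_per_second: float  # how fast the user speaks
    response_latency_s: float    # how long the user takes to respond


def choose_delivery_rate(obs, base_rate=4.0, min_rate=2.5, max_rate=6.0,
                         ref_user_rate=5.0, ref_latency_s=0.6):
    """Return a delivery rate (syllables/second) for reading out a number.

    All default values are illustrative placeholders, not measured data.
    """
    # Users who speak faster than the reference get faster delivery.
    rate_factor = obs.syllables_per_second / ref_user_rate
    # Users who respond more slowly than the reference get slower delivery.
    latency_factor = ref_latency_s / max(obs.response_latency_s, 0.1)
    # Combine the two cues with a geometric mean so neither dominates.
    factor = (rate_factor * latency_factor) ** 0.5
    # Clamp to a range the synthesizer can render intelligibly.
    return min(max_rate, max(min_rate, base_rate * factor))


fast_user = UserObservation(syllables_per_second=6.0, response_latency_s=0.3)
slow_user = UserObservation(syllables_per_second=3.0, response_latency_s=1.5)
print(choose_delivery_rate(fast_user), choose_delivery_rate(slow_user))
```

A fast, quick-responding caller ends up near the upper bound and a slow, hesitant caller near the lower bound; in a real system the mapping would need to be fitted to user data.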
Sensitive choice of acknowledgements, based on
the context and the user's prosody, can make a tutorial system seem more supportive.
This work is described concisely in
Responding to Subtle, Fleeting Changes
in the User's Internal State.
Paying attention to the prosody of utterances in a cooperative task can be useful.
Design for a System able to use Time-Critical Spoken Advice.
Shunsuke Soeda and Nigel Ward.
Fifteenth National Conference of the Japanese Society for Artificial Intelligence, 2001.
(abstract).
(paper).
Also, replicating Schmandt's early work, we confirmed that
it is possible to pace an interaction appropriately
using only the duration and pitch of the user's utterances.
Pacing Spoken Directions to Suit the Listener.
Tatsuya Iwase and Nigel Ward.
5th International Conference on Spoken Language Processing (ICSLP-98).
We have built "Aizula", a system that can produce back-channel
feedback, such as uh-huh and mm, as well as a human, in some cases.
A Responsive Dialog System.
The key rule has been used in the MIT Media Lab's GrandChair system and tested in the Virtual Human at USC-ICT (link).
To discover the rules governing back-channel behavior we had to
build a system for analyzing dialog phenomena, "didi".
A full explanation appears as
Prosodic Features which Cue Back-channel Responses in English and
Japanese.
Recently we have discovered rules governing back-channeling in Arabic and Spanish.
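To give a feel for what a prosodic rule for back-channel timing might look like in code, here is a minimal Python sketch. It assumes, purely for illustration, that the cue is a sustained stretch of low pitch in the speaker's voice; the rule structure, function names, and every threshold below are placeholders for this example and should be checked against the paper cited above rather than taken as its findings. The sketch also works offline on a whole pitch track, whereas a live system would need a running estimate of the speaker's pitch range.

```python
def low_pitch_threshold(voiced_pitches, percentile=26):
    """Pitch value at the given percentile of the voiced frames."""
    ordered = sorted(voiced_pitches)
    idx = int(len(ordered) * percentile / 100)
    return ordered[min(idx, len(ordered) - 1)]


def backchannel_cues(pitch_hz, frame_ms=10, min_low_ms=110,
                     refractory_ms=800, percentile=26):
    """Return frame indices at which a back-channel could be triggered.

    pitch_hz: per-frame pitch in Hz, with 0 or None for unvoiced frames.
    All default thresholds are illustrative placeholders.
    """
    voiced = [p for p in pitch_hz if p]
    if not voiced:
        return []
    threshold = low_pitch_threshold(voiced, percentile)
    frames_needed = -(-min_low_ms // frame_ms)  # ceiling division
    cues = []
    run = 0             # length of the current low-pitch run, in frames
    last_cue_ms = None  # time of the most recent cue, if any
    for i, p in enumerate(pitch_hz):
        # Extend the run on a low-pitch voiced frame, otherwise reset it.
        run = run + 1 if (p and p <= threshold) else 0
        now_ms = i * frame_ms
        recently_cued = (last_cue_ms is not None
                         and now_ms - last_cue_ms < refractory_ms)
        if run >= frames_needed and not recently_cued:
            cues.append(i)
            last_cue_ms = now_ms
            run = 0
    return cues


# Synthetic pitch track: two low-pitch stretches, the second too soon
# after the first to trigger another back-channel.
track = [200] * 50 + [100] * 15 + [200] * 20 + [100] * 15
print(backchannel_cues(track))
```

The refractory period keeps the system from producing back-channels in rapid succession, which would sound unnatural regardless of how the cue itself is defined.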
We are recording human-human dialogs in controlled domains,
analyzing the prosodic and contextual cues that humans use, and
seeking to interpret these cues as expressing pragmatic
dimensions of the interaction. The result will be a model of
real-time interpersonal interaction as manifested in spoken dialog.
This model will be useful for the development of more usable systems
for voice access to information. The findings may also support the
construction of spoken dialog systems for more challenging dialog
types, such as teaching, advising and selling.
We welcome collaboration, from students and others, and have funds to support this.
It also ties to the idea of reactive systems.
In human language there seems to be a direct link between perception
and action for back-channel feedback and other typically non-lexical utterances. This
suggests that Brooks' approach to the study
and synthesis of physical behavior is relevant also for social
behavior.
For spoken language systems, this implies that real-time responsiveness is a priority,
as indeed it is also for the scientific study of language.
This argument is spelled out in Responsiveness in Dialog and Priorities for Language Research.
Recently many researchers have begun to pay attention to these phenomena,
as witnessed by the Special Session on the Prosody of Turn-Taking and Dialog Acts at Interspeech 2006 (link), and by the International Workshop on Cross-cultural and Culture-specific Aspects of Conversational Backchannels and Feedback, also in 2006 (link).
Topic 2: The Meanings of Prosody
Prosody can reveal the user's intention, attitude, and feelings, and systems
can exploit this information.
Topic 3: The Timing of Back-channel Feedback
Current Activities
Current spoken dialog systems are generally not pleasant to interact
with. While human interlocutors can deftly negotiate and control
pace, and smoothly signal understanding, control intentions, attitude,
etc., most dialog systems deal poorly, if at all, with these
dimensions of interaction. Lacking this, dialogs tend to be stilted,
awkward and frustrating, tend to demand careful attention, and tend to
be time-inefficient. To address these problems, this research program
seeks to develop and evaluate techniques which allow dialog systems to
interpret and generate non-verbal and other indications of attitude,
feeling, etc., thereby improving these real-time aspects of system
usability.
General Background
This line of work is enabled by recent advances in computing power
and in speech recognition, but it also has deep roots (historical perspective).