The Responsive Systems Project

We want to build systems that are satisfying to talk to, that will feel attentive, supportive and responsive to users.
To do this we need to model `real-time social skills', primarily the ability of a person to infer the other's needs, intentions, and feelings at the sub-second level. This isn't mind-reading; it just requires sensitivity to the non-verbal cues that speakers produce unconsciously.
The main challenge in building systems to do this is that of discovering the cues and rules that people use, since the details of interaction at this level are below conscious attention, and far from trivial to uncover. Indeed, most work in speech systems engineering, as in linguistics, ignores these issues. Thus the state of the art is, approximately, the sort of robotic formal interaction common in androids in science fiction movies, where the two conversants produce complete sentences and take turns rigidly.

So far we have built systems which produce back-channels (uh-huh etc.) with natural timing, which choose appropriate acknowledgements (right, yeah, good, etc.) based on the user's ephemeral emotions, which pace an explanation adaptively to the user's needs, and which control a simulated spaceship in response to the prosody of the user's advice. We are currently analyzing more phenomena and designing spoken dialog systems capable of other `sensitive' and natural interactions.


Topic 1: Non-Lexical Utterances

There are many sounds in conversation that are not words (see table). What are all these things? Why are there so many? What do they mean?

[clear-throat]  ai     hh-aaaah  iiyeah  okay        nuuuuu     ukay         uam          uumm       yeahh
[click]         am     hhh       m-hm    okay-hh     nyaa-haao  um           uh           uun        yeahuuh
[click]neeu     ao     hhh-uuuh  mm      ooa         nyeah      um-hm-uh-hm  uh-hn        uuuh       yegh
[click]nuu      aoo    hhn       mm-hm   ookay       o-w        umm          uh-hn-uh-hn  uuuuuuu    yeh-yeah
[click]ohh      aum    hmm       mm-mm   oooh        oa         ummum        uh-huh       wow        yei
[click]yeah     eah    hmmmmm    mmm     ooooh       oh         unkay        uh-mm        yah-yeah   yo
[inhale]        ehh    hn        myeah   oop-ep-oop  oh-eh      unununu      uh-uh        ye         yyeah
achh            h-nmm  hn-hn     nn-hn   u-kay       oh-kay     uu           uh-uhmmm     yeah
ah              haah   huh       nn-nnn  u-uh        oh-okay    uuh          uhh          yeah-okay
ahh             hh     i         nu      u-uun       oh-yeah    uum          uhhh         yeah-yeah

It seems that these items are in part compositional, in that each component sound brings a corresponding component of meaning.

Examples illustrating how each sound bears the same meaning across different contexts appear at the Conversational Grunts Homepage.

The model itself is described in Non-Lexical Conversational Sounds in American English. Nigel Ward. Pragmatics and Cognition, 14:1 (2006), 113-184. abstract, draft pdf, draft html, audio samples

Interestingly, sound-symbolism appears to be common in non-lexical utterances in Japanese also, as reported in The Relationship between Sound and Meaning in Japanese Back-channel Grunts. Nigel Ward. 4th Meeting of the (Japanese) Association for Natural Language Processing. 1998. (pdf)

Recently we have found experimental evidence for the existence of sound-symbolism: Nasalization in Japanese Back-Channels bears Meaning, Nigel Ward and Masafumi Okamoto. International Congress of the Phonetic Sciences, 2003 final (pdf)

Regardless of how one chooses to model these items, labeling them consistently is a challenge. A simple set of Phonetic Labeling Guidelines is, however, almost always adequate. The issues involved in labeling are discussed further in: Issues in the Transcription of English Conversational Grunts. Nigel Ward. 1st SIGdial Workshop on Discourse and Dialogue. ACL. 2000. (abstract, postscript, pdf)

Getting multiple labelers to be able to identify the pragmatic functions of these items consistently is a challenge. Currently our best tagset is described in the Pragmatic Function Labeling Guidelines.

Performance on many practical tasks suffers because non-lexical items present problems for speech recognition, speech synthesis, and dialog management, so the need for better models is a real one: The Challenge of Non-lexical Speech Sounds. Nigel Ward. International Conference on Spoken Language Processing, 2000 (abstract, postscript, pdf)


Topic 2: The Meanings of Prosody

Prosody can reveal the user's intention, attitude, and feelings, and systems can exploit this information.

For example, the automatic number-giving that comes at the end of directory assistance calls is at a fixed rate: too slow for some people, and too fast for others. The rate can instead be adapted automatically based on the user's speaking rate and response latency. Automatic User-Adaptive Speaking Rate Selection for Information Delivery. Nigel Ward and Satoshi Nakagawa. International Conference on Spoken Language Processing 2002. (pdf)
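As a rough illustration of rate adaptation from these two cues, here is a minimal sketch. The function name, the linear form, the units (syllables per second, seconds), and every default value are assumptions made for this example; they are not taken from the paper.

```python
def choose_speaking_rate(user_rate, response_latency,
                         base_rate=5.0, min_rate=3.0, max_rate=7.0,
                         latency_ref=0.5, rate_weight=0.5, latency_weight=1.0):
    """Pick a system speaking rate in syllables per second.

    Hypothetical sketch: the linear combination and all defaults are
    illustrative assumptions, not values reported in the cited work.
    """
    # A user who speaks faster than the base rate nudges the system rate up,
    # and one who speaks slower nudges it down.
    rate = base_rate + rate_weight * (user_rate - base_rate)
    # A user who takes longer than the reference latency to respond
    # nudges the system rate down.
    rate -= latency_weight * (response_latency - latency_ref)
    # Clamp to a range the synthesizer is assumed to handle acceptably.
    return max(min_rate, min(max_rate, rate))


# A fast, prompt user gets a faster delivery than a slow, hesitant one.
fast_user = choose_speaking_rate(user_rate=7.0, response_latency=0.5)
slow_user = choose_speaking_rate(user_rate=3.0, response_latency=1.5)
```

In a real system the weights would presumably be fit to user preference data rather than set by hand.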

Sensitive choice of acknowledgements based on the context and the user's prosody can make a tutorial system seem more supportive. This work is described concisely in Responding to Subtle, Fleeting Changes in the User's Internal State. Wataru Tsukahara and Nigel Ward. CHI: Conference on Human Factors in Computer Systems. ACM 2001. (abstract), pdf (draft), and a full description appears in A Study in Responsiveness in Spoken Dialog, Nigel Ward and Wataru Tsukahara. International Journal of Human-Computer Studies, to appear. abstract, preprint (pdf)

Paying attention to the prosody of utterances in a cooperative task can be useful. Design for a System able to use Time-Critical Spoken Advice. Shunsuke Soeda and Nigel Ward. Fifteenth National Conference of the Japanese Society for Artificial Intelligence, 2001. (abstract). (paper).

Also, replicating Schmandt's early work, we confirmed that it is possible to pace an interaction appropriately using only the duration and pitch of the user's utterances. Pacing Spoken Directions to Suit the Listener. Tatsuya Iwase and Nigel Ward. 5th International Conference on Spoken Language Processing (ICSLP-98).


Topic 3: The Timing of Back-channel Feedback

We have built ``Aizula'', a system that can produce back-channel feedback, such as uh-huh and mm, as well as a human, in some cases. A Responsive Dialog System. Nigel Ward and Wataru Tsukahara. Machine Conversations, edited by Yorick Wilks. pp 169-174. Kluwer, 1999. (abstract), (15K draft)

The key rule has been used in the MIT Media Lab's GrandChair system and tested in the Virtual Human at USC-ICT (link).

To discover the rules governing back-channel behavior we had to build a system for analyzing dialog phenomena, ``didi''.

A full explanation appears as Prosodic Features which Cue Back-channel Responses in English and Japanese. Nigel Ward and Wataru Tsukahara. Journal of Pragmatics, 23, pp 1177--1207, 2000. (abstract) (658K draft), (134K gzip'd draft).
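To make the idea of a prosodic cue for back-channel timing concrete, here is a minimal offline sketch of one plausible rule: respond after a sustained region of low pitch that follows a stretch of speech, unless a back-channel was produced recently. The function name, the frame-based input format, and all threshold defaults are assumptions for illustration and should be checked against the paper; frames without detected pitch are treated as non-speech, which is a simplification.

```python
import math


def backchannel_times(pitch, frame_ms=10, percentile=26, low_ms=110,
                      speech_ms=700, refractory_ms=800, delay_ms=700):
    """Return the times (ms) at which a back-channel would be produced.

    pitch: per-frame pitch estimates in Hz, with None for frames where
    no pitch was detected (treated here as non-speech).
    All thresholds are illustrative assumptions.
    """
    voiced = sorted(p for p in pitch if p is not None)
    if not voiced:
        return []
    # Nearest-rank percentile of this speaker's pitch values.
    rank = max(1, math.ceil(percentile / 100 * len(voiced)))
    threshold = voiced[rank - 1]

    outputs = []
    speech_run = 0   # consecutive frames with detected pitch
    low_run = 0      # consecutive frames with pitch below the threshold
    fired = False    # whether the current low region already triggered
    for i, p in enumerate(pitch):
        now = (i + 1) * frame_ms  # time at the end of this frame
        speech_run = speech_run + 1 if p is not None else 0
        if p is not None and p < threshold:
            low_run += 1
        else:
            low_run, fired = 0, False
        # No back-channel may have been produced too recently.
        rested = not outputs or now - outputs[-1] >= refractory_ms
        if (not fired and rested
                and low_run * frame_ms >= low_ms
                and speech_run * frame_ms >= speech_ms):
            outputs.append(now + delay_ms)  # respond after a short wait
            fired = True
    return outputs


# 800 ms of speech at 200 Hz, a 150 ms low-pitch region, then silence.
example = [200.0] * 80 + [100.0] * 15 + [None] * 100
times = backchannel_times(example)
```

Computing the percentile over the whole recording keeps the sketch short; a live system would have to estimate the speaker's pitch range incrementally.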

Recently we have discovered rules governing back-channeling in Arabic and Spanish.


Current Activities

We welcome collaboration, from students and others, and have funds to support this.


General Background

This line of work is enabled by recent advances in computing power and in speech recognition, but it also has deep roots (historical perspective).

It also ties to the idea of reactive systems. In human language there seems to be a direct link between perception and action for back-channel feedback and other typically non-lexical utterances. This suggests that Brooks' approach to the study and synthesis of physical behavior is relevant also for social behavior. For spoken language systems, this implies that real-time responsiveness is a priority, as indeed it is also for the scientific study of language. This argument is spelled out in Responsiveness in Dialog and Priorities for Language Research. Nigel Ward. Systems and Cybernetics, Special Issue on Embodied Artificial Intelligence, 28, pp521-533, 1997. (gzipped ps)

Recently many researchers have begun to pay attention to these phenomena, as witnessed by the Special Session on the Prosody of Turn-Taking and Dialog Acts at Interspeech 2006 (link), and by the International Workshop on Cross-cultural and Culture-specific Aspects of Conversational Backchannels and Feedback, also in 2006 (link).


Nigel's Home Page