Themes in Dialog Systems Usability

Although speech recognition technology has enabled companies to service customers without the expense and delay associated with live operators, these systems can be difficult to use and are often disliked. The reasons for this are a matter of some debate, but some causes are clear.

Root Causes of Lost Time and User Stress in a Simple Dialog System. Nigel Ward, Anais Rivera, Karen Ward, and David Novick. Interspeech 2005. abstract etc.

Responsiveness and Turn-Taking

One critical issue is the proper handling of turn-taking, to ensure that the system responds swiftly, but without talking over the user or denying him or her time to think.

In unstructured dialogs, a salient form of swift turn-management is back-channeling. Listeners typically show interest, attention, etc. swiftly and unobtrusively using back-channels, small utterances such as uh-huh and mm. Taking this as a testbed for the study of better turn-taking, we have found that proper back-channeling can be achieved, to some extent, by using the information in the speaker's pitch: places where the listener is especially welcome to back-channel are marked by prosodic features of the speaker's utterances.

Back-Channel Facts Page
Prosodic Features which Cue Back-channel Responses in English and Japanese. Nigel Ward and Wataru Tsukahara. Journal of Pragmatics, 32, pp 1177--1207, 2000. (abstract) (658K draft), (134K gzip'd draft).

By building systems which detect and respond to these prosodic features, it is possible to react and respond swiftly, at a human pace. We first demonstrated this ability in 1995; since then the same algorithm has been reimplemented and used in the MIT Media Lab's GrandChair system and in the Virtual Human and Rapport Agent projects at USC-ICT.
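As a rough illustration of how a prosody-driven back-channel trigger might be structured, here is a minimal sketch. It is not the implementation referred to above: the function names, the 10 ms frame step, and all numeric thresholds are assumptions chosen for illustration, and the pitch track is assumed to come from some external pitch tracker (None marks unvoiced frames). The sketch looks for a sustained region of low pitch, after enough speech, and schedules a back-channel a short time later, unless one was produced too recently.

```python
import math
from typing import List, Optional

FRAME_MS = 10  # assumed frame step of the (hypothetical) pitch tracker


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def backchannel_times(
    pitch: List[Optional[float]],
    low_pct: float = 26,       # assumed: "low" means below this percentile
    min_low_ms: int = 110,     # assumed: required length of the low region
    min_speech_ms: int = 700,  # assumed: voiced speech needed before a cue
    refractory_ms: int = 800,  # assumed: minimum gap between back-channels
    delay_ms: int = 700,       # assumed: wait between cue and response
) -> List[int]:
    """Return times (ms) at which a back-channel would be produced."""
    voiced = [f0 for f0 in pitch if f0 is not None]
    if not voiced:
        return []
    # Offline simplification: the threshold uses the whole track.
    # A live system would need a running estimate instead.
    threshold = percentile(voiced, low_pct)
    min_low_frames = math.ceil(min_low_ms / FRAME_MS)

    times: List[int] = []
    low_run = 0        # consecutive low-pitch frames
    voiced_frames = 0  # voiced frames seen so far
    for i, f0 in enumerate(pitch):
        if f0 is None:
            low_run = 0
            continue
        voiced_frames += 1
        low_run = low_run + 1 if f0 < threshold else 0
        # Fire once per low region, at the moment it becomes long enough.
        if low_run != min_low_frames:
            continue
        if voiced_frames * FRAME_MS < min_speech_ms:
            continue
        out_ms = (i + 1) * FRAME_MS + delay_ms
        if not times or out_ms - times[-1] >= refractory_ms:
            times.append(out_ms)
    return times
```

For example, calling backchannel_times on a track of 800 ms of higher pitch followed by 150 ms of clearly lower pitch schedules one response; the same low region arriving after only 200 ms of speech schedules none. Because the rule depends only on pitch and timing, and not on recognizing words, a response can be produced without waiting for speech recognition to finish.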

The next step is to integrate such prosody-based reactive responses with semantically appropriate content, something that to date has been done only in a very limited domain (Fujie, Interspeech 2005). There are both architectural and low-level issues in building such integrated dialog systems; they are being addressed here and also by researchers at KTH, CMU, Waseda, USC, and DFKI.

We also need to quantify more aspects of turn-taking, in more domains, in more languages. This requires basic research --- empirical corpus linguistics. Recently we have discovered some of the prosodic cues governing behavior in Arabic and in Spanish, but much remains to be done.

In particular, we need to develop methods and tools for analyzing and exploiting dialog phenomena. Fortunately many researchers have begun to work on these phenomena, as seen in the papers presented at the Special Session on the Prosody of Turn-Taking and Dialog Acts at Interspeech 2006.

The implications for human communication, especially cross-cultural communication, are also very significant (for example, for those learning to interact in Arabic or otherwise likely to be hearing Arabic dialogs).

Sensitivity to Nonverbal Signals

We want to build systems that are satisfying to talk to, that will feel attentive, supportive and responsive to users. To do this we need to model `real-time social skills', primarily the ability of a person to infer the other's needs, intentions, and feelings at the sub-second level. This isn't mind-reading; it just requires sensitivity to the non-verbal cues people produce unconsciously while speaking.

One challenge in building such systems is the preliminary step of discovering the cues and rules that people use. This is hard because the details of interaction at this level are below conscious attention, and far from trivial to uncover. Indeed, most work in speech systems engineering, as in linguistics, ignores these issues. Thus the state of the art is, approximately, the sort of robotic formal interaction common in androids in science fiction movies, where the two conversants produce complete sentences and take turns rigidly. Such dialogs are awkward and frustrating, tend to demand careful attention, and tend to be time-inefficient.

So far we have built systems which choose appropriate acknowledgments (right, yeah, good, etc.) based on the user's ephemeral emotions, and have shown their value experimentally, in both Japanese and English.

A Study in Responsiveness in Spoken Dialog, Nigel Ward and Wataru Tsukahara. International Journal of Human-Computer Studies, 59 (6), pp 959-981, 2003. abstract, preprint pdf

A Combined Method for Discovering Short-Term Affect-Based Response Rules for Spoken Tutorial Dialog. Tasha K. Hollingsed and Nigel G. Ward. Workshop on Speech and Language Technology in Education (SLaTE) 2007. abstract and download

We are currently analyzing more phenomena and designing spoken dialog systems capable of other `sensitive' and natural interactions. This work is being supported by the NSF.

"Modeling Real-Time Interpersonal Interaction in Spoken Communication", a project supported by the National Science Foundation (Award IIS-0415150, 2004-2008, to Nigel Ward, David Novick and Karen Ward). The aim is to develop and evaluate techniques which allow dialog systems to interpret and generate non-verbal and other indications of attitude, feeling, etc., adding these dimensions to human-computer interaction and thereby improving these real-time aspects of system usability. The methods include recording human-human dialogs in controlled domains, analyzing the prosodic and contextual cues that humans use, interpreting these cues as expressing pragmatic dimensions of the interaction, building systems to use these cues, and evaluating their performance with actual users.

Specific ongoing projects are building a Persuasive Dialog System and determining how to generate Prosodically Appropriate Acknowledgments.


See also Publications and Projects.
