Although speech recognition technology has enabled companies to serve customers without the expense and delay associated with live operators, these systems can be difficult to use and are often disliked. The reasons for this include at least two missing abilities (Ward, Rivera et al., 2005).
One critical issue is the proper handling of turn-taking, to ensure that the system responds swiftly, but without talking over the user or denying him or her time to think.
In unstructured dialogs, a salient form of swift turn-management is back-channeling (Back-Channel Facts Page). Listeners typically show interest, attention, etc. swiftly and unobtrusively using back-channels, small utterances such as uh-huh and mm. Taking this as a testbed for the study of better turn-taking, we have found that proper back-channeling can be achieved, to some extent, by using the information in the speaker's pitch: places where the listener is especially welcome to back-channel are marked by prosodic features of the speaker's utterances (Ward & Tsukahara, 2000).
By building systems which detect and respond to this prosodic feature, it is possible to respond swiftly, at a human pace. We first demonstrated this ability in 1995; since then the same algorithm has been reimplemented in the MIT Media Lab's GrandChair system, in the KTH Listening System, and in the Virtual Human and Rapport Agent projects at USC-ICT (see pictures). Videos of the latter can be seen at Jon Gratch's homepage, at Chatbots.org, or in the CHI 2010 proceedings (click on the Quicktime/Mov link and scroll about 30% in).
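The cue involved can be made concrete. For English, the rule reported in Ward & Tsukahara (2000) predicts a back-channel opportunity upon a region of pitch below the 26th percentile sustained for at least 110 ms, coming after at least 700 ms of speech, provided no back-channel was produced in the preceding 800 ms, with the response following after a further 700 ms. The frame-based sketch below uses those published parameters, but the implementation itself, including the use of consecutive voiced frames as a crude proxy for "in speech", is an illustrative reconstruction, not the original code.

```python
# Illustrative sketch of the low-pitch back-channel cue detector, with
# timing parameters from Ward & Tsukahara (2000). Input is a pitch track
# sampled every 10 ms, with 0 marking unvoiced frames.

FRAME_MS = 10  # one pitch value per 10 ms frame

def backchannel_times(pitch, percentile=26):
    """Return the times (in ms) at which a back-channel is cued."""
    voiced = sorted(p for p in pitch if p > 0)
    if not voiced:
        return []
    # Pitch threshold: the speaker's 26th-percentile pitch.
    threshold = voiced[int(len(voiced) * percentile / 100)]

    cues = []
    low_run = 0       # consecutive frames of low pitch
    speech_run = 0    # consecutive voiced frames (crude proxy for "in speech")
    last_bc = -10**9  # time of the most recent back-channel cue

    for i, p in enumerate(pitch):
        t = i * FRAME_MS
        speech_run = speech_run + 1 if p > 0 else 0
        low_run = low_run + 1 if 0 < p < threshold else 0
        if (low_run * FRAME_MS >= 110              # low pitch for >= 110 ms
                and speech_run * FRAME_MS >= 700   # after >= 700 ms of speech
                and t - last_bc >= 800):           # none in the last 800 ms
            cues.append(t + 700)                   # respond after a 700 ms wait
            last_bc = t
            low_run = 0
    return cues
```

For example, a second of speech at 200 Hz followed by a drop to 80 Hz cues a single back-channel shortly after the low-pitch region begins; a flat pitch track cues none.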
The next step is to integrate such prosody-based reactive responses with semantically appropriate content, something that to date has been done only in very limited domains (Fujie et al., 2005; Raux, various; Baumann, Schlangen et al., various). There are both architectural and low-level issues in building such integrated dialog systems; they are being addressed here and also by researchers at KTH, CMU, Potsdam, USC, Waseda, SRI, and Honda.
We also need to quantify more aspects of turn-taking, in more domains, in more languages. This requires basic research: empirical corpus linguistics. Recently we have discovered some of the prosodic cues governing turn-taking behavior in Arabic, Spanish, and Chinese, but much remains to be done. Existing methods and tools for the analysis of speech and language are not well suited to tackling dialog phenomena. Work in this area is hard because the details of interaction at this level are below conscious attention and involve the tightly-coupled activities of two people at once. Recently there has been growing interest in these phenomena and problems, from a wide variety of perspectives, as seen in the papers presented at the Special Session on the Prosody of Turn-Taking and Dialog Acts at Interspeech 2006 and at more recent workshops. We are now building tools for both human-directed and semi-automated discovery methods.
The implications for human communication, especially cross-cultural communication, are also very significant (for example, for those learning to interact in Arabic or otherwise likely to be hearing Arabic dialogs).
A second critical issue is adding the interpersonal dimensions to dialog systems. We want to go beyond the state of the art, which is, approximately, the sort of robotic formal interaction common in science fiction movies, where androids or Vulcans converse unemotionally and take turns rigidly. For real users such dialogs can feel awkward, demand full attention, and tend to be time-inefficient.
Instead, we want to build systems that are satisfying to talk to: that will feel attentive, supportive and responsive to users. To do this we need to model `real-time social skills', primarily the ability of a person to infer the other's needs, intentions, and feelings at the sub-second level. This isn't mind-reading, it just requires sensitivity to the non-verbal cues people produce unconsciously while speaking.
So far we have built systems which choose appropriate acknowledgments (right, yeah, good, etc.) based on the user's ephemeral emotions, and have experimentally shown their value, both in Japanese and in English. We have also shown how to build a tutorial-type system that says good job, in 7 different ways, appropriately for the current state of the interlocutor (Ward & Escalante, 2009). More recently, we have shown how to detect the emotional colorings of user utterances, and how appropriately responding to these can lead to a sense of rapport (Acosta & Ward, 2011).
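At its simplest, the acknowledgment-choice component described above can be thought of as a mapping from a detected interlocutor state to one of several renditions of an acknowledgment, with a neutral fallback. The sketch below is purely illustrative: the state labels and phrasings are invented for exposition, and are not the categories or stimuli used in Ward & Escalante (2009) or Acosta & Ward (2011).

```python
# Purely illustrative sketch: a detected interlocutor state (however
# obtained, e.g. from prosody) selects among renditions of an
# acknowledgment. State labels and phrasings here are hypothetical.

ACKNOWLEDGMENTS = {
    "confident":  "good",         # user sounds sure: brief confirmation
    "hesitant":   "right",        # user sounds unsure: gentle reassurance
    "struggling": "that's okay",  # user sounds stuck: supportive response
}

def choose_acknowledgment(state):
    """Pick an acknowledgment appropriate to the detected state."""
    return ACKNOWLEDGMENTS.get(state, "uh-huh")  # neutral back-channel fallback
```

The real systems go further, varying not just the word chosen but also its prosodic realization to match the interlocutor's state.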
We are currently analyzing more phenomena and designing spoken dialog systems capable of other `sensitive' and natural interactions. This work has been supported by the NSF as the project Modeling Real-Time Interpersonal Interaction in Spoken Communication and by USC's Institute for Creative Technologies.
See also my Publications and Projects pages, or, for a broader perspective, the sites of the International Speech Communication Association, the Association for Voice Interaction Design, and the Special Interest Group on Discourse and Dialogue.
Last Update: February 6, 2012.
Back to Nigel Ward.