A RESPONSIVE DIALOG SYSTEM

Nigel Ward and Wataru Tsukahara

To appear in Machine Conversations (provisional title), edited by Yorick Wilks. Springer Verlag.

Mechano-Informatic Engineering, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113 Japan
+81-3-3182-2111 ext. 6282, fax: +81-3-3815-8356
nigel@sanpo.t.u-tokyo.ac.jp http://www.sanpo.t.u-tokyo.ac.jp/~nigel/
tsuka@sanpo.t.u-tokyo.ac.jp http://www.sanpo.t.u-tokyo.ac.jp/~tsuka/

Abstract: Being responsive is important in dialog. In particular, back-channel feedback is essential to human conversations. Back-channel feedback is sometimes produced without thinking, in response to simple prosodic clues. A simple implementation of this behavior produces natural-sounding responses in conversation with live human subjects.

MOTIVATION

Modeling language as people really use it is an elusive goal. Today, thanks to advances in speech recognition, there exist dialog systems capable of understanding the meaning of user input and replying with appropriate information, but there are as yet no systems which interact naturally with humans. Two problems are:

1. Priority is given to understanding and responding accurately; but in human dialog, being responsive and interactive is also important.

2. The granularity of interaction is the sentence; but in human dialog, interaction happens frequently, in real time, often with overlapping utterances.

Given that such responsiveness is important to human language use, the question arises: how do we build systems with these abilities? The obvious approach is to add these abilities to a meaning-based speech system. An alternative approach is to take these abilities as a basic foundation, and to layer meaning-based processing on top of this, subsumption-style (Ward 1997).

PHENOMENON

Back-channel feedback, also called ``listener responses'', is produced by one participant as a response that does not interfere with utterances by the other participant (Ward and Tsukahara, submitted). In American English `yeah', `mm' and `uh-huh' are typical back-channels; in Japanese `un' is most typical. Producing back-channel feedback is essential to being a good conversation partner; if it is lacking, the conversation tends to die out. Back-channel feedback is an example of ``responsiveness'', which is important in spoken dialog between humans, and probably also in human-computer systems (Johnstone et al. 1995; Ward 1997).

ANALYSIS

Many have sought the perceptual clue that tells a participant ``it's now time to produce back-channel feedback''. It has often been speculated that this clue from the speaker would be prosodic, rather than involving meaning. In search of this clue we examined the prosodic environments of back-channel feedback in corpora of natural Japanese and English conversations. Potential clues considered included pitch contours, vowel lengthening or speaking-rate slowdown, volume increase or decrease on final syllables, a low pitch point, and gross energy level changes, following suggestions in the literature, as discussed in Ward (1996) and Ward and Tsukahara (submitted). None of these appeared to correlate strongly with whether back-channel feedback was produced. There was, however, one good clue: a region of low pitch. While we have no real proof yet that this is actually a trigger for a reflex response in people, it is certainly usable in spoken dialog systems.

It is commonly thought that silence (at the end of a speaker's turn) is a major clue for back-channel feedback. This is probably true in business-like transactions between strangers, but not in more casual interactions. In the latter, the low-pitch clue accounts both for back-channel feedback produced after the speaker has paused or stopped, and for that which overlaps the speaker's continued utterance.
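To make the clue concrete, here is a minimal sketch of low-pitch-region detection. It is not our implementation: it assumes a pitch track sampled every 10 ms by some external pitch tracker, with unvoiced frames reported as None, and the function name and frame size are invented. The specific threshold and duration values are those given under RESULTS below.

    # Minimal sketch: find regions where pitch stays below the speaker's
    # Nth-percentile pitch level for at least min_ms milliseconds.
    def low_pitch_regions(pitch_track, percentile=28, min_ms=110, frame_ms=10):
        """Yield (start_ms, end_ms) spans of sufficiently long low-pitch regions.
        pitch_track: one pitch value (Hz) per frame, None when unvoiced."""
        voiced = sorted(p for p in pitch_track if p is not None)
        if not voiced:
            return
        threshold = voiced[int(len(voiced) * percentile / 100)]
        start = None
        for i, p in enumerate(pitch_track):
            if p is not None and p < threshold:
                if start is None:
                    start = i
            else:
                if start is not None and (i - start) * frame_ms >= min_ms:
                    yield (start * frame_ms, i * frame_ms)
                start = None
        if start is not None and (len(pitch_track) - start) * frame_ms >= min_ms:
            yield (start * frame_ms, len(pitch_track) * frame_ms)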
RESULTS

Using correspondence to corpus data as the criterion, we sought the rule which best models human behavior. For Japanese, the best we have found so far is as follows. Upon detection of
--- a region of pitch less than the 28th-percentile pitch level,
--- continuing for at least 110 milliseconds,
--- coming after at least 700 ms of speech,
--- provided you have not output back-channel feedback within the preceding 1.0 seconds,
then 350 ms later you should produce back-channel feedback.

Testing the predictions of this rule against the corpus of human conversations gives a coverage of 50% (half of the back-channels were predicted) at an accuracy of 34% (a third of the predictions were correct), over all speakers and all dialog types. Performance was better for friendly, attentive listeners and for conversation portions that involved narrative or explanation. About half of the false predictions seem to be due to inter-speaker differences; thus the rule does much better when judged as a model of a specific speaker (Ward and Tsukahara, submitted).

For English speakers, the best prediction rule has somewhat different parameters. Upon detection of
--- a region of pitch less than the 23rd-percentile pitch level,
--- continuing for at least 120 milliseconds,
--- coming after at least 700 ms of speech,
--- provided you have not output back-channel feedback within the preceding 0.9 seconds,
then 700 ms later you should produce back-channel feedback.

For the English corpus, this achieved a coverage of 39% (141/360) at an accuracy of 19% (141/735).
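The full rule can be sketched in code, building on the low_pitch_regions() sketch above. The numeric parameters are those stated in the text; the 10 ms frame size, the way "700 ms of speech" is approximated (total voiced time so far), the way the refractory period is measured, and all names are our assumptions, not the authors' implementation.

    # Sketch of the complete prediction rule, for both parameter sets.
    JAPANESE = dict(percentile=28, min_ms=110, min_speech_ms=700,
                    refractory_ms=1000, delay_ms=350)
    ENGLISH = dict(percentile=23, min_ms=120, min_speech_ms=700,
                   refractory_ms=900, delay_ms=700)

    def backchannel_times(pitch_track, frame_ms=10, percentile=28, min_ms=110,
                          min_speech_ms=700, refractory_ms=1000, delay_ms=350):
        """Return the times (ms) at which the rule predicts back-channel feedback."""
        def speech_before(t_ms):
            # crude approximation: total voiced time observed up to t_ms
            n = t_ms // frame_ms
            return sum(frame_ms for p in pitch_track[:n] if p is not None)

        outputs, last_trigger = [], -10**9
        for start_ms, end_ms in low_pitch_regions(pitch_track, percentile,
                                                  min_ms, frame_ms):
            trigger = start_ms + min_ms       # all rule conditions first hold here
            if (speech_before(trigger) >= min_speech_ms
                    and trigger - last_trigger >= refractory_ms):
                outputs.append(trigger + delay_ms)   # respond delay_ms later
                last_trigger = trigger
        return outputs

For example, backchannel_times(track, **ENGLISH) applies the English parameters to a pitch track.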
EXPERIMENTS

We built a system to find out how well the above rules would perform in live conversation. There were two critical issues.

One issue was which back-channels to produce. For Japanese, it turned out to be acceptable to always produce `un', the most common and most neutral back-channel, with falling pitch. For English, `uh-huh' and `mm' were acceptable. Since always producing the same token sounded mechanical, we used two in alternation, or three with random selection (see the sketch at the end of this section).

Another issue was how to get people to try to interact naturally with the system. The only solution was to fool them into thinking they were interacting with a person, so we used a human decoy to jump-start the conversation. For the initial experiments we used a partition, so that the subject couldn't see when it was the system that was responding; later we ran the system over the telephone. The outputs of the system were recordings of back-channels produced by the decoy, not synthesized speech. To make it impossible for subjects to distinguish between the decoy's live voice and the system's contributions, we distorted both slightly by over-amplifying them.

We hypothesized that back-channel feedback produced according to the rule would sound natural, and would permit the conversation to proceed normally. Conversely, we hypothesized that inappropriate back-channel feedback would seem unfriendly or unnatural, be annoying, or kill the conversation. We found that back-channel feedback in response to low-pitch regions did indeed sound natural.

In one run of twenty-odd Japanese subjects, only two suspected that the back-channels were artificial; the vast majority were surprised when told that the decoy had handed the conversation over to the computer. Unfortunately this result was not significant, since even in the control experiments, with back-channels produced at random, most subjects did not notice anything odd or different, no matter how hard we probed in post-conversation interviews. Indeed, in many cases they hadn't even noticed whether back-channels were present. However, third-party judges listening to the conversations generally could distinguish the low-pitch-based back-channels from the randomly produced ones; the former sounded natural and the latter sounded odd, with clear cases of inappropriate back-channels and of inappropriate silences when a back-channel was called for.

We surmise that our subjects were generally so busy speaking that they had only minimal attention to pay to back-channel feedback. Also, to the extent that they did notice back-channel feedback, there is probably a human tendency to be generous and uncritical in interpreting a dialog partner's responses and response patterns. It is of course possible that more sensitive metrics, either subjective, such as impressions of how interested or friendly the listener was, or objective, such as average utterance length, would reveal a difference between the effects of appropriate and inappropriate back-channel feedback on the speaker. However, this will not be easy, since the effects of manipulating back-channel feedback are complex and context-dependent (Siegman 1976).

We also tried a different experimental procedure, in which the subject knew that his partner might be a machine, and had to guess, in each trial, whether it was a human or a machine. We found here that slight differences in sound quality between the live and pre-recorded back-channels, more than timing, were what people were sensitive to. Also, the problem of factoring out the effects of variations in topic across runs made it difficult to get meaningful results.

It is interesting to note that people in general are not prepared to chat naturally with a machine. In demonstrations, where subjects knew that their conversation partner was a machine, we sometimes found that the subject, after putting on the microphone, would challenge the system to respond, with no result, and then turn to the experimenter and make some comment, only to have the system then chime in with a perfectly appropriate back-channel.

In experiments with a couple dozen subjects in English, run over the telephone as part of the Elsnet Speech System Olympics at EuroSpeech 97, we found two main factors affecting the success of the system. The first was the ability to get the other person talking; this depended on whether the subject was talkative and on the decoy's success at putting them at ease and leading them on to a suitable topic of conversation. The other factor was native language; the system generally worked poorly or not at all for speakers whose native language was not a Germanic one.
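On the output side, the live system need only cycle through a small set of pre-recorded tokens and play one whenever the rule fires. A minimal sketch under stated assumptions: the play function is a hypothetical hook into whatever audio framework is used, and the filenames are invented.

    import itertools
    import threading

    def make_responder(play, tokens=("un-a.wav", "un-b.wav"), delay_ms=350):
        """Return a zero-argument trigger: each call schedules playback of
        the next pre-recorded token (two tokens in alternation, as in the
        experiments), delay_ms after the rule fires."""
        cycle = itertools.cycle(tokens)
        def trigger():
            threading.Timer(delay_ms / 1000, play, [next(cycle)]).start()
        return trigger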
Here are transcripts of two sessions. D is the decoy, S is the subject, and C is the computer. Computer responses are also decorated with asterisks.

Example 1:
D: Okay, so let's see, I'll hit return. Say something.
S: Nani ga iimasu. [in Japanese: roughly, ``What shall I say?'']
D: Okay, great, let's just speak English, because all I want is your pitch range.
S: Yeah that's fine.
D: And let's talk for a minute and um.
S: Okay, because I think my pitch range is . . .
S: Shall I keep on talking?
D: Yeah, please. So how's the weather, in England? Is it better than here?
S: It's certainly cooler, that's for sure.
C: *mm*
S: Probably it's better, I don't know. I saw the um forecast on the TV last night,
C: *mm*
S: and it's something like 15 centigrade, which is on the cooler side I think, isn't it.
C: *mm*
S: I'm not sure what it is here for us
C: *mm*
S: 59, 58 something I think, about that. Fairly cool.
C: *mm*
S: But um, over here I find, I have to pace myself carefully, because I start sweating,
C: *mm*
S: before I get tired.
C: *mm*
S: [laughs]
D: Well that's the humidity more than the, uh temperature, I think.
S: Yeah, that's right, yes . . .
D: Okay, so we have your pitch down. So, um that was, like a normal English conversation?
S: I think so, yeah.
D: Nothing strange about it.
S: No. Except the fact we can't see each other, but that's nothing. And I'm being videotaped of course. [laughs]
D: Okay, great. So in fact, in about 10 places, it was the system that said `mm'.
S: Oh right. I didn't notice, I didn't notice . . .

The next transcript is of a failure. This was perhaps because the subject was not a native speaker, and perhaps because she was suspicious from the start.

Example 2:
D: Okay, so. So tell me, is Rick Alterman still there? I guess.
S: Is who?
D: Rick Alterman.
S: Oh yeah definitely, because he's in the department.
C: *mm*
D: So what's he up to?
S: It's really hard to say
C: *mm*
S: I don't think I
C: *mm*
S: can really quite define his work,
C: *mm*
S: but um,
C: *mm*
S: you know he's. Why are you doing that?
C: *mm*
S: [laughs] What's that sound?
C: *mm*

SIGNIFICANCE

We have demonstrated a system that can keep up its end of a conversation without doing speech recognition or understanding. More generally, the impression of naturalness in spoken conversation can be achieved, in large part, by simply following the prosodic and gaze-given cues of the interlocutor, as seen also in some other work (Schmandt 1994; Thorisson 1994; Iwase 1998). We are planning to look for more such cues. So far we have tentatively identified some of the factors that affect the choice of which word or grunt to produce as back-channel feedback (Ward 1998; Tsukahara 1998). Ultimately we plan to combine simple reflex-type responsiveness with recognition and understanding. Our near-term aim is to build a system that will interact truly naturally with people in a simple verbal game.
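One possible shape for such a combination, in the subsumption style suggested under MOTIVATION, is a stack of layers in which reflex-type responsiveness always runs and meaning-based processing, once it exists, can override it. The following is purely illustrative: the class and function names are invented, and this is not a design described in the paper.

    class ReflexLayer:
        """Base layer: back-channel feedback from prosodic cues alone."""
        def react(self, frame):
            # would run the low-pitch rule sketched earlier and return a
            # token such as 'un' when it fires; stubbed out here
            return None

    class UnderstandingLayer:
        """Higher layer: recognition and understanding (future work)."""
        def react(self, frame):
            return None      # no meaning-based response available yet

    def respond(layers, frame):
        # Higher layers subsume lower ones: the first layer (listed top
        # first) that produces a response wins; otherwise stay silent.
        for layer in layers:
            response = layer.react(frame)
            if response is not None:
                return response
        return None

    # e.g. respond([UnderstandingLayer(), ReflexLayer()], frame)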
ACKNOWLEDGEMENTS

We thank the many students who have helped with this work, and the Sound Technology Promotion Foundation, the Hayao Nakayama Foundation, and the Japanese Ministry of Education for support.

REFERENCES

Tatsuya Iwase. Yuza ni Awaseta Taiwa Peesu no Chosetsu (Adjusting the Pace of Conversation to Suit the User). Proceedings of the 4th Annual Meeting of the (Japanese) Association for Natural Language Processing, 1998.

Anne Johnstone, Umesh Berry, Tina Nguyen and Alan Asper. There was a Long Pause: Influencing Turn-taking Behaviour in Human-Human and Human-Computer Dialogs. International Journal of Human-Computer Studies, 42, pp. 383--411, 1995.

Chris Schmandt. Computers and Communication. Van Nostrand Reinhold, New York, 1994.

Aron W. Siegman. Do Noncontingent Interviewer Mm-hmms Facilitate Interviewee Productivity? Journal of Consulting and Clinical Psychology, 44, pp. 171--182, 1976.

Kristinn R. Thorisson. Face-to-Face Communication with Computer Agents. Working Notes, AAAI Spring Symposium on Believable Agents, pp. 86--90, 1994.

Wataru Tsukahara. Purosodi oyobi Bunmyaku Joho o Mochiita Ooto no Sentaku/Chosetsu no Kokoromi (Selecting and Adapting Confirmations in Response to Prosodic Indications and Contextual Factors). Proceedings of the 4th Annual Meeting of the (Japanese) Association for Natural Language Processing, 1998.

Nigel Ward. Using Prosodic Clues to Decide When to Produce Back-channel Utterances. International Conference on Spoken Language Processing, pp. 1728--1731, 1996.

Nigel Ward. Responsiveness in Dialog and Priorities for Language Research. Systems and Cybernetics, 28, pp. 521--533, 1997.

Nigel Ward. The Relationship between Sound and Meaning in Japanese Back-channel Grunts. Proceedings of the 4th Annual Meeting of the (Japanese) Association for Natural Language Processing, 1998.

Nigel Ward and Wataru Tsukahara. Production of Back-Channel Feedback in Japanese may involve a Prosodically Triggered Reflex. Language, submitted.